* [PATCH 0/2] two fixes for md
From: Guoqing Jiang @ 2022-05-05  8:16 UTC (permalink / raw)
To: song; +Cc: buczek, linux-raid, Guoqing Jiang

Hi,

The first patch addresses the issue reported by Donald in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t

The second patch was added per the discussion of the v2 patch.

Thanks,
Guoqing

Guoqing Jiang (2):
  md: don't unregister sync_thread with reconfig_mutex held
  md: protect md_unregister_thread from reentrancy

 drivers/md/md.c | 24 ++++++++++++++++++------
 drivers/md/md.h |  5 +++++
 2 files changed, 23 insertions(+), 6 deletions(-)

-- 
2.31.1

^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held
From: Guoqing Jiang @ 2022-05-05  8:16 UTC
To: song; +Cc: buczek, linux-raid, Guoqing Jiang, Guoqing Jiang

From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

Unregistering the sync_thread doesn't need to hold reconfig_mutex since
it doesn't reconfigure the array, and holding it can deadlock raid5 as
follows:

1. Process A tries to reap the sync thread with reconfig_mutex held
   after echoing "idle" to sync_action.
2. The raid5 sync thread blocks because there are too many active
   stripes.
3. SB_CHANGE_PENDING is set (because of write IO coming from the upper
   layer), which prevents the number of active stripes from decreasing.
4. SB_CHANGE_PENDING can't be cleared since md_check_recovery is not
   able to take reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t

Let's call md_unregister_thread between mddev_unlock and mddev_lock if
reconfig_mutex is held, and introduce mddev_is_locked accordingly.

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
v2: https://lore.kernel.org/all/1613177399-22024-1-git-send-email-guoqing.jiang@cloud.ionos.com/#t
v1: https://lore.kernel.org/all/1612923676-18294-1-git-send-email-guoqing.jiang@cloud.ionos.com/

 drivers/md/md.c | 9 ++++++++-
 drivers/md/md.h | 5 +++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 309b3af906ad..a70e7f0f9268 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9432,10 +9432,17 @@ void md_reap_sync_thread(struct mddev *mddev)
 {
 	struct md_rdev *rdev;
 	sector_t old_dev_sectors = mddev->dev_sectors;
-	bool is_reshaped = false;
+	bool is_reshaped = false, is_locked = false;
 
+	if (mddev_is_locked(mddev)) {
+		is_locked = true;
+		mddev_unlock(mddev);
+	}
 	/* resync has finished, collect result */
 	md_unregister_thread(&mddev->sync_thread);
+	if (is_locked)
+		mddev_lock(mddev);
+
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
 	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
 	    mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 6ac283864533..af6f3978b62b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -549,6 +549,11 @@ static inline int mddev_trylock(struct mddev *mddev)
 }
 extern void mddev_unlock(struct mddev *mddev);
 
+static inline int mddev_is_locked(struct mddev *mddev)
+{
+	return mutex_is_locked(&mddev->reconfig_mutex);
+}
+
 static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
 {
 	atomic_add(nr_sectors, &bdev->bd_disk->sync_io);
-- 
2.31.1
* Re: [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held
From: kernel test robot @ 2022-05-05 14:02 UTC
To: Guoqing Jiang, song; +Cc: kbuild-all, buczek, linux-raid, Guoqing Jiang

Hi Guoqing,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on song-md/md-next]
[also build test WARNING on v5.18-rc5 next-20220505]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Guoqing-Jiang/two-fixes-for-md/20220505-162202
base:   git://git.kernel.org/pub/scm/linux/kernel/git/song/md.git md-next
config: mips-allyesconfig (https://download.01.org/0day-ci/archive/20220505/202205052148.TTOFRBQx-lkp@intel.com/config)
compiler: mips-linux-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/e8e9c97eb79c337a89a98a92106cfa6139a7c9e0
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Guoqing-Jiang/two-fixes-for-md/20220505-162202
        git checkout e8e9c97eb79c337a89a98a92106cfa6139a7c9e0
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=mips SHELL=/bin/bash drivers/md/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/md/md.c: In function 'md_reap_sync_thread':
>> drivers/md/md.c:9448:17: warning: ignoring return value of 'mddev_lock' declared with attribute 'warn_unused_result' [-Wunused-result]
    9448 |                 mddev_lock(mddev);
         |                 ^~~~~~~~~~~~~~~~~

vim +9448 drivers/md/md.c

  9434	
  9435	void md_reap_sync_thread(struct mddev *mddev)
  9436	{
  9437		struct md_rdev *rdev;
  9438		sector_t old_dev_sectors = mddev->dev_sectors;
  9439		bool is_reshaped = false, is_locked = false;
  9440	
  9441		if (mddev_is_locked(mddev)) {
  9442			is_locked = true;
  9443			mddev_unlock(mddev);
  9444		}
  9445		/* resync has finished, collect result */
  9446		md_unregister_thread(&mddev->sync_thread);
  9447		if (is_locked)
> 9448			mddev_lock(mddev);
  9449	
  9450		if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
  9451		    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
  9452		    mddev->degraded != mddev->raid_disks) {
  9453			/* success...*/
  9454			/* activate any spares */
  9455			if (mddev->pers->spare_active(mddev)) {
  9456				sysfs_notify_dirent_safe(mddev->sysfs_degraded);
  9457				set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
  9458			}
  9459		}
  9460		if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
  9461		    mddev->pers->finish_reshape) {
  9462			mddev->pers->finish_reshape(mddev);
  9463			if (mddev_is_clustered(mddev))
  9464				is_reshaped = true;
  9465		}
  9466	
  9467		/* If array is no-longer degraded, then any saved_raid_disk
  9468		 * information must be scrapped.
  9469		 */
  9470		if (!mddev->degraded)
  9471			rdev_for_each(rdev, mddev)
  9472				rdev->saved_raid_disk = -1;
  9473	
  9474		md_update_sb(mddev, 1);
  9475		/* MD_SB_CHANGE_PENDING should be cleared by md_update_sb, so we can
  9476		 * call resync_finish here if MD_CLUSTER_RESYNC_LOCKED is set by
  9477		 * clustered raid */
  9478		if (test_and_clear_bit(MD_CLUSTER_RESYNC_LOCKED, &mddev->flags))
  9479			md_cluster_ops->resync_finish(mddev);
  9480		clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
  9481		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
  9482		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
  9483		clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
  9484		clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
  9485		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
  9486		/*
  9487		 * We call md_cluster_ops->update_size here because sync_size could
  9488		 * be changed by md_update_sb, and MD_RECOVERY_RESHAPE is cleared,
  9489		 * so it is time to update size across cluster.
  9490		 */
  9491		if (mddev_is_clustered(mddev) && is_reshaped
  9492		    && !test_bit(MD_CLOSING, &mddev->flags))
  9493			md_cluster_ops->update_size(mddev, old_dev_sectors);
  9494		wake_up(&resync_wait);
  9495		/* flag recovery needed just to double check */
  9496		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
  9497		sysfs_notify_dirent_safe(mddev->sysfs_action);
  9498		md_new_event();
  9499		if (mddev->event_work.func)
  9500			queue_work(md_misc_wq, &mddev->event_work);
  9501	}
  9502	EXPORT_SYMBOL(md_reap_sync_thread);
  9503	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp
* Re: [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held
From: kernel test robot @ 2022-05-05 18:04 UTC
To: Guoqing Jiang, song; +Cc: llvm, kbuild-all, buczek, linux-raid, Guoqing Jiang

Hi Guoqing,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on song-md/md-next]
[also build test WARNING on v5.18-rc5 next-20220505]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Guoqing-Jiang/two-fixes-for-md/20220505-162202
base:   git://git.kernel.org/pub/scm/linux/kernel/git/song/md.git md-next
config: hexagon-randconfig-r045-20220505 (https://download.01.org/0day-ci/archive/20220506/202205060116.J42KgtCW-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 5e004fb787698440a387750db7f8028e7cb14cfc)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/e8e9c97eb79c337a89a98a92106cfa6139a7c9e0
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Guoqing-Jiang/two-fixes-for-md/20220505-162202
        git checkout e8e9c97eb79c337a89a98a92106cfa6139a7c9e0
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash drivers/md/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/md/md.c:9448:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
                   mddev_lock(mddev);
                   ^~~~~~~~~~ ~~~~~
   1 warning generated.

vim +/warn_unused_result +9448 drivers/md/md.c

[code context identical to the previous report]

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp
* Re: [PATCH V3 1/2] md: don't unregister sync_thread with reconfig_mutex held
From: Guoqing Jiang @ 2022-05-06  2:34 UTC
To: kernel test robot, song; +Cc: llvm, kbuild-all, buczek, linux-raid

On 5/6/22 2:04 AM, kernel test robot wrote:
> Hi Guoqing,
>
> Thank you for the patch! Perhaps something to improve:
[...]
> All warnings (new ones prefixed by >>):
>
>>> drivers/md/md.c:9448:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
>                    mddev_lock(mddev);
>                    ^~~~~~~~~~ ~~~~~
>    1 warning generated.

Thanks! I should call mddev_lock_nointr here.

Guoqing
* [PATCH 2/2] md: protect md_unregister_thread from reentrancy
From: Guoqing Jiang @ 2022-05-05  8:16 UTC
To: song; +Cc: buczek, linux-raid, Guoqing Jiang, Guoqing Jiang

From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

Generally, md_unregister_thread is called with reconfig_mutex held, but
raid_message in dm-raid doesn't hold reconfig_mutex when it unregisters
the thread, so in theory md_unregister_thread can be called
simultaneously from two call sites.

And after the previous commit, which removes the protection of
reconfig_mutex from md_unregister_thread completely, the potential
issue could be worse than before.

Let's take pers_lock at the beginning of the function to guard against
reentrancy.

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
 drivers/md/md.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a70e7f0f9268..c401e063bec8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7962,17 +7962,22 @@ EXPORT_SYMBOL(md_register_thread);
 
 void md_unregister_thread(struct md_thread **threadp)
 {
-	struct md_thread *thread = *threadp;
-	if (!thread)
-		return;
-	pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
-	/* Locking ensures that mddev_unlock does not wake_up a
+	struct md_thread *thread;
+
+	/*
+	 * Locking ensures that mddev_unlock does not wake_up a
 	 * non-existent thread
 	 */
 	spin_lock(&pers_lock);
+	thread = *threadp;
+	if (!thread) {
+		spin_unlock(&pers_lock);
+		return;
+	}
 	*threadp = NULL;
 	spin_unlock(&pers_lock);
 
+	pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
 	kthread_stop(thread->tsk);
 	kfree(thread);
 }
-- 
2.31.1
* Re: [PATCH 2/2] md: protect md_unregister_thread from reentrancy
From: Song Liu @ 2022-05-09  6:39 UTC
To: Guoqing Jiang; +Cc: Donald Buczek, linux-raid, Guoqing Jiang

On Thu, May 5, 2022 at 1:18 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
> From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
>
> Generally, md_unregister_thread is called with reconfig_mutex held, but
> raid_message in dm-raid doesn't hold reconfig_mutex when it unregisters
> the thread, so in theory md_unregister_thread can be called
> simultaneously from two call sites.

Can we add lock/unlock into raid_message? Are there some constraints here?

Thanks,
Song

[...]
* Re: [PATCH 2/2] md: protect md_unregister_thread from reentrancy
From: Guoqing Jiang @ 2022-05-09  8:12 UTC
To: Song Liu; +Cc: Donald Buczek, linux-raid, Guoqing Jiang

On 5/9/22 2:39 PM, Song Liu wrote:
> On Thu, May 5, 2022 at 1:18 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>> Generally, md_unregister_thread is called with reconfig_mutex held, but
>> raid_message in dm-raid doesn't hold reconfig_mutex when it unregisters
>> the thread, so in theory md_unregister_thread can be called
>> simultaneously from two call sites.
> Can we add lock/unlock into raid_message? Are there some constraints here?

Honestly, I don't know dm-raid well, and there was no reply from the dm
people as you might know, so I'd prefer to leave it as it is.

Thanks,
Guoqing
* [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
From: Guoqing Jiang @ 2022-05-06 11:36 UTC
To: song; +Cc: buczek, linux-raid

From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

Unregistering the sync_thread doesn't need to hold reconfig_mutex since
it doesn't reconfigure the array, and holding it can deadlock raid5 as
follows:

1. Process A tries to reap the sync thread with reconfig_mutex held
   after echoing "idle" to sync_action.
2. The raid5 sync thread blocks because there are too many active
   stripes.
3. SB_CHANGE_PENDING is set (because of write IO coming from the upper
   layer), which prevents the number of active stripes from decreasing.
4. SB_CHANGE_PENDING can't be cleared since md_check_recovery is not
   able to take reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t

Let's call unregister thread between mddev_unlock and mddev_lock_nointr
(thanks for the report from kernel test robot <lkp@intel.com>) if
reconfig_mutex is held, and introduce mddev_is_locked accordingly.

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
 drivers/md/md.c | 9 ++++++++-
 drivers/md/md.h | 5 +++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 309b3af906ad..525f65682356 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9432,10 +9432,17 @@ void md_reap_sync_thread(struct mddev *mddev)
 {
 	struct md_rdev *rdev;
 	sector_t old_dev_sectors = mddev->dev_sectors;
-	bool is_reshaped = false;
+	bool is_reshaped = false, is_locked = false;
 
+	if (mddev_is_locked(mddev)) {
+		is_locked = true;
+		mddev_unlock(mddev);
+	}
 	/* resync has finished, collect result */
 	md_unregister_thread(&mddev->sync_thread);
+	if (is_locked)
+		mddev_lock_nointr(mddev);
+
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
 	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
 	    mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 6ac283864533..af6f3978b62b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -549,6 +549,11 @@ static inline int mddev_trylock(struct mddev *mddev)
 }
 extern void mddev_unlock(struct mddev *mddev);
 
+static inline int mddev_is_locked(struct mddev *mddev)
+{
+	return mutex_is_locked(&mddev->reconfig_mutex);
+}
+
 static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
 {
 	atomic_add(nr_sectors, &bdev->bd_disk->sync_io);
-- 
2.31.1
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
From: Song Liu @ 2022-05-09  6:37 UTC
To: Guoqing Jiang; +Cc: Donald Buczek, linux-raid

On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
> From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
>
> Unregistering the sync_thread doesn't need to hold reconfig_mutex since
> it doesn't reconfigure the array, and holding it can deadlock raid5.
[...]
> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
> (thanks for the report from kernel test robot <lkp@intel.com>) if
> reconfig_mutex is held, and introduce mddev_is_locked accordingly.

mddev_is_locked() feels really hacky to me. It cannot tell whether mddev
is locked by the current thread. So technically, we could unlock
reconfig_mutex for another thread by accident, no?

[...]
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
From: Guoqing Jiang @ 2022-05-09  8:09 UTC
To: Song Liu; +Cc: Donald Buczek, linux-raid

On 5/9/22 2:37 PM, Song Liu wrote:
> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>> [...]
>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
>> (thanks for the report from kernel test robot <lkp@intel.com>) if
>> reconfig_mutex is held, and introduce mddev_is_locked accordingly.
> mddev_is_locked() feels really hacky to me. It cannot tell whether mddev
> is locked by the current thread. So technically, we could unlock
> reconfig_mutex for another thread by accident, no?

I can switch back to V2 if you think that is the correct way to go,
though no one commented on the change in dm-raid.

Thanks,
Guoqing
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
From: Wols Lists @ 2022-05-09  9:32 UTC
To: Guoqing Jiang, Song Liu; +Cc: Donald Buczek, linux-raid

On 09/05/2022 09:09, Guoqing Jiang wrote:
> I can switch back to V2 if you think that is the correct way to go,
> though no one commented on the change in dm-raid.

DON'T QUOTE ME ON THIS

but it is very confusing, as I believe dm-raid is part of Device Mapper,
which is actually nothing to do with md-raid (apart from, like many
block layer drivers, md-raid uses dm to manage the device beneath it.)

Cheers,
Wol
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-09 9:32 ` Wols Lists @ 2022-05-09 10:37 ` Guoqing Jiang 2022-05-09 11:19 ` Wols Lists 0 siblings, 1 reply; 45+ messages in thread From: Guoqing Jiang @ 2022-05-09 10:37 UTC (permalink / raw) To: Wols Lists, Song Liu; +Cc: Donald Buczek, linux-raid On 5/9/22 5:32 PM, Wols Lists wrote: > On 09/05/2022 09:09, Guoqing Jiang wrote: >> I can switch back to V2 if you think that is the correct way to do >> though no >> one comment about the change in dm-raid. > > DON'T QUOTE ME ON THIS > > but it is very confusing, as I believe dm-raid is part of Device > Managment, which is actually nothing to do with md-raid No, dm-raid (I mean dm-raid.c here) is a bridge between the dm framework and md raid, which lets dm reuse the md raid code instead of re-implementing the raid logic inside dm itself. > (apart from, like many block layer drivers, md-raid uses dm to manage > the device beneath it.) The above statement is very confusing. Thanks, Guoqing ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-09 10:37 ` Guoqing Jiang @ 2022-05-09 11:19 ` Wols Lists 2022-05-09 11:26 ` Guoqing Jiang 0 siblings, 1 reply; 45+ messages in thread From: Wols Lists @ 2022-05-09 11:19 UTC (permalink / raw) To: Guoqing Jiang, Song Liu; +Cc: Donald Buczek, linux-raid On 09/05/2022 11:37, Guoqing Jiang wrote: > > > On 5/9/22 5:32 PM, Wols Lists wrote: >> On 09/05/2022 09:09, Guoqing Jiang wrote: >>> I can switch back to V2 if you think that is the correct way to do >>> though no >>> one comment about the change in dm-raid. >> >> DON'T QUOTE ME ON THIS >> >> but it is very confusing, as I believe dm-raid is part of Device >> Managment, which is actually nothing to do with md-raid > > > No, dm-raid (I mean dm-raid.c here) is a bridge between dm framwork and > md raid, which > makes dm can reuse md raid code to avoid re-implementation raid logic > inside dm itself. Hmm... Which backs up what I thought - if it's used by the dm people to call md-raid, then asking for info/comments about it on the md list is likely to get you no-where... >> (apart from, like many block layer drivers, md-raid uses dm to manage >> the device beneath it.) > > The above statement is very confusing. > Yup. I've never understood it. But, from all my time on the list, it is noticeable that nobody seems to know or care much about dm, and this is simply one more data point that agrees with that impression. If you want to know about dm-raid, you probably need to go where the dm people hang out. Cheers, Wol ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-09 11:19 ` Wols Lists @ 2022-05-09 11:26 ` Guoqing Jiang 0 siblings, 0 replies; 45+ messages in thread From: Guoqing Jiang @ 2022-05-09 11:26 UTC (permalink / raw) To: Wols Lists, Song Liu; +Cc: Donald Buczek, linux-raid On 5/9/22 7:19 PM, Wols Lists wrote: > > Hmm... > > Which backs up what I thought - if it's used by the dm people to call > md-raid, then asking for info/comments about it on the md list is > likely to get you no-where... I don't think you looked closely at V2, which had been cc'ed to the dm list, just FYI. https://lore.kernel.org/all/1613177399-22024-1-git-send-email-guoqing.jiang@cloud.ionos.com/#t Thanks, Guoqing ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-09 8:09 ` Guoqing Jiang 2022-05-09 9:32 ` Wols Lists @ 2022-05-10 6:44 ` Song Liu 2022-05-10 12:01 ` Donald Buczek 2022-05-20 18:27 ` Logan Gunthorpe 1 sibling, 2 replies; 45+ messages in thread From: Song Liu @ 2022-05-10 6:44 UTC (permalink / raw) To: Guoqing Jiang; +Cc: Donald Buczek, linux-raid On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: > > > > On 5/9/22 2:37 PM, Song Liu wrote: > > On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev> wrote: > >> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com> > >> > >> Unregister sync_thread doesn't need to hold reconfig_mutex since it > >> doesn't reconfigure array. > >> > >> And it could cause deadlock problem for raid5 as follows: > >> > >> 1. process A tried to reap sync thread with reconfig_mutex held after echo > >> idle to sync_action. > >> 2. raid5 sync thread was blocked if there were too many active stripes. > >> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer) > >> which causes the number of active stripes can't be decreased. > >> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able > >> to hold reconfig_mutex. > >> > >> More details in the link: > >> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t > >> > >> Let's call unregister thread between mddev_unlock and mddev_lock_nointr > >> (thanks for the report from kernel test robot<lkp@intel.com>) if the > >> reconfig_mutex is held, and mddev_is_locked is introduced accordingly. > > mddev_is_locked() feels really hacky to me. It cannot tell whether > > mddev is locked > > by current thread. So technically, we can unlock reconfigure_mutex for > > other thread > > by accident, no? > > I can switch back to V2 if you think that is the correct way to do though no > one comment about the change in dm-raid. I guess v2 is the best at the moment. 
I pushed a slightly modified v2 to md-next. Thanks for working on this. Song ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-10 6:44 ` Song Liu @ 2022-05-10 12:01 ` Donald Buczek 2022-05-10 12:09 ` Guoqing Jiang 2022-05-20 18:27 ` Logan Gunthorpe 1 sibling, 1 reply; 45+ messages in thread From: Donald Buczek @ 2022-05-10 12:01 UTC (permalink / raw) To: Song Liu, Guoqing Jiang; +Cc: linux-raid On 5/10/22 8:44 AM, Song Liu wrote: > On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: >> >> >> >> On 5/9/22 2:37 PM, Song Liu wrote: >>> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev> wrote: >>>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com> >>>> >>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it >>>> doesn't reconfigure array. >>>> >>>> And it could cause deadlock problem for raid5 as follows: >>>> >>>> 1. process A tried to reap sync thread with reconfig_mutex held after echo >>>> idle to sync_action. >>>> 2. raid5 sync thread was blocked if there were too many active stripes. >>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer) >>>> which causes the number of active stripes can't be decreased. >>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able >>>> to hold reconfig_mutex. >>>> >>>> More details in the link: >>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t >>>> >>>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr >>>> (thanks for the report from kernel test robot<lkp@intel.com>) if the >>>> reconfig_mutex is held, and mddev_is_locked is introduced accordingly. >>> mddev_is_locked() feels really hacky to me. It cannot tell whether >>> mddev is locked >>> by current thread. So technically, we can unlock reconfigure_mutex for >>> other thread >>> by accident, no? >> >> I can switch back to V2 if you think that is the correct way to do though no >> one comment about the change in dm-raid. 
> > I guess v2 is the best at the moment. I pushed a slightly modified v2 to > md-next. I think this can be used to trigger a double-free in md_unregister_thread. Please review https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/ Best Donald > > Thanks for working on this. > Song > -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-10 12:01 ` Donald Buczek @ 2022-05-10 12:09 ` Guoqing Jiang 2022-05-10 12:35 ` Donald Buczek 0 siblings, 1 reply; 45+ messages in thread From: Guoqing Jiang @ 2022-05-10 12:09 UTC (permalink / raw) To: Donald Buczek, Song Liu; +Cc: linux-raid On 5/10/22 8:01 PM, Donald Buczek wrote: > >> I guess v2 is the best at the moment. I pushed a slightly modified v2 to >> md-next. > > I think, this can be used to get a double-free from md_unregister_thread. > > Please review > > https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/ That is supposed to be addressed by the second one, pls consider it too. [PATCH 2/2] md: protect md_unregister_thread from reentrancy Thanks, Guoqing ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-10 12:09 ` Guoqing Jiang @ 2022-05-10 12:35 ` Donald Buczek 2022-05-10 18:02 ` Song Liu 0 siblings, 1 reply; 45+ messages in thread From: Donald Buczek @ 2022-05-10 12:35 UTC (permalink / raw) To: Guoqing Jiang, Song Liu; +Cc: linux-raid On 5/10/22 2:09 PM, Guoqing Jiang wrote: > > > On 5/10/22 8:01 PM, Donald Buczek wrote: >> >>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to >>> md-next. >> >> I think, this can be used to get a double-free from md_unregister_thread. >> >> Please review >> >> https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/ > > That is supposed to be addressed by the second one, pls consider it too. Right, but this has not been pulled into md-next. I just wanted to note that the current state of md-next has this problem. If the other patch is taken too, and works as intended, that would be solved. > [PATCH 2/2] md: protect md_unregister_thread from reentrancy Looks good to me. Best Donald > > Thanks, > Guoqing -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-10 12:35 ` Donald Buczek @ 2022-05-10 18:02 ` Song Liu 2022-05-11 8:10 ` Guoqing Jiang 0 siblings, 1 reply; 45+ messages in thread From: Song Liu @ 2022-05-10 18:02 UTC (permalink / raw) To: Donald Buczek; +Cc: Guoqing Jiang, linux-raid On Tue, May 10, 2022 at 5:35 AM Donald Buczek <buczek@molgen.mpg.de> wrote: > > On 5/10/22 2:09 PM, Guoqing Jiang wrote: > > > > > > On 5/10/22 8:01 PM, Donald Buczek wrote: > >> > >>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to > >>> md-next. > >> > >> I think, this can be used to get a double-free from md_unregister_thread. > >> > >> Please review > >> > >> https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/ > > > > That is supposed to be addressed by the second one, pls consider it too. > > Right, but this has not been pulled into md-next. I just wanted to note, that the current state of md-next has this problem. > > If the other patch is taken, too, and works as intended, that would be solved. > > > [PATCH 2/2] md: protect md_unregister_thread from reentrancy Good catch! Guoqing, current 2/2 doesn't apply cleanly. Could you please resend it on top of md-next? Thanks, Song ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-10 18:02 ` Song Liu @ 2022-05-11 8:10 ` Guoqing Jiang 2022-05-11 21:45 ` Song Liu 0 siblings, 1 reply; 45+ messages in thread From: Guoqing Jiang @ 2022-05-11 8:10 UTC (permalink / raw) To: Song Liu, Donald Buczek; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 1621 bytes --] On 5/11/22 2:02 AM, Song Liu wrote: > On Tue, May 10, 2022 at 5:35 AM Donald Buczek <buczek@molgen.mpg.de> wrote: >> On 5/10/22 2:09 PM, Guoqing Jiang wrote: >>> >>> On 5/10/22 8:01 PM, Donald Buczek wrote: >>>>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to >>>>> md-next. >>>> I think, this can be used to get a double-free from md_unregister_thread. >>>> >>>> Please review >>>> >>>> https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/ >>> That is supposed to be addressed by the second one, pls consider it too. >> Right, but this has not been pulled into md-next. I just wanted to note, that the current state of md-next has this problem. Thanks for reminder. >> If the other patch is taken, too, and works as intended, that would be solved. >> >>> [PATCH 2/2] md: protect md_unregister_thread from reentrancy > Good catch! > > Guoqing, current 2/2 doesn't apply cleanly. Could you please resend it on top of > md-next? Hmm, no issue from my side. ~/source/md> git am 0001-md-protect-md_unregister_thread-from-reentrancy.patch Applying: md: protect md_unregister_thread from reentrancy ~/source/md> git log --oneline |head -5 dc7147a88766 md: protect md_unregister_thread from reentrancy 5a36c493dc82 md: don't unregister sync_thread with reconfig_mutex held 49c3b9266a71 block: null_blk: Improve device creation with configfs db060f54e0c5 block: null_blk: Cleanup messages b3a0a73e8a79 block: null_blk: Cleanup device creation and deletion Anyway, it is attached. I will rebase it to your latest tree if something gets wrong. 
Thanks, Guoqing [-- Attachment #2: 0001-md-protect-md_unregister_thread-from-reentrancy.patch --] [-- Type: text/x-patch, Size: 1751 bytes --] From a2da80f62f15023e3fee7a02488c143dfff647b3 Mon Sep 17 00:00:00 2001 From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Date: Fri, 29 Apr 2022 16:49:09 +0800 Subject: [PATCH 2/2] md: protect md_unregister_thread from reentrancy Generally, the md_unregister_thread is called with reconfig_mutex, but raid_message in dm-raid doesn't hold reconfig_mutex to unregister thread, so md_unregister_thread can be called simulitaneously from two call sites in theory. Then after previous commit which remove the protection of reconfig_mutex for md_unregister_thread completely, the potential issue could be worse than before. Let's take pers_lock at the beginning of function to ensure reentrancy. Reported-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> --- drivers/md/md.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index a70e7f0f9268..c401e063bec8 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -7962,17 +7962,22 @@ EXPORT_SYMBOL(md_register_thread); void md_unregister_thread(struct md_thread **threadp) { - struct md_thread *thread = *threadp; - if (!thread) - return; - pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk)); - /* Locking ensures that mddev_unlock does not wake_up a + struct md_thread *thread; + + /* + * Locking ensures that mddev_unlock does not wake_up a * non-existent thread */ spin_lock(&pers_lock); + thread = *threadp; + if (!thread) { + spin_unlock(&pers_lock); + return; + } *threadp = NULL; spin_unlock(&pers_lock); + pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk)); kthread_stop(thread->tsk); kfree(thread); } -- 2.31.1 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-11 8:10 ` Guoqing Jiang @ 2022-05-11 21:45 ` Song Liu 0 siblings, 0 replies; 45+ messages in thread From: Song Liu @ 2022-05-11 21:45 UTC (permalink / raw) To: Guoqing Jiang; +Cc: Donald Buczek, linux-raid On Wed, May 11, 2022 at 1:10 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: > > > > On 5/11/22 2:02 AM, Song Liu wrote: > > On Tue, May 10, 2022 at 5:35 AM Donald Buczek <buczek@molgen.mpg.de> wrote: > >> On 5/10/22 2:09 PM, Guoqing Jiang wrote: > >>> > >>> On 5/10/22 8:01 PM, Donald Buczek wrote: > >>>>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to > >>>>> md-next. > >>>> I think, this can be used to get a double-free from md_unregister_thread. > >>>> > >>>> Please review > >>>> > >>>> https://lore.kernel.org/linux-raid/8312a154-14fb-6f07-0cf1-8c970187cc49@molgen.mpg.de/ > >>> That is supposed to be addressed by the second one, pls consider it too. > >> Right, but this has not been pulled into md-next. I just wanted to note, that the current state of md-next has this problem. > > Thanks for reminder. > > >> If the other patch is taken, too, and works as intended, that would be solved. > >> > >>> [PATCH 2/2] md: protect md_unregister_thread from reentrancy > > Good catch! > > > > Guoqing, current 2/2 doesn't apply cleanly. Could you please resend it on top of > > md-next? > > Hmm, no issue from my side. > > ~/source/md> git am > 0001-md-protect-md_unregister_thread-from-reentrancy.patch > Applying: md: protect md_unregister_thread from reentrancy > > ~/source/md> git log --oneline |head -5 > dc7147a88766 md: protect md_unregister_thread from reentrancy > 5a36c493dc82 md: don't unregister sync_thread with reconfig_mutex held > 49c3b9266a71 block: null_blk: Improve device creation with configfs > db060f54e0c5 block: null_blk: Cleanup messages > b3a0a73e8a79 block: null_blk: Cleanup device creation and deletion > > Anyway, it is attached. 
> I will rebase it to your latest tree if something gets wrong. Applied to md-next. Thanks! Song ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-10 6:44 ` Song Liu 2022-05-10 12:01 ` Donald Buczek @ 2022-05-20 18:27 ` Logan Gunthorpe 2022-05-21 18:23 ` Donald Buczek 2022-06-02 8:12 ` Xiao Ni 1 sibling, 2 replies; 45+ messages in thread From: Logan Gunthorpe @ 2022-05-20 18:27 UTC (permalink / raw) To: Song Liu, Guoqing Jiang; +Cc: Donald Buczek, linux-raid Hi, On 2022-05-10 00:44, Song Liu wrote: > On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: >> On 5/9/22 2:37 PM, Song Liu wrote: >>> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev> wrote: >>>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com> >>>> >>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it >>>> doesn't reconfigure array. >>>> >>>> And it could cause deadlock problem for raid5 as follows: >>>> >>>> 1. process A tried to reap sync thread with reconfig_mutex held after echo >>>> idle to sync_action. >>>> 2. raid5 sync thread was blocked if there were too many active stripes. >>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer) >>>> which causes the number of active stripes can't be decreased. >>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able >>>> to hold reconfig_mutex. >>>> >>>> More details in the link: >>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t >>>> >>>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr >>>> (thanks for the report from kernel test robot<lkp@intel.com>) if the >>>> reconfig_mutex is held, and mddev_is_locked is introduced accordingly. >>> mddev_is_locked() feels really hacky to me. It cannot tell whether >>> mddev is locked >>> by current thread. So technically, we can unlock reconfigure_mutex for >>> other thread >>> by accident, no? 
>> >> I can switch back to V2 if you think that is the correct way to do though no >> one comment about the change in dm-raid. > > I guess v2 is the best at the moment. I pushed a slightly modified v2 to > md-next. > I noticed a clear regression with mdadm tests with this patch in md-next (7e6ba434cc6080). Before the patch, tests 07reshape5intr and 07revert-grow would fail fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs for the latter). After this patch, both tests always fail. I don't have time to dig into why this is, but it would be nice if someone can at least fix the regression. It is hard to make any progress on these tests if we are continuing to further break them. Logan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-20 18:27 ` Logan Gunthorpe @ 2022-05-21 18:23 ` Donald Buczek 2022-05-23 1:08 ` Guoqing Jiang 2022-05-24 15:51 ` Logan Gunthorpe 2022-06-02 8:12 ` Xiao Ni 1 sibling, 2 replies; 45+ messages in thread From: Donald Buczek @ 2022-05-21 18:23 UTC (permalink / raw) To: Logan Gunthorpe, Song Liu, Guoqing Jiang; +Cc: linux-raid On 20.05.22 20:27, Logan Gunthorpe wrote: > > Hi, > > On 2022-05-10 00:44, Song Liu wrote: >> On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: >>> On 5/9/22 2:37 PM, Song Liu wrote: >>>> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev> wrote: >>>>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com> >>>>> >>>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it >>>>> doesn't reconfigure array. >>>>> >>>>> And it could cause deadlock problem for raid5 as follows: >>>>> >>>>> 1. process A tried to reap sync thread with reconfig_mutex held after echo >>>>> idle to sync_action. >>>>> 2. raid5 sync thread was blocked if there were too many active stripes. >>>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer) >>>>> which causes the number of active stripes can't be decreased. >>>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able >>>>> to hold reconfig_mutex. >>>>> >>>>> More details in the link: >>>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t >>>>> >>>>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr >>>>> (thanks for the report from kernel test robot<lkp@intel.com>) if the >>>>> reconfig_mutex is held, and mddev_is_locked is introduced accordingly. >>>> mddev_is_locked() feels really hacky to me. It cannot tell whether >>>> mddev is locked >>>> by current thread. So technically, we can unlock reconfigure_mutex for >>>> other thread >>>> by accident, no? 
>>> >>> I can switch back to V2 if you think that is the correct way to do though no >>> one comment about the change in dm-raid. >> >> I guess v2 is the best at the moment. I pushed a slightly modified v2 to >> md-next. >> > > I noticed a clear regression with mdadm tests with this patch in md-next > (7e6ba434cc6080). > > Before the patch, tests 07reshape5intr and 07revert-grow would fail > fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs > for the latter). > > After this patch, both tests always fail. > > I don't have time to dig into why this is, but it would be nice if > someone can at least fix the regression. It is hard to make any progress > on these tests if we are continuing to further break them. Hmmm. I wanted to try to help a bit by reproducing and digging into this. But it seems that more or less ALL tests hang my system one way or another. I've used a qemu/kvm machine with md-next and mdraid master. Is this supposed to work? I can investigate the bugs I see, but probably that is a waste of time because I'm doing something wrong fundamentally? This is an example from 00raid0: [ 57.434064] md: md0 stopped. 
[ 57.586951] md0: detected capacity change from 0 to 107520 [ 57.618454] BUG: kernel NULL pointer dereference, address: 0000000000000094 [ 57.620830] #PF: supervisor read access in kernel mode [ 57.622554] #PF: error_code(0x0000) - not-present page [ 57.624273] PGD 800000010d5ee067 P4D 800000010d5ee067 PUD 10df28067 PMD 0 [ 57.626548] Oops: 0000 [#1] PREEMPT SMP PTI [ 57.627942] CPU: 3 PID: 1064 Comm: mkfs.ext3 Not tainted 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77 [ 57.630952] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 [ 57.635927] RIP: 0010:bfq_bio_bfqg+0x26/0x80 [ 57.638027] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b 5c c2 08 <80> bb 94 00 00 00 00 70 [ 57.645295] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006 [ 57.647414] RAX: 0000000000000018 RBX: 0000000000000000 RCX: 0000000000000001 [ 57.650039] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: ffff8881032ba180 [ 57.652541] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: ffff88810318cb00 [ 57.654852] R10: 0000000000000000 R11: ffff8881032ba180 R12: ffff88810318cae0 [ 57.657128] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: ffffc90001c27c00 [ 57.659316] FS: 00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) knlGS:0000000000000000 [ 57.661700] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 57.663461] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 0000000000170ee0 [ 57.665453] Call Trace: [ 57.666479] <TASK> [ 57.667382] bfq_bic_update_cgroup+0x28/0x1b0 [ 57.668724] bfq_insert_requests+0x233/0x2340 [ 57.670049] ? ioc_find_get_icq+0x21c/0x2a0 [ 57.671315] ? 
bfq_prepare_request+0x11/0x30 [ 57.672565] blk_mq_sched_insert_requests+0x5c/0x150 [ 57.673891] blk_mq_flush_plug_list+0xe1/0x2a0 [ 57.675140] __blk_flush_plug+0xdf/0x120 [ 57.676259] io_schedule_prepare+0x3d/0x50 [ 57.677373] io_schedule_timeout+0xf/0x40 [ 57.678465] wait_for_completion_io+0x78/0x140 [ 57.679578] submit_bio_wait+0x5b/0x80 [ 57.680575] blkdev_issue_discard+0x65/0xb0 [ 57.681640] blkdev_common_ioctl+0x391/0x8f0 [ 57.682712] blkdev_ioctl+0x216/0x2a0 [ 57.683648] __x64_sys_ioctl+0x76/0xb0 [ 57.684607] do_syscall_64+0x42/0x90 [ 57.685527] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 57.686645] RIP: 0033:0x7fdfce56dc17 [ 57.687535] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 08 [ 57.691055] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 57.692537] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: 00007fdfce56dc17 [ 57.693905] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: 0000000000000003 [ 57.695288] RBP: 0000000000460960 R08: 0000000000000400 R09: 0000000000000000 [ 57.696645] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 57.697954] R13: 000000000000d200 R14: 0000000000000000 R15: 0000000000000000 [ 57.699281] </TASK> [ 57.699901] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q garp stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm drm_kms_helper kvm drm fb_sys_fops vi4 [ 57.705955] CR2: 0000000000000094 [ 57.706710] ---[ end trace 0000000000000000 ]--- [ 57.707599] RIP: 0010:bfq_bio_bfqg+0x26/0x80 [ 57.708434] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b 5c c2 08 <80> bb 94 00 00 00 00 70 [ 57.711426] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006 [ 57.712391] RAX: 0000000000000018 RBX: 0000000000000000 RCX: 0000000000000001 [ 57.713605] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: ffff8881032ba180 [ 
57.714811] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: ffff88810318cb00 [ 57.716018] R10: 0000000000000000 R11: ffff8881032ba180 R12: ffff88810318cae0 [ 57.717236] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: ffffc90001c27c00 [ 57.718438] FS: 00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) knlGS:0000000000000000 [ 57.719778] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 57.720808] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 0000000000170ee0 [ 57.722019] note: mkfs.ext3[1064] exited with preempt_count 1 [ 57.723067] ------------[ cut here ]------------ [ 57.723960] WARNING: CPU: 3 PID: 1064 at kernel/exit.c:741 do_exit+0x8cb/0xbc0 [ 57.725196] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q garp stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm drm_kms_helper kvm drm fb_sys_fops vi4 [ 57.731011] CPU: 3 PID: 1064 Comm: mkfs.ext3 Tainted: G D 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77 [ 57.732704] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 [ 57.734853] RIP: 0010:do_exit+0x8cb/0xbc0 [ 57.735711] Code: e9 13 ff ff ff 48 8b bb e0 04 00 00 31 f6 e8 4c db ff ff e9 98 fd ff ff 4c 89 e6 bf 05 06 00 00 e8 8a c8 00 00 e9 41 f8 ff ff <0f> 0b e9 6b f7 ff ff 4b [ 57.738851] RSP: 0018:ffffc90001c27ee8 EFLAGS: 00010082 [ 57.739899] RAX: 0000000000000000 RBX: ffff888101e48000 RCX: 0000000000000000 [ 57.741196] RDX: 0000000000000001 RSI: ffffffff8220a969 RDI: 0000000000000009 [ 57.742485] RBP: 0000000000000009 R08: 0000000000000000 R09: c0000000ffffbfff [ 57.743777] R10: 00007fdfce47d440 R11: ffffc90001c27d60 R12: 0000000000000009 [ 57.745081] R13: 0000000000000046 R14: 0000000000000000 R15: 0000000000000000 [ 57.746388] FS: 00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) knlGS:0000000000000000 [ 57.747806] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 57.748931] CR2: 0000000000000094 CR3: 000000010d438002 CR4: 0000000000170ee0 [ 57.750225] Call Trace: [ 57.750894] 
<TASK> [ 57.751535] make_task_dead+0x41/0xf0 [ 57.752369] rewind_stack_and_make_dead+0x17/0x17 [ 57.753336] RIP: 0033:0x7fdfce56dc17 [ 57.754155] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 08 [ 57.757318] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 57.758669] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: 00007fdfce56dc17 [ 57.759956] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: 0000000000000003 [ 57.761256] RBP: 0000000000460960 R08: 0000000000000400 R09: 0000000000000000 [ 57.762531] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 57.763806] R13: 000000000000d200 R14: 0000000000000000 R15: 0000000000000000 [ 57.765177] </TASK> [ 57.765813] ---[ end trace 0000000000000000 ]--- [ 57.790046] md0: detected capacity change from 107520 to 0 [ 57.792834] md: md0 stopped. [ 78.843853] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 78.845334] rcu: 10-...0: (0 ticks this GP) idle=07b/1/0x4000000000000000 softirq=1140/1140 fqs=4805 [ 78.847246] (detected by 13, t=21005 jiffies, g=9013, q=1419) [ 78.848619] Sending NMI from CPU 13 to CPUs 10: [ 78.849810] NMI backtrace for cpu 10 [ 78.849813] CPU: 10 PID: 1081 Comm: mdadm Tainted: G D W 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77 [ 78.849816] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 [ 78.849817] RIP: 0010:queued_spin_lock_slowpath+0x4c/0x1d0 [ 78.849832] Code: 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 75 0f b8 01 00 00 00 66 89 07 5b 5d 41 5c c3 f3 90 <8b> 07 84 c0 75 f8 eb e7 [ 78.849834] RSP: 0018:ffffc90001c9f9e0 EFLAGS: 00000002 [ 78.849837] RAX: 0000000000040101 RBX: ffff88810c914fc8 RCX: 0000000000000000 [ 78.849838] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888102177c30 [ 78.849840] RBP: 0000000000000000 R08: ffff88810c914fc8 
R09: ffff888106a4ed10 [ 78.849841] R10: ffffc90001c9fae8 R11: ffff888101b048d8 R12: ffff888103833000 [ 78.849842] R13: ffff888102177800 R14: ffffc90001c9fb20 R15: ffffc90001c9fa78 [ 78.849844] FS: 00007fd3d66c4340(0000) GS:ffff8882b5c80000(0000) knlGS:0000000000000000 [ 78.849847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 78.849848] CR2: 00000000004a5b58 CR3: 000000010d438001 CR4: 0000000000170ee0 [ 78.849850] Call Trace: [ 78.849853] <TASK> [ 78.849855] bfq_insert_requests+0xae/0x2340 [ 78.849862] ? submit_bio_noacct_nocheck+0x225/0x2b0 [ 78.849868] blk_mq_sched_insert_requests+0x5c/0x150 [ 78.849872] blk_mq_flush_plug_list+0xe1/0x2a0 [ 78.849876] __blk_flush_plug+0xdf/0x120 [ 78.849879] blk_finish_plug+0x27/0x40 [ 78.849882] read_pages+0x15b/0x360 [ 78.849891] page_cache_ra_unbounded+0x120/0x170 [ 78.849894] filemap_get_pages+0xdd/0x5f0 [ 78.849899] filemap_read+0xbf/0x350 [ 78.849902] ? __mod_memcg_lruvec_state+0x72/0xc0 [ 78.849907] ? __mod_lruvec_page_state+0xb4/0x160 [ 78.849909] ? folio_add_lru+0x51/0x80 [ 78.849912] ? _raw_spin_unlock+0x12/0x30 [ 78.849916] ? 
__handle_mm_fault+0xdee/0x14d0 [ 78.849921] blkdev_read_iter+0xa9/0x180 [ 78.849924] new_sync_read+0x109/0x180 [ 78.849929] vfs_read+0x187/0x1b0 [ 78.849932] ksys_read+0xa1/0xe0 [ 78.849935] do_syscall_64+0x42/0x90 [ 78.849938] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 78.849941] RIP: 0033:0x7fd3d6322f8e [ 78.849944] Code: c0 e9 c6 fe ff ff 48 8d 3d a7 07 0a 00 48 83 ec 08 e8 b6 e1 01 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 59 [ 78.849945] RSP: 002b:00007ffe92d46ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 78.849948] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd3d6322f8e [ 78.849949] RDX: 0000000000001000 RSI: 00000000004a3000 RDI: 0000000000000003 [ 78.849950] RBP: 0000000000000003 R08: 00000000004a3000 R09: 0000000000000003 [ 78.849951] R10: 00007fd3d623d0a8 R11: 0000000000000246 R12: 00000000004a2a60 [ 78.849952] R13: 0000000000000000 R14: 00000000004a3000 R15: 000000000048a4a0 [ 78.849954] </TASK> Best Donald > Logan -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-21 18:23 ` Donald Buczek @ 2022-05-23 1:08 ` Guoqing Jiang 2022-05-23 5:41 ` Donald Buczek 2022-05-24 15:51 ` Logan Gunthorpe 1 sibling, 1 reply; 45+ messages in thread From: Guoqing Jiang @ 2022-05-23 1:08 UTC (permalink / raw) To: Donald Buczek, Logan Gunthorpe, Song Liu; +Cc: linux-raid On 5/22/22 2:23 AM, Donald Buczek wrote: > On 20.05.22 20:27, Logan Gunthorpe wrote: >> >> Hi, >> >> On 2022-05-10 00:44, Song Liu wrote: >>> On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang >>> <guoqing.jiang@linux.dev> wrote: >>>> On 5/9/22 2:37 PM, Song Liu wrote: >>>>> On Fri, May 6, 2022 at 4:37 AM Guoqing >>>>> Jiang<guoqing.jiang@linux.dev> wrote: >>>>>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com> >>>>>> >>>>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it >>>>>> doesn't reconfigure array. >>>>>> >>>>>> And it could cause deadlock problem for raid5 as follows: >>>>>> >>>>>> 1. process A tried to reap sync thread with reconfig_mutex held >>>>>> after echo >>>>>> idle to sync_action. >>>>>> 2. raid5 sync thread was blocked if there were too many active >>>>>> stripes. >>>>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from >>>>>> upper layer) >>>>>> which causes the number of active stripes can't be decreased. >>>>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was >>>>>> not able >>>>>> to hold reconfig_mutex. >>>>>> >>>>>> More details in the link: >>>>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t >>>>>> >>>>>> >>>>>> Let's call unregister thread between mddev_unlock and >>>>>> mddev_lock_nointr >>>>>> (thanks for the report from kernel test robot<lkp@intel.com>) if the >>>>>> reconfig_mutex is held, and mddev_is_locked is introduced >>>>>> accordingly. >>>>> mddev_is_locked() feels really hacky to me. It cannot tell whether >>>>> mddev is locked >>>>> by current thread. 
So technically, we can unlock reconfigure_mutex >>>>> for >>>>> other thread >>>>> by accident, no? >>>> >>>> I can switch back to V2 if you think that is the correct way to do >>>> though no >>>> one comment about the change in dm-raid. >>> >>> I guess v2 is the best at the moment. I pushed a slightly modified >>> v2 to >>> md-next. >>> >> >> I noticed a clear regression with mdadm tests with this patch in md-next >> (7e6ba434cc6080). >> >> Before the patch, tests 07reshape5intr and 07revert-grow would fail >> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs >> for the latter). >> >> After this patch, both tests always fail. >> >> I don't have time to dig into why this is, but it would be nice if >> someone can at least fix the regression. It is hard to make any progress >> on these tests if we are continuing to further break them. > > Hmmm. I wanted to try to help a bit by reproducing and digging into this. > > But it seems that more or less ALL tests hang my system one way or > another. > > I've used a qemu/kvm machine with md-next and mdraid master. > > Is this supposed to work? > > I can investigate the bugs I see, but probably that is a waste of time > because I'm doing something wrong fundamentally? > > This is an example from 00raid0: > > [ 57.434064] md: md0 stopped. 
> [ ... ]

Looks like bfq or block issue, will try it from my side.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-23 1:08 ` Guoqing Jiang @ 2022-05-23 5:41 ` Donald Buczek 2022-05-23 9:51 ` Guoqing Jiang ` (2 more replies) 0 siblings, 3 replies; 45+ messages in thread From: Donald Buczek @ 2022-05-23 5:41 UTC (permalink / raw) To: Guoqing Jiang, Logan Gunthorpe, Song Liu, Logan Gunthorpe; +Cc: linux-raid On 23.05.22 03:08, Guoqing Jiang wrote: > > > On 5/22/22 2:23 AM, Donald Buczek wrote: >> On 20.05.22 20:27, Logan Gunthorpe wrote: >>> >>> Hi, >>> >>> On 2022-05-10 00:44, Song Liu wrote: >>>> On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: >>>>> On 5/9/22 2:37 PM, Song Liu wrote: >>>>>> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang<guoqing.jiang@linux.dev> wrote: >>>>>>> From: Guoqing Jiang<guoqing.jiang@cloud.ionos.com> >>>>>>> >>>>>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it >>>>>>> doesn't reconfigure array. >>>>>>> >>>>>>> And it could cause deadlock problem for raid5 as follows: >>>>>>> >>>>>>> 1. process A tried to reap sync thread with reconfig_mutex held after echo >>>>>>> idle to sync_action. >>>>>>> 2. raid5 sync thread was blocked if there were too many active stripes. >>>>>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer) >>>>>>> which causes the number of active stripes can't be decreased. >>>>>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able >>>>>>> to hold reconfig_mutex. >>>>>>> >>>>>>> More details in the link: >>>>>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t >>>>>>> >>>>>>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr >>>>>>> (thanks for the report from kernel test robot<lkp@intel.com>) if the >>>>>>> reconfig_mutex is held, and mddev_is_locked is introduced accordingly. >>>>>> mddev_is_locked() feels really hacky to me. 
It cannot tell whether >>>>>> mddev is locked >>>>>> by current thread. So technically, we can unlock reconfigure_mutex for >>>>>> other thread >>>>>> by accident, no? >>>>> >>>>> I can switch back to V2 if you think that is the correct way to do though no >>>>> one comment about the change in dm-raid. >>>> >>>> I guess v2 is the best at the moment. I pushed a slightly modified v2 to >>>> md-next. >>>> >>> >>> I noticed a clear regression with mdadm tests with this patch in md-next >>> (7e6ba434cc6080). >>> >>> Before the patch, tests 07reshape5intr and 07revert-grow would fail >>> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs >>> for the latter). >>> >>> After this patch, both tests always fail. >>> >>> I don't have time to dig into why this is, but it would be nice if >>> someone can at least fix the regression. It is hard to make any progress >>> on these tests if we are continuing to further break them. >> >> Hmmm. I wanted to try to help a bit by reproducing and digging into this. >> >> But it seems that more or less ALL tests hang my system one way or another. >> >> I've used a qemu/kvm machine with md-next and mdraid master. >> >> Is this supposed to work? >> >> I can investigate the bugs I see, but probably that is a waste of time because I'm doing something wrong fundamentally? >> >> This is an example from 00raid0: >> >> [ 57.434064] md: md0 stopped. 
>> [ ... ]
>
> Looks like bfq or block issue, will try it from my side.

FYI: I've used loop devices on a virtio disk.

I later discovered Logan's patches [1], which I was not aware of, as I'm not subscribed to the lists.
[1]: https://lore.kernel.org/linux-raid/20220519191311.17119-6-logang@deltatee.com/T/#u The series seems to acknowledge that there are open problems and tries to fix them. So I've used his md-bug branch from https://github.com/sbates130272/linux-p2pmem but it didn't look better. So I understand, the mdadm tests *are* supposed to work and every bug I see here is worth analyzing? Or is Logan hunting down everything anyway? Best Donald -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-23 5:41 ` Donald Buczek @ 2022-05-23 9:51 ` Guoqing Jiang 2022-05-24 16:13 ` Logan Gunthorpe 2022-05-30 9:55 ` Guoqing Jiang 2022-05-24 15:58 ` Logan Gunthorpe 2022-05-25 9:17 ` Guoqing Jiang 2 siblings, 2 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-23 9:51 UTC (permalink / raw)
To: Donald Buczek, Logan Gunthorpe, Song Liu; +Cc: linux-raid

>>>> I noticed a clear regression with mdadm tests with this patch in md-next
>>>> (7e6ba434cc6080).
>>>>
>>>> Before the patch, tests 07reshape5intr and 07revert-grow would fail
>>>> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs
>>>> for the latter).
>>>>
>>>> After this patch, both tests always fail.
>>>>
>>>> I don't have time to dig into why this is, but it would be nice if
>>>> someone can at least fix the regression. It is hard to make any progress
>>>> on these tests if we are continuing to further break them.

I have tried with both the Ubuntu 22.04 kernel (which is 5.15) and vanilla 5.12; neither of them can pass the tests you mentioned.

[root@localhost mdadm]# lsblk|grep vd
vda 252:0 0 1G 0 disk
vdb 252:16 0 1G 0 disk
vdc 252:32 0 1G 0 disk
vdd 252:48 0 1G 0 disk
[root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} --tests=05r1-add-internalbitmap
Testing on linux-5.12.0-default kernel
/root/mdadm/tests/05r1-add-internalbitmap... succeeded
[root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} --tests=07reshape5intr
Testing on linux-5.12.0-default kernel
/root/mdadm/tests/07reshape5intr... FAILED - see /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details
[root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} --tests=07revert-grow
Testing on linux-5.12.0-default kernel
/root/mdadm/tests/07revert-grow...
FAILED - see /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details
[root@localhost mdadm]# head -10 /var/tmp/07revert-grow.log | grep mdadm
+ . /root/mdadm/tests/07revert-grow
++ mdadm -CR --assume-clean /dev/md0 -l5 -n4 -x1 /dev/vda /dev/vdb /dev/vdc /dev/vdd /dev/vda /dev/vdb /dev/vdc /dev/vdd --metadata=0.9

The above line is clearly wrong from my understanding.

And let's check Ubuntu 22.04:

root@vm:/home/gjiang/mdadm# lsblk|grep vd
vda 252:0 0 1G 0 disk
vdb 252:16 0 1G 0 disk
vdc 252:32 0 1G 0 disk
root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..d} --tests=05r1-failfast
Testing on linux-5.15.0-30-generic kernel
/home/gjiang/mdadm/tests/05r1-failfast... succeeded
root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c} --tests=07reshape5intr
Testing on linux-5.15.0-30-generic kernel
/home/gjiang/mdadm/tests/07reshape5intr... FAILED - see /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details
root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c} --tests=07revert-grow
Testing on linux-5.15.0-30-generic kernel
/home/gjiang/mdadm/tests/07revert-grow... FAILED - see /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details

So I would not consider it a regression.

[ ... ]

> FYI: I've used loop devices on a virtio disk.
>
> I later discovered Logan's patches [1], which I was not aware of, as
> I'm not subscribed to the lists.
>
> [1]: https://lore.kernel.org/linux-raid/20220519191311.17119-6-logang@deltatee.com/T/#u
>
> The series seems to acknowledge that there are open problems and tries
> to fix them. So I've used his md-bug branch from
> https://github.com/sbates130272/linux-p2pmem but it didn't look better.

Thanks for your effort.

> So I understand, the mdadm tests *are* supposed to work and every bug
> I see here is worth analyzing? Or is Logan hunting down everything
> anyway?

Yes, they are supposed to work.
But unfortunately, it was kind of broken. The good news is that people are aware of it and are trying to make it work better; please see the links below. [1] https://lore.kernel.org/linux-raid/EA6887B4-2A44-49D0-ACF9-C04CC92AFD87@oracle.com/T/#t [2] https://lore.kernel.org/linux-raid/CALTww2-mbfZRcHu_95Q+WANXZ9ayRwjXvyvqP5Gseeb86dEy=w@mail.gmail.com/T/#t Thanks, Guoqing ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-23 9:51 ` Guoqing Jiang @ 2022-05-24 16:13 ` Logan Gunthorpe 2022-05-25 9:04 ` Guoqing Jiang 2022-05-30 9:55 ` Guoqing Jiang 1 sibling, 1 reply; 45+ messages in thread From: Logan Gunthorpe @ 2022-05-24 16:13 UTC (permalink / raw) To: Guoqing Jiang, Donald Buczek, Song Liu; +Cc: linux-raid On 2022-05-23 03:51, Guoqing Jiang wrote: > I have tried with both ubuntu 22.04 kernel which is 5.15 and vanilla > 5.12, none of them > can pass your mentioned tests. > > [root@localhost mdadm]# lsblk|grep vd > vda 252:0 0 1G 0 disk > vdb 252:16 0 1G 0 disk > vdc 252:32 0 1G 0 disk > vdd 252:48 0 1G 0 disk > [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} > --tests=05r1-add-internalbitmap > Testing on linux-5.12.0-default kernel > /root/mdadm/tests/05r1-add-internalbitmap... succeeded > [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} > --tests=07reshape5intr > Testing on linux-5.12.0-default kernel > /root/mdadm/tests/07reshape5intr... FAILED - see > /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details > [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} > --tests=07revert-grow > Testing on linux-5.12.0-default kernel > /root/mdadm/tests/07revert-grow... FAILED - see > /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details > [root@localhost mdadm]# head -10 /var/tmp/07revert-grow.log | grep mdadm > + . /root/mdadm/tests/07revert-grow > *++ mdadm -CR --assume-clean /dev/md0 -l5 -n4 -x1 /dev/vda /dev/vdb > /dev/vdc /dev/vdd /dev/vda /dev/vdb /dev/vdc /dev/vdd --metadata=0.9** > * > The above line is clearly wrong from my understanding. > > And let's check ubuntu 22.04. 
> > root@vm:/home/gjiang/mdadm# lsblk|grep vd > vda 252:0 0 1G 0 disk > vdb 252:16 0 1G 0 disk > vdc 252:32 0 1G 0 disk > root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..d} > --tests=05r1-failfast > Testing on linux-5.15.0-30-generic kernel > /home/gjiang/mdadm/tests/05r1-failfast... succeeded > root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c} > --tests=07reshape5intr > Testing on linux-5.15.0-30-generic kernel > /home/gjiang/mdadm/tests/07reshape5intr... FAILED - see > /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details > root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c} > --tests=07revert-grow > Testing on linux-5.15.0-30-generic kernel > /home/gjiang/mdadm/tests/07revert-grow... FAILED - see > /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details > > So I would not consider it is regression. I definitely had those test working (at least some of the time) before I rebased on md-next or if I revert 7e6ba434cc6080. You might need to try my branch (plus that patch reverted) and my mdadm branch as there are a number of fixes that may have helped with that specific test. https://github.com/lsgunth/mdadm/ bugfixes2 https://github.com/sbates130272/linux-p2pmem md-bug Thanks, Logan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-24 16:13 ` Logan Gunthorpe @ 2022-05-25 9:04 ` Guoqing Jiang 2022-05-25 18:22 ` Logan Gunthorpe 0 siblings, 1 reply; 45+ messages in thread From: Guoqing Jiang @ 2022-05-25 9:04 UTC (permalink / raw) To: Logan Gunthorpe, Donald Buczek, Song Liu; +Cc: linux-raid On 5/25/22 12:13 AM, Logan Gunthorpe wrote: > On 2022-05-23 03:51, Guoqing Jiang wrote: >> I have tried with both ubuntu 22.04 kernel which is 5.15 and vanilla >> 5.12, none of them >> can pass your mentioned tests. >> >> [root@localhost mdadm]# lsblk|grep vd >> vda 252:0 0 1G 0 disk >> vdb 252:16 0 1G 0 disk >> vdc 252:32 0 1G 0 disk >> vdd 252:48 0 1G 0 disk >> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} >> --tests=05r1-add-internalbitmap >> Testing on linux-5.12.0-default kernel >> /root/mdadm/tests/05r1-add-internalbitmap... succeeded >> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} >> --tests=07reshape5intr >> Testing on linux-5.12.0-default kernel >> /root/mdadm/tests/07reshape5intr... FAILED - see >> /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details >> [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} >> --tests=07revert-grow >> Testing on linux-5.12.0-default kernel >> /root/mdadm/tests/07revert-grow... FAILED - see >> /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details >> [root@localhost mdadm]# head -10 /var/tmp/07revert-grow.log | grep mdadm >> + . /root/mdadm/tests/07revert-grow >> *++ mdadm -CR --assume-clean /dev/md0 -l5 -n4 -x1 /dev/vda /dev/vdb >> /dev/vdc /dev/vdd /dev/vda /dev/vdb /dev/vdc /dev/vdd --metadata=0.9** >> * >> The above line is clearly wrong from my understanding. >> >> And let's check ubuntu 22.04. 
>> >> root@vm:/home/gjiang/mdadm# lsblk|grep vd >> vda 252:0 0 1G 0 disk >> vdb 252:16 0 1G 0 disk >> vdc 252:32 0 1G 0 disk >> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..d} >> --tests=05r1-failfast >> Testing on linux-5.15.0-30-generic kernel >> /home/gjiang/mdadm/tests/05r1-failfast... succeeded >> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c} >> --tests=07reshape5intr >> Testing on linux-5.15.0-30-generic kernel >> /home/gjiang/mdadm/tests/07reshape5intr... FAILED - see >> /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for details >> root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c} >> --tests=07revert-grow >> Testing on linux-5.15.0-30-generic kernel >> /home/gjiang/mdadm/tests/07revert-grow... FAILED - see >> /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details >> >> So I would not consider it is regression. > I definitely had those test working (at least some of the time) before I > rebased on md-next or if I revert 7e6ba434cc6080. You might need to try > my branch (plus that patch reverted) and my mdadm branch as there are a > number of fixes that may have helped with that specific test. > > https://github.com/lsgunth/mdadm/ bugfixes2 > https://github.com/sbates130272/linux-p2pmem md-bug I would prefer to focus on block tree or md tree. With latest block tree (commit 44d8538d7e7dbee7246acda3b706c8134d15b9cb), I get below similar issue as Donald reported, it happened with the cmd (which did work with 5.12 kernel). vm79:~/mdadm> sudo ./test --dev=loop --tests=05r1-add-internalbitmap May 25 04:48:51 vm79 kernel: Call Trace: May 25 04:48:51 vm79 kernel: <TASK> May 25 04:48:51 vm79 kernel: bfq_bic_update_cgroup+0x28/0x1b0 May 25 04:48:51 vm79 kernel: bfq_insert_requests+0x29d/0x22d0 May 25 04:48:51 vm79 kernel: ? ioc_find_get_icq+0x21c/0x2a0 May 25 04:48:51 vm79 kernel: ? 
bfq_prepare_request+0x11/0x30 May 25 04:48:51 vm79 kernel: blk_mq_sched_insert_request+0x8b/0x100 May 25 04:48:51 vm79 kernel: blk_mq_submit_bio+0x44c/0x540 May 25 04:48:51 vm79 kernel: __submit_bio+0xe8/0x160 May 25 04:48:51 vm79 kernel: submit_bio_noacct_nocheck+0xf0/0x2b0 May 25 04:48:51 vm79 kernel: ? submit_bio+0x3e/0xd0 May 25 04:48:51 vm79 kernel: submit_bio+0x3e/0xd0 May 25 04:48:51 vm79 kernel: submit_bh_wbc+0x117/0x140 May 25 04:48:51 vm79 kernel: block_read_full_page+0x1eb/0x4f0 May 25 04:48:51 vm79 kernel: ? blkdev_llseek+0x60/0x60 May 25 04:48:51 vm79 kernel: ? folio_add_lru+0x51/0x80 May 25 04:48:51 vm79 kernel: do_read_cache_folio+0x3b4/0x5e0 May 25 04:48:51 vm79 kernel: ? kmem_cache_alloc_node+0x183/0x2e0 May 25 04:48:51 vm79 kernel: ? alloc_vmap_area+0x9f/0x8a0 May 25 04:48:51 vm79 kernel: read_cache_page+0x15/0x80 May 25 04:48:51 vm79 kernel: read_part_sector+0x38/0x140 May 25 04:48:51 vm79 kernel: read_lba+0x105/0x220 May 25 04:48:51 vm79 kernel: efi_partition+0xed/0x7f0 Thanks, Guoqing ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-25 9:04 ` Guoqing Jiang @ 2022-05-25 18:22 ` Logan Gunthorpe 2022-05-26 9:46 ` Jan Kara 2022-05-26 11:53 ` Jan Kara 0 siblings, 2 replies; 45+ messages in thread From: Logan Gunthorpe @ 2022-05-25 18:22 UTC (permalink / raw) To: Guoqing Jiang, Donald Buczek, Song Liu, Jan Kara, Jens Axboe, Paolo Valente Cc: linux-raid On 2022-05-25 03:04, Guoqing Jiang wrote: > I would prefer to focus on block tree or md tree. With latest block tree > (commit 44d8538d7e7dbee7246acda3b706c8134d15b9cb), I get below > similar issue as Donald reported, it happened with the cmd (which did > work with 5.12 kernel). > > vm79:~/mdadm> sudo ./test --dev=loop --tests=05r1-add-internalbitmap Ok, so this test passes for me, but my VM was not running with bfq. It also seems we have layers upon layers of different bugs to untangle. Perhaps you can try the tests with bfq disabled to make progress on the other regression I reported. If I enable bfq and set the loop devices to the bfq scheduler, then I hit the same bug as you and Donald. It's clearly a NULL pointer de-reference in the bfq code, which seems to be triggered on the partition read after mdadm opens a block device (not sure if it's the md device or the loop device but I suspect the latter seeing it's not going through any md code). Simplifying things down a bit, the null pointer dereference can be triggered by creating an md device with loop devices that have bfq scheduler set: mdadm --create --run /dev/md0 --level=1 -n2 /dev/loop0 /dev/loop1 The crash occurs in bfq_bio_bfqg() with blkg_to_bfqg() returning NULL. It's hard to trace where the NULL comes from in there -- the code is a bit complex. I've found that the bfq bug exists in current md-next (42b805af102) but did not trigger in the base tag of v5.18-rc3. 
Bisecting revealed the bug was introduced by: 4e54a2493e58 ("bfq: Get rid of __bio_blkcg() usage") Reverting that commit and the next commit (075a53b7) on top of md-next was confirmed to fix the bug. I've copied Jan, Jens and Paolo who can hopefully help with this. A cleaned up stack trace follows this email for their benefit. Logan -- BUG: KASAN: null-ptr-deref in bfq_bio_bfqg+0x65/0xf0 Read of size 1 at addr 0000000000000094 by task mdadm/850 CPU: 1 PID: 850 Comm: mdadm Not tainted 5.18.0-rc3-eid-vmlocalyes-dbg-00005-g42b805af1024 #2113 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x5a/0x74 kasan_report.cold+0x5f/0x1a9 __asan_load1+0x4d/0x50 bfq_bio_bfqg+0x65/0xf0 bfq_bic_update_cgroup+0x2f/0x340 bfq_insert_requests+0x568/0x5800 blk_mq_sched_insert_request+0x180/0x230 blk_mq_submit_bio+0x9f0/0xe50 __submit_bio+0xeb/0x100 submit_bio_noacct_nocheck+0x1fd/0x470 submit_bio_noacct+0x350/0xa80 submit_bio+0x84/0xf0 submit_bh_wbc+0x27a/0x2b0 block_read_full_page+0x578/0xb60 blkdev_readpage+0x18/0x20 do_read_cache_folio+0x290/0x430 read_cache_page+0x41/0x130 read_part_sector+0x7a/0x3d0 read_lba+0x161/0x340 efi_partition+0x1ce/0xdd0 bdev_disk_changed+0x2e9/0x6a0 blkdev_get_whole+0xd5/0x140 blkdev_get_by_dev.part.0+0x37f/0x570 blkdev_get_by_dev+0x51/0x60 blkdev_open+0xa4/0x140 do_dentry_open+0x2a7/0x6d0 vfs_open+0x58/0x60 path_openat+0x77e/0x13f0 do_filp_open+0x154/0x280 do_sys_openat2+0x119/0x2c0 __x64_sys_openat+0xe7/0x160 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -- bfq_bio_bfqg+0x65/0xf0: bfq_bio_bfqg at block/bfq-cgroup.c:619 614 struct blkcg_gq *blkg = bio->bi_blkg; 615 struct bfq_group *bfqg; 616 617 while (blkg) { 618 bfqg = blkg_to_bfqg(blkg); >619< if (bfqg->online) { 620 bio_associate_blkg_from_css(bio, 621 return bfqg; 622 } 623 blkg = blkg->parent; 624 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-25 18:22 ` Logan Gunthorpe @ 2022-05-26 9:46 ` Jan Kara 2022-05-26 11:53 ` Jan Kara 1 sibling, 0 replies; 45+ messages in thread From: Jan Kara @ 2022-05-26 9:46 UTC (permalink / raw) To: Logan Gunthorpe Cc: Guoqing Jiang, Donald Buczek, Song Liu, Jan Kara, Jens Axboe, Paolo Valente, linux-raid On Wed 25-05-22 12:22:06, Logan Gunthorpe wrote: > > > On 2022-05-25 03:04, Guoqing Jiang wrote: > > I would prefer to focus on block tree or md tree. With latest block tree > > (commit 44d8538d7e7dbee7246acda3b706c8134d15b9cb), I get below > > similar issue as Donald reported, it happened with the cmd (which did > > work with 5.12 kernel). > > > > vm79:~/mdadm> sudo ./test --dev=loop --tests=05r1-add-internalbitmap > > Ok, so this test passes for me, but my VM was not running with bfq. It > also seems we have layers upon layers of different bugs to untangle. > Perhaps you can try the tests with bfq disabled to make progress on the > other regression I reported. > > If I enable bfq and set the loop devices to the bfq scheduler, then I > hit the same bug as you and Donald. It's clearly a NULL pointer > de-reference in the bfq code, which seems to be triggered on the > partition read after mdadm opens a block device (not sure if it's the md > device or the loop device but I suspect the latter seeing it's not going > through any md code). > > Simplifying things down a bit, the null pointer dereference can be > triggered by creating an md device with loop devices that have bfq > scheduler set: > > mdadm --create --run /dev/md0 --level=1 -n2 /dev/loop0 /dev/loop1 > > The crash occurs in bfq_bio_bfqg() with blkg_to_bfqg() returning NULL. > It's hard to trace where the NULL comes from in there -- the code is a > bit complex. > > I've found that the bfq bug exists in current md-next (42b805af102) but > did not trigger in the base tag of v5.18-rc3. 
Bisecting revealed the bug > was introduced by: > > 4e54a2493e58 ("bfq: Get rid of __bio_blkcg() usage") > > Reverting that commit and the next commit (075a53b7) on top of md-next > was confirmed to fix the bug. > > I've copied Jan, Jens and Paolo who can hopefully help with this. A > cleaned up stack trace follows this email for their benefit. Indeed. Thanks for easy reproduction steps. I've reproduced the issue locally and I'm looking into it because from a quick look into the code this shouldn't be possible which means I'm missing something obvious :). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-25 18:22 ` Logan Gunthorpe 2022-05-26 9:46 ` Jan Kara @ 2022-05-26 11:53 ` Jan Kara 2022-05-31 6:11 ` Christoph Hellwig 1 sibling, 1 reply; 45+ messages in thread From: Jan Kara @ 2022-05-26 11:53 UTC (permalink / raw) To: Logan Gunthorpe Cc: Guoqing Jiang, Donald Buczek, Song Liu, Jan Kara, Jens Axboe, Paolo Valente, linux-raid, Tejun Heo, Christoph Hellwig, linux-block [Added couple of CCs since this seems to be an issue in the generic block layer] On Wed 25-05-22 12:22:06, Logan Gunthorpe wrote: > On 2022-05-25 03:04, Guoqing Jiang wrote: > > I would prefer to focus on block tree or md tree. With latest block tree > > (commit 44d8538d7e7dbee7246acda3b706c8134d15b9cb), I get below > > similar issue as Donald reported, it happened with the cmd (which did > > work with 5.12 kernel). > > > > vm79:~/mdadm> sudo ./test --dev=loop --tests=05r1-add-internalbitmap > > Ok, so this test passes for me, but my VM was not running with bfq. It > also seems we have layers upon layers of different bugs to untangle. > Perhaps you can try the tests with bfq disabled to make progress on the > other regression I reported. > > If I enable bfq and set the loop devices to the bfq scheduler, then I > hit the same bug as you and Donald. It's clearly a NULL pointer > de-reference in the bfq code, which seems to be triggered on the > partition read after mdadm opens a block device (not sure if it's the md > device or the loop device but I suspect the latter seeing it's not going > through any md code). > > Simplifying things down a bit, the null pointer dereference can be > triggered by creating an md device with loop devices that have bfq > scheduler set: > > mdadm --create --run /dev/md0 --level=1 -n2 /dev/loop0 /dev/loop1 > > The crash occurs in bfq_bio_bfqg() with blkg_to_bfqg() returning NULL. > It's hard to trace where the NULL comes from in there -- the code is a > bit complex. 
> > I've found that the bfq bug exists in current md-next (42b805af102) but > did not trigger in the base tag of v5.18-rc3. Bisecting revealed the bug > was introduced by: > > 4e54a2493e58 ("bfq: Get rid of __bio_blkcg() usage") > > Reverting that commit and the next commit (075a53b7) on top of md-next > was confirmed to fix the bug. > > I've copied Jan, Jens and Paolo who can hopefully help with this. A > cleaned up stack trace follows this email for their benefit. So I've debugged this. The crash happens on the very first bio submitted to the md0 device. The problem is that this bio gets remapped to loop0 - this happens through bio_alloc_clone() -> __bio_clone() which ends up calling bio_clone_blkg_association(). Now the resulting bio is inconsistent - it's dst_bio->bi_bdev is pointing to loop0 while dst_bio->bi_blkg is pointing to blkcg_gq associated with md0 request queue. And this breaks BFQ because when this bio is inserted to loop0 request queue, BFQ looks at bio->bi_blkg->q (it is a bit more complex than that but this is the gist of the problem), expects its data there but BFQ is not initialized for md0 request_queue. Now I think this is a bug in __bio_clone() but the inconsistency in the bio is very much what we asked bio_clone_blkg_association() to do so maybe I'm missing something and bios that are associated with one bdev but pointing to blkg of another bdev are fine and controllers are supposed to handle that (although I'm not sure how should they do that). So I'm asking here before I just go and delete bio_clone_blkg_association() from __bio_clone()... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-26 11:53 ` Jan Kara @ 2022-05-31 6:11 ` Christoph Hellwig 2022-05-31 7:43 ` Jan Kara 0 siblings, 1 reply; 45+ messages in thread From: Christoph Hellwig @ 2022-05-31 6:11 UTC (permalink / raw) To: Jan Kara Cc: Logan Gunthorpe, Guoqing Jiang, Donald Buczek, Song Liu, Jens Axboe, Paolo Valente, linux-raid, Tejun Heo, Christoph Hellwig, linux-block On Thu, May 26, 2022 at 01:53:36PM +0200, Jan Kara wrote: > So I've debugged this. The crash happens on the very first bio submitted to > the md0 device. The problem is that this bio gets remapped to loop0 - this > happens through bio_alloc_clone() -> __bio_clone() which ends up calling > bio_clone_blkg_association(). Now the resulting bio is inconsistent - it's > dst_bio->bi_bdev is pointing to loop0 while dst_bio->bi_blkg is pointing to > blkcg_gq associated with md0 request queue. And this breaks BFQ because > when this bio is inserted to loop0 request queue, BFQ looks at > bio->bi_blkg->q (it is a bit more complex than that but this is the gist > of the problem), expects its data there but BFQ is not initialized for md0 > request_queue. > > Now I think this is a bug in __bio_clone() but the inconsistency in the bio > is very much what we asked bio_clone_blkg_association() to do so maybe I'm > missing something and bios that are associated with one bdev but pointing > to blkg of another bdev are fine and controllers are supposed to handle > that (although I'm not sure how should they do that). So I'm asking here > before I just go and delete bio_clone_blkg_association() from > __bio_clone()... This behavior probably goes back to my commit here: commit d92c370a16cbe0276954c761b874bd024a7e4fac Author: Christoph Hellwig <hch@lst.de> Date: Sat Jun 27 09:31:48 2020 +0200 block: really clone the block cgroup in bio_clone_blkg_association and it seems everyone else was fine with that behavior so far.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-31 6:11 ` Christoph Hellwig @ 2022-05-31 7:43 ` Jan Kara 0 siblings, 0 replies; 45+ messages in thread From: Jan Kara @ 2022-05-31 7:43 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Logan Gunthorpe, Guoqing Jiang, Donald Buczek, Song Liu, Jens Axboe, Paolo Valente, linux-raid, Tejun Heo, linux-block On Mon 30-05-22 23:11:47, Christoph Hellwig wrote: > On Thu, May 26, 2022 at 01:53:36PM +0200, Jan Kara wrote: > > So I've debugged this. The crash happens on the very first bio submitted to > > the md0 device. The problem is that this bio gets remapped to loop0 - this > > happens through bio_alloc_clone() -> __bio_clone() which ends up calling > > bio_clone_blkg_association(). Now the resulting bio is inconsistent - it's > > dst_bio->bi_bdev is pointing to loop0 while dst_bio->bi_blkg is pointing to > > blkcg_gq associated with md0 request queue. And this breaks BFQ because > > when this bio is inserted to loop0 request queue, BFQ looks at > > bio->bi_blkg->q (it is a bit more complex than that but this is the gist > > of the problem), expects its data there but BFQ is not initialized for md0 > > request_queue. > > > > Now I think this is a bug in __bio_clone() but the inconsistency in the bio > > is very much what we asked bio_clone_blkg_association() to do so maybe I'm > > missing something and bios that are associated with one bdev but pointing > > to blkg of another bdev are fine and controllers are supposed to handle > > that (although I'm not sure how should they do that). So I'm asking here > > before I just go and delete bio_clone_blkg_association() from > > __bio_clone()... 
> > This behavior probably goes back to my commit here: > > ommit d92c370a16cbe0276954c761b874bd024a7e4fac > Author: Christoph Hellwig <hch@lst.de> > Date: Sat Jun 27 09:31:48 2020 +0200 > > block: really clone the block cgroup in bio_clone_blkg_association > > and it seems everyone else was fine with that behavior so far. Yes, thanks for the pointer. I agree nobody else was crashing due to this but it will be causing interesting inconsistencies in accounting and throttling in all the IO controllers - e.g. usually both original and cloned bio will get accounted to md0 device and loop0 device settings will be completely ignored. From my experience these things do not get really tested too much until some customer tries to use them :). So I think we have to effectively revert your above change. I'll send a patch. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-23 9:51 ` Guoqing Jiang 2022-05-24 16:13 ` Logan Gunthorpe @ 2022-05-30 9:55 ` Guoqing Jiang 2022-05-30 16:35 ` Logan Gunthorpe 1 sibling, 1 reply; 45+ messages in thread From: Guoqing Jiang @ 2022-05-30 9:55 UTC (permalink / raw) To: Donald Buczek, Logan Gunthorpe, Song Liu; +Cc: linux-raid On 5/23/22 5:51 PM, Guoqing Jiang wrote: > > >>>>> I noticed a clear regression with mdadm tests with this patch in >>>>> md-next >>>>> (7e6ba434cc6080). >>>>> >>>>> Before the patch, tests 07reshape5intr and 07revert-grow would fail >>>>> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 >>>>> runs >>>>> for the latter). >>>>> >>>>> After this patch, both tests always fail. >>>>> >>>>> I don't have time to dig into why this is, but it would be nice if >>>>> someone can at least fix the regression. It is hard to make any >>>>> progress >>>>> on these tests if we are continuing to further break them. > > I have tried with both ubuntu 22.04 kernel which is 5.15 and vanilla > 5.12, none of them > can pass your mentioned tests. > > [root@localhost mdadm]# lsblk|grep vd > vda 252:0 0 1G 0 disk > vdb 252:16 0 1G 0 disk > vdc 252:32 0 1G 0 disk > vdd 252:48 0 1G 0 disk > [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} > --tests=05r1-add-internalbitmap > Testing on linux-5.12.0-default kernel > /root/mdadm/tests/05r1-add-internalbitmap... succeeded > [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} > --tests=07reshape5intr > Testing on linux-5.12.0-default kernel > /root/mdadm/tests/07reshape5intr... FAILED - see > /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for > details > [root@localhost mdadm]# ./test --dev=disk --disks=/dev/vd{a..d} > --tests=07revert-grow > Testing on linux-5.12.0-default kernel > /root/mdadm/tests/07revert-grow... 
FAILED - see > /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details > [root@localhost mdadm]# head -10 /var/tmp/07revert-grow.log | grep mdadm > + . /root/mdadm/tests/07revert-grow > *++ mdadm -CR --assume-clean /dev/md0 -l5 -n4 -x1 /dev/vda /dev/vdb > /dev/vdc /dev/vdd /dev/vda /dev/vdb /dev/vdc /dev/vdd --metadata=0.9** > * > The above line is clearly wrong from my understanding. > > And let's check ubuntu 22.04. > > root@vm:/home/gjiang/mdadm# lsblk|grep vd > vda 252:0 0 1G 0 disk > vdb 252:16 0 1G 0 disk > vdc 252:32 0 1G 0 disk > root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..d} > --tests=05r1-failfast > Testing on linux-5.15.0-30-generic kernel > /home/gjiang/mdadm/tests/05r1-failfast... succeeded > root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c} > --tests=07reshape5intr > Testing on linux-5.15.0-30-generic kernel > /home/gjiang/mdadm/tests/07reshape5intr... FAILED - see > /var/tmp/07reshape5intr.log and /var/tmp/fail07reshape5intr.log for > details > root@vm:/home/gjiang/mdadm# ./test --dev=disk --disks=/dev/vd{a..c} > --tests=07revert-grow > Testing on linux-5.15.0-30-generic kernel > /home/gjiang/mdadm/tests/07revert-grow... FAILED - see > /var/tmp/07revert-grow.log and /var/tmp/fail07revert-grow.log for details > > So I would not consider it is regression. I tried with 5.18.0-rc3, no problem for 07reshape5intr (will investigate why it failed with this patch), but 07revert-grow still failed without apply this one. From fail07revert-grow.log, it shows below issues. 
[ 7856.233515] mdadm[25246]: segfault at 0 ip 000000000040fe56 sp 00007ffdcf252800 error 4 in mdadm[400000+81000] [ 7856.233544] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48 [ 7866.871747] mdadm[25463]: segfault at 0 ip 000000000040fe56 sp 00007ffe91e39800 error 4 in mdadm[400000+81000] [ 7866.871760] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48 [ 7876.779855] ====================================================== [ 7876.779858] WARNING: possible circular locking dependency detected [ 7876.779861] 5.18.0-rc3-57-default #28 Tainted: G E [ 7876.779864] ------------------------------------------------------ [ 7876.779867] mdadm/25444 is trying to acquire lock: [ 7876.779870] ffff991817749938 ((wq_completion)md_misc){+.+.}-{0:0}, at: flush_workqueue+0x87/0x470 [ 7876.779879] but task is already holding lock: [ 7876.779882] ffff9917c5c1c2c0 (&mddev->reconfig_mutex){+.+.}-{3:3}, at: action_store+0x11a/0x2c0 [md_mod] [ 7876.779892] which lock already depends on the new lock. 
[ 7876.779896] the existing dependency chain (in reverse order) is: [ 7876.779899] -> #3 (&mddev->reconfig_mutex){+.+.}-{3:3}: [ 7876.779904] __mutex_lock+0x8f/0x920 [ 7876.779909] layout_store+0x47/0x120 [md_mod] [ 7876.779914] md_attr_store+0x7a/0xc0 [md_mod] [ 7876.779919] kernfs_fop_write_iter+0x135/0x1b0 [ 7876.779924] new_sync_write+0x10c/0x190 [ 7876.779927] vfs_write+0x30e/0x370 [ 7876.779930] ksys_write+0xa4/0xe0 [ 7876.779933] do_syscall_64+0x3a/0x80 [ 7876.779936] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 7876.779940] -> #2 (kn->active#359){++++}-{0:0}: [ 7876.779945] __kernfs_remove+0x28c/0x2e0 [ 7876.779948] kernfs_remove_by_name_ns+0x52/0x90 [ 7876.779952] remove_files.isra.1+0x30/0x70 [ 7876.779955] sysfs_remove_group+0x3d/0x80 [ 7876.779958] sysfs_remove_groups+0x29/0x40 [ 7876.779962] __kobject_del+0x1b/0x80 [ 7876.779965] kobject_del+0xf/0x20 [ 7876.779968] mddev_delayed_delete+0x15/0x20 [md_mod] [ 7876.779973] process_one_work+0x2d8/0x650 [ 7876.779976] worker_thread+0x2a/0x3b0 [ 7876.779979] kthread+0xe8/0x110 [ 7876.779981] ret_from_fork+0x22/0x30 [ 7876.779985] -> #1 ((work_completion)(&mddev->del_work)#2){+.+.}-{0:0}: [ 7876.779990] process_one_work+0x2af/0x650 [ 7876.779993] worker_thread+0x2a/0x3b0 [ 7876.779996] kthread+0xe8/0x110 [ 7876.779998] ret_from_fork+0x22/0x30 [ 7876.780001] -> #0 ((wq_completion)md_misc){+.+.}-{0:0}: [ 7876.780005] __lock_acquire+0x12a0/0x1770 [ 7876.780009] lock_acquire+0x277/0x310 [ 7876.780011] flush_workqueue+0xae/0x470 [ 7876.780014] action_store+0x188/0x2c0 [md_mod] [ 7876.780019] md_attr_store+0x7a/0xc0 [md_mod] [ 7876.780024] kernfs_fop_write_iter+0x135/0x1b0 [ 7876.780027] new_sync_write+0x10c/0x190 [ 7876.780030] vfs_write+0x30e/0x370 [ 7876.780033] ksys_write+0xa4/0xe0 [ 7876.780036] do_syscall_64+0x3a/0x80 [ 7876.780038] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 7876.780042] other info that might help us debug this: [ 7876.780045] Chain exists of: (wq_completion)md_misc --> kn->active#359 --> 
&mddev->reconfig_mutex

[ 7876.780052]  Possible unsafe locking scenario:
[ 7876.780054]        CPU0                    CPU1
[ 7876.780056]        ----                    ----
[ 7876.780059]   lock(&mddev->reconfig_mutex);
[ 7876.780061]                                lock(kn->active#359);
[ 7876.780064]                                lock(&mddev->reconfig_mutex);
[ 7876.780068]   lock((wq_completion)md_misc);
[ 7876.780070]  *** DEADLOCK ***
[ 7876.780073] 5 locks held by mdadm/25444:
[ 7876.780075]  #0: ffff9917c769a458 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0xa4/0xe0
[ 7876.780082]  #1: ffff9917de1a6518 (&of->prealloc_mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x5c/0x1b0
[ 7876.780088]  #2: ffff9917de1a6488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
[ 7876.780094]  #3: ffff99181b0d2158 (kn->active#287){++++}-{0:0}, at: kernfs_fop_write_iter+0x10c/0x1b0
[ 7876.780100]  #4: ffff9917c5c1c2c0 (&mddev->reconfig_mutex){+.+.}-{3:3}, at: action_store+0x11a/0x2c0 [md_mod]
[ 7876.780108] stack backtrace:
[ 7876.780111] CPU: 1 PID: 25444 Comm: mdadm Tainted: G            E     5.18.0-rc3-57-default #28
[ 7876.780115] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[ 7876.780120] Call Trace:
[ 7876.780122]  <TASK>
[ 7876.780124]  dump_stack_lvl+0x55/0x6d
[ 7876.780128]  check_noncircular+0x105/0x120
[ 7876.780133]  ? __lock_acquire+0x12a0/0x1770
[ 7876.780136]  ? lockdep_lock+0x21/0x90
[ 7876.780139]  __lock_acquire+0x12a0/0x1770
[ 7876.780143]  lock_acquire+0x277/0x310
[ 7876.780146]  ? flush_workqueue+0x87/0x470
[ 7876.780149]  ? __raw_spin_lock_init+0x3b/0x60
[ 7876.780152]  ? lockdep_init_map_type+0x58/0x250
[ 7876.780156]  flush_workqueue+0xae/0x470
[ 7876.780158]  ? flush_workqueue+0x87/0x470
[ 7876.780161]  ? _raw_spin_unlock+0x29/0x40
[ 7876.780166]  ? action_store+0x188/0x2c0 [md_mod]
[ 7876.780171]  action_store+0x188/0x2c0 [md_mod]
[ 7876.780176]  md_attr_store+0x7a/0xc0 [md_mod]
[ 7876.780182]  kernfs_fop_write_iter+0x135/0x1b0
[ 7876.780186]  new_sync_write+0x10c/0x190
[ 7876.780190]  vfs_write+0x30e/0x370
[ 7876.780193]  ksys_write+0xa4/0xe0
[ 7876.780197]  do_syscall_64+0x3a/0x80
[ 7876.780200]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7876.780203] RIP: 0033:0x7f7cd43f9c03
[ 7876.780206] Code: 2d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 f3 c3 0f 1f 00 41 54 55 49 89 d4 53 48 89
[ 7876.780213] RSP: 002b:00007fff35b600b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 7876.780218] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f7cd43f9c03
[ 7876.780221] RDX: 0000000000000004 RSI: 00000000004669a7 RDI: 0000000000000003
[ 7876.780224] RBP: 00000000004669a7 R08: 000000000000000b R09: 00000000ffffffff
[ 7876.780227] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000183d730
[ 7876.780231] R13: 00000000ffffffff R14: 0000000000001800 R15: 0000000000000000
[ 7876.780236]  </TASK>
[ 7876.781007] md: reshape of RAID array md0
[ 7876.785949] md: md0: reshape interrupted.
[ 7877.193686] md0: detected capacity change from 107520 to 0
[ 7877.193694] md: md0 stopped.
[ 7877.314438] debugfs: Directory 'md0' with parent 'block' already present!
[ 7877.649549] md: array md0 already has disks!
[ 7877.665188] md/raid:md0: device loop0 operational as raid disk 0
[ 7877.665193] md/raid:md0: device loop3 operational as raid disk 3
[ 7877.665197] md/raid:md0: device loop2 operational as raid disk 2
[ 7877.665199] md/raid:md0: device loop1 operational as raid disk 1
[ 7877.665202] md/raid:md0: device loop4 operational as raid disk 4
[ 7877.665205] md/raid:md0: force stripe size 512 for reshape
[ 7877.667172] md/raid:md0: raid level 5 active with 4 out of 4 devices, algorithm 2
[ 7877.667360] md0: detected capacity change from 0 to 107520
[ 7878.107865] mdadm[25693]: segfault at 0 ip 000000000040fe56 sp 00007fff7a845d50 error 4 in mdadm[400000+81000]
[ 7878.107878] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
[ 7878.428298] md: reshape of RAID array md0
[ 7880.552543] md: md0: reshape done.
[ 7882.053117] md0: detected capacity change from 107520 to 0
[ 7882.053124] md: md0 stopped.
[ ... ]
[ 7904.978417] md0: detected capacity change from 0 to 107520
[ 7905.346749] mdadm[26328]: segfault at 0 ip 000000000040fe56 sp 00007ffc28540c20 error 4 in mdadm[400000+81000]
[ 7905.346764] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
[ 7905.476488] md: reshape of RAID array md0
[ 7907.530152] md: md0: reshape done.
[ 7909.149524] md0: detected capacity change from 107520 to 0
[ 7909.149533] md: md0 stopped.

Thanks,
Guoqing
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-30  9:55 ` Guoqing Jiang
@ 2022-05-30 16:35 ` Logan Gunthorpe
  2022-05-31  8:13 ` Guoqing Jiang
  0 siblings, 1 reply; 45+ messages in thread
From: Logan Gunthorpe @ 2022-05-30 16:35 UTC (permalink / raw)
To: Guoqing Jiang, Donald Buczek, Song Liu; +Cc: linux-raid

On 2022-05-30 03:55, Guoqing Jiang wrote:
> I tried with 5.18.0-rc3, no problem for 07reshape5intr (will investigate
> why it failed with this patch), but 07revert-grow still failed without
> applying this one.
>
> From fail07revert-grow.log, it shows the issues below.
>
> [ 7856.233515] mdadm[25246]: segfault at 0 ip 000000000040fe56 sp 00007ffdcf252800 error 4 in mdadm[400000+81000]
> [ 7856.233544] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
>
> [ 7866.871747] mdadm[25463]: segfault at 0 ip 000000000040fe56 sp 00007ffe91e39800 error 4 in mdadm[400000+81000]
> [ 7866.871760] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
>
> [ 7876.779855] ======================================================
> [ 7876.779858] WARNING: possible circular locking dependency detected
> [ 7876.779861] 5.18.0-rc3-57-default #28 Tainted: G            E
> [ 7876.779864] ------------------------------------------------------
> [ 7876.779867] mdadm/25444 is trying to acquire lock:
> [ 7876.779870] ffff991817749938 ((wq_completion)md_misc){+.+.}-{0:0}, at: flush_workqueue+0x87/0x470
> [ 7876.779879]
> but task is already holding lock:
> [ 7876.779882] ffff9917c5c1c2c0 (&mddev->reconfig_mutex){+.+.}-{3:3}, at: action_store+0x11a/0x2c0 [md_mod]
> [ 7876.779892]
> which lock already depends on the new lock.
>

Hmm, strange. I'm definitely running with lockdep, and even if I try the
test on my machine on v5.18-rc3, I don't get this error. Not sure why.

In any case, it looks like we recently added a flush_workqueue(md_misc_wq)
call in action_store() which runs with the mddev_lock() held. According
to your lockdep warning, that can deadlock. That call was added in this
commit:

Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev")

Can we maybe run flush_workqueue() before we take mddev_lock()?

Logan
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-30 16:35 ` Logan Gunthorpe
@ 2022-05-31  8:13 ` Guoqing Jiang
  0 siblings, 0 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-31 8:13 UTC (permalink / raw)
To: Logan Gunthorpe, Donald Buczek, Song Liu; +Cc: linux-raid

On 5/31/22 12:35 AM, Logan Gunthorpe wrote:
> On 2022-05-30 03:55, Guoqing Jiang wrote:
>> I tried with 5.18.0-rc3, no problem for 07reshape5intr (will investigate
>> why it failed with this patch), but 07revert-grow still failed without
>> applying this one.
>>
>> From fail07revert-grow.log, it shows the issues below.
>>
>> [ 7856.233515] mdadm[25246]: segfault at 0 ip 000000000040fe56 sp 00007ffdcf252800 error 4 in mdadm[400000+81000]
>> [ 7856.233544] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
>>
>> [ 7866.871747] mdadm[25463]: segfault at 0 ip 000000000040fe56 sp 00007ffe91e39800 error 4 in mdadm[400000+81000]
>> [ 7866.871760] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
>>
>> [ 7876.779855] ======================================================
>> [ 7876.779858] WARNING: possible circular locking dependency detected
>> [ 7876.779861] 5.18.0-rc3-57-default #28 Tainted: G            E
>> [ 7876.779864] ------------------------------------------------------
>> [ 7876.779867] mdadm/25444 is trying to acquire lock:
>> [ 7876.779870] ffff991817749938 ((wq_completion)md_misc){+.+.}-{0:0}, at: flush_workqueue+0x87/0x470
>> [ 7876.779879]
>> but task is already holding lock:
>> [ 7876.779882] ffff9917c5c1c2c0 (&mddev->reconfig_mutex){+.+.}-{3:3}, at: action_store+0x11a/0x2c0 [md_mod]
>> [ 7876.779892]
>> which lock already depends on the new lock.
>>
> Hmm, strange. I'm definitely running with lockdep and even if I try the
> test on my machine, on v5.18-rc3, I don't get this error. Not sure why.
>
> In any case it looks like we recently added a
> flush_workqueue(md_misc_wq) call in action_store() which runs with the
> mddev_lock() held. According to your lockdep warning, that can deadlock.

It was originally added by f851b60db if I am not mistaken.

> That call was added in this commit:
>
> Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev")

That Fixes commit didn't add it. cc1ffe61c026 was added to avoid other
lockdep warnings; IIRC it just added a work_pending() check before the
flush.

> Can we maybe run flush_workqueue() before we take mddev_lock()?

Currently I am not sure; I need to investigate and test. Anyway, it is on
my todo list unless someone beats me to it 😉.

Thanks,
Guoqing
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-23  5:41 ` Donald Buczek
  2022-05-23  9:51   ` Guoqing Jiang
@ 2022-05-24 15:58   ` Logan Gunthorpe
  2022-05-24 18:16     ` Song Liu
  2022-05-25  9:17   ` Guoqing Jiang
  2 siblings, 1 reply; 45+ messages in thread
From: Logan Gunthorpe @ 2022-05-24 15:58 UTC (permalink / raw)
To: Donald Buczek, Guoqing Jiang, Song Liu; +Cc: linux-raid

On 2022-05-22 23:41, Donald Buczek wrote:
>> Looks like bfq or block issue, will try it from my side.
>
> FYI: I've used loop devices on a virtio disk.
>
> I later discovered Logan's patches [1], which I was not aware of, as I'm
> not subscribed to the lists.
>
> [1]: https://lore.kernel.org/linux-raid/20220519191311.17119-6-logang@deltatee.com/T/#u
>
> The series seems to acknowledge that there are open problems and tries
> to fix them. So I've used his md-bug branch from
> https://github.com/sbates130272/linux-p2pmem but it didn't look better.
>
> So I understand, the mdadm tests *are* supposed to work and every bug I
> see here is worth analyzing? Or is Logan hunting down everything anyway?

I'm not hunting down everything; there's too much brokenness. I've done a
bunch of work: there's that series plus an mdadm branch I'll send to the
list later (just linked it in my previous email). But even after all that,
I still have ~25 broken tests. I've marked those tests, though, and they
shouldn't stop the test script from running everything.

Logan
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-24 15:58 ` Logan Gunthorpe
@ 2022-05-24 18:16 ` Song Liu
  0 siblings, 0 replies; 45+ messages in thread
From: Song Liu @ 2022-05-24 18:16 UTC (permalink / raw)
To: Logan Gunthorpe; +Cc: Donald Buczek, Guoqing Jiang, linux-raid

On Tue, May 24, 2022 at 8:58 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> On 2022-05-22 23:41, Donald Buczek wrote:
> >> Looks like bfq or block issue, will try it from my side.
> >
> > FYI: I've used loop devices on a virtio disk.
> >
> > I later discovered Logan's patches [1], which I was not aware of, as
> > I'm not subscribed to the lists.
> >
> > [1]: https://lore.kernel.org/linux-raid/20220519191311.17119-6-logang@deltatee.com/T/#u
> >
> > The series seems to acknowledge that there are open problems and tries
> > to fix them. So I've used his md-bug branch from
> > https://github.com/sbates130272/linux-p2pmem but it didn't look better.
> >
> > So I understand, the mdadm tests *are* supposed to work and every bug
> > I see here is worth analyzing? Or is Logan hunting down everything
> > anyway?
>
> I'm not hunting down everything. There's too much brokenness. I've done
> a bunch of work: there's that series plus an mdadm branch I'll send to
> the list later (just linked it in my previous email). But even after all
> that, I still have ~25 broken tests, but I've marked those tests and
> they shouldn't stop the test script from running everything.

Thanks for your fixes so far. I will work with Jes to look into the
remaining failures and develop a plan to fix them.

Song
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held 2022-05-23 5:41 ` Donald Buczek 2022-05-23 9:51 ` Guoqing Jiang 2022-05-24 15:58 ` Logan Gunthorpe @ 2022-05-25 9:17 ` Guoqing Jiang 2 siblings, 0 replies; 45+ messages in thread From: Guoqing Jiang @ 2022-05-25 9:17 UTC (permalink / raw) To: Donald Buczek, Logan Gunthorpe, Song Liu; +Cc: linux-raid On 5/23/22 1:41 PM, Donald Buczek wrote: >>> >>> [ 57.434064] md: md0 stopped. >>> [ 57.586951] md0: detected capacity change from 0 to 107520 >>> [ 57.618454] BUG: kernel NULL pointer dereference, address: >>> 0000000000000094 >>> [ 57.620830] #PF: supervisor read access in kernel mode >>> [ 57.622554] #PF: error_code(0x0000) - not-present page >>> [ 57.624273] PGD 800000010d5ee067 P4D 800000010d5ee067 PUD >>> 10df28067 PMD 0 >>> [ 57.626548] Oops: 0000 [#1] PREEMPT SMP PTI >>> [ 57.627942] CPU: 3 PID: 1064 Comm: mkfs.ext3 Not tainted >>> 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77 >>> [ 57.630952] Hardware name: QEMU Standard PC (i440FX + PIIX, >>> 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 >>> [ 57.635927] RIP: 0010:bfq_bio_bfqg+0x26/0x80 >>> [ 57.638027] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b >>> 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b >>> 5c c2 08 <80> bb 94 00 00 00 00 70 >>> [ 57.645295] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006 >>> [ 57.647414] RAX: 0000000000000018 RBX: 0000000000000000 RCX: >>> 0000000000000001 >>> [ 57.650039] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: >>> ffff8881032ba180 >>> [ 57.652541] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: >>> ffff88810318cb00 >>> [ 57.654852] R10: 0000000000000000 R11: ffff8881032ba180 R12: >>> ffff88810318cae0 >>> [ 57.657128] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: >>> ffffc90001c27c00 >>> [ 57.659316] FS: 00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) >>> knlGS:0000000000000000 >>> [ 57.661700] CS: 0010 DS: 0000 ES: 0000 CR0: 
0000000080050033 >>> [ 57.663461] CR2: 0000000000000094 CR3: 000000010d438002 CR4: >>> 0000000000170ee0 >>> [ 57.665453] Call Trace: >>> [ 57.666479] <TASK> >>> [ 57.667382] bfq_bic_update_cgroup+0x28/0x1b0 >>> [ 57.668724] bfq_insert_requests+0x233/0x2340 >>> [ 57.670049] ? ioc_find_get_icq+0x21c/0x2a0 >>> [ 57.671315] ? bfq_prepare_request+0x11/0x30 >>> [ 57.672565] blk_mq_sched_insert_requests+0x5c/0x150 >>> [ 57.673891] blk_mq_flush_plug_list+0xe1/0x2a0 >>> [ 57.675140] __blk_flush_plug+0xdf/0x120 >>> [ 57.676259] io_schedule_prepare+0x3d/0x50 >>> [ 57.677373] io_schedule_timeout+0xf/0x40 >>> [ 57.678465] wait_for_completion_io+0x78/0x140 >>> [ 57.679578] submit_bio_wait+0x5b/0x80 >>> [ 57.680575] blkdev_issue_discard+0x65/0xb0 >>> [ 57.681640] blkdev_common_ioctl+0x391/0x8f0 >>> [ 57.682712] blkdev_ioctl+0x216/0x2a0 >>> [ 57.683648] __x64_sys_ioctl+0x76/0xb0 >>> [ 57.684607] do_syscall_64+0x42/0x90 >>> [ 57.685527] entry_SYSCALL_64_after_hwframe+0x44/0xae >>> [ 57.686645] RIP: 0033:0x7fdfce56dc17 >>> [ 57.687535] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f >>> 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 >>> 00 0f 05 <48> 3d 01 f0 ff ff 73 08 >>> [ 57.691055] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: >>> 0000000000000010 >>> [ 57.692537] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: >>> 00007fdfce56dc17 >>> [ 57.693905] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: >>> 0000000000000003 >>> [ 57.695288] RBP: 0000000000460960 R08: 0000000000000400 R09: >>> 0000000000000000 >>> [ 57.696645] R10: 0000000000000000 R11: 0000000000000246 R12: >>> 0000000000000000 >>> [ 57.697954] R13: 000000000000d200 R14: 0000000000000000 R15: >>> 0000000000000000 >>> [ 57.699281] </TASK> >>> [ 57.699901] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q >>> garp stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm >>> drm_kms_helper kvm drm fb_sys_fops vi4 >>> [ 57.705955] CR2: 0000000000000094 >>> [ 57.706710] ---[ end 
trace 0000000000000000 ]--- >>> [ 57.707599] RIP: 0010:bfq_bio_bfqg+0x26/0x80 >>> [ 57.708434] Code: 00 0f 1f 00 0f 1f 44 00 00 55 53 48 89 fd 48 8b >>> 56 48 48 89 f7 48 85 d2 74 32 48 63 05 53 54 1c 01 48 83 c0 16 48 8b >>> 5c c2 08 <80> bb 94 00 00 00 00 70 >>> [ 57.711426] RSP: 0018:ffffc90001c27b38 EFLAGS: 00010006 >>> [ 57.712391] RAX: 0000000000000018 RBX: 0000000000000000 RCX: >>> 0000000000000001 >>> [ 57.713605] RDX: ffff888109297800 RSI: ffff8881032ba180 RDI: >>> ffff8881032ba180 >>> [ 57.714811] RBP: ffff888102177800 R08: ffff88810c9004c8 R09: >>> ffff88810318cb00 >>> [ 57.716018] R10: 0000000000000000 R11: ffff8881032ba180 R12: >>> ffff88810318cae0 >>> [ 57.717236] R13: ffff888102177800 R14: ffffc90001c27ca8 R15: >>> ffffc90001c27c00 >>> [ 57.718438] FS: 00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) >>> knlGS:0000000000000000 >>> [ 57.719778] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 57.720808] CR2: 0000000000000094 CR3: 000000010d438002 CR4: >>> 0000000000170ee0 >>> [ 57.722019] note: mkfs.ext3[1064] exited with preempt_count 1 >>> [ 57.723067] ------------[ cut here ]------------ >>> [ 57.723960] WARNING: CPU: 3 PID: 1064 at kernel/exit.c:741 >>> do_exit+0x8cb/0xbc0 >>> [ 57.725196] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs 8021q >>> garp stp mrp llc bochs drm_vram_helper drm_ttm_helper kvm_intel ttm >>> drm_kms_helper kvm drm fb_sys_fops vi4 >>> [ 57.731011] CPU: 3 PID: 1064 Comm: mkfs.ext3 Tainted: G D >>> 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77 >>> [ 57.732704] Hardware name: QEMU Standard PC (i440FX + PIIX, >>> 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 >>> [ 57.734853] RIP: 0010:do_exit+0x8cb/0xbc0 >>> [ 57.735711] Code: e9 13 ff ff ff 48 8b bb e0 04 00 00 31 f6 e8 4c >>> db ff ff e9 98 fd ff ff 4c 89 e6 bf 05 06 00 00 e8 8a c8 00 00 e9 41 >>> f8 ff ff <0f> 0b e9 6b f7 ff ff 4b >>> [ 57.738851] RSP: 0018:ffffc90001c27ee8 EFLAGS: 00010082 >>> [ 57.739899] RAX: 0000000000000000 RBX: ffff888101e48000 
RCX: >>> 0000000000000000 >>> [ 57.741196] RDX: 0000000000000001 RSI: ffffffff8220a969 RDI: >>> 0000000000000009 >>> [ 57.742485] RBP: 0000000000000009 R08: 0000000000000000 R09: >>> c0000000ffffbfff >>> [ 57.743777] R10: 00007fdfce47d440 R11: ffffc90001c27d60 R12: >>> 0000000000000009 >>> [ 57.745081] R13: 0000000000000046 R14: 0000000000000000 R15: >>> 0000000000000000 >>> [ 57.746388] FS: 00007fdfce47d440(0000) GS:ffff8882b5ac0000(0000) >>> knlGS:0000000000000000 >>> [ 57.747806] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 57.748931] CR2: 0000000000000094 CR3: 000000010d438002 CR4: >>> 0000000000170ee0 >>> [ 57.750225] Call Trace: >>> [ 57.750894] <TASK> >>> [ 57.751535] make_task_dead+0x41/0xf0 >>> [ 57.752369] rewind_stack_and_make_dead+0x17/0x17 >>> [ 57.753336] RIP: 0033:0x7fdfce56dc17 >>> [ 57.754155] Code: 48 c7 c3 ff ff ff ff 48 89 d8 5b 5d 41 5c c3 0f >>> 1f 40 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 93 66 90 b8 10 00 00 >>> 00 0f 05 <48> 3d 01 f0 ff ff 73 08 >>> [ 57.757318] RSP: 002b:00007ffe24319828 EFLAGS: 00000246 ORIG_RAX: >>> 0000000000000010 >>> [ 57.758669] RAX: ffffffffffffffda RBX: 00000000004645a0 RCX: >>> 00007fdfce56dc17 >>> [ 57.759956] RDX: 00007ffe24319830 RSI: 0000000000001277 RDI: >>> 0000000000000003 >>> [ 57.761256] RBP: 0000000000460960 R08: 0000000000000400 R09: >>> 0000000000000000 >>> [ 57.762531] R10: 0000000000000000 R11: 0000000000000246 R12: >>> 0000000000000000 >>> [ 57.763806] R13: 000000000000d200 R14: 0000000000000000 R15: >>> 0000000000000000 >>> [ 57.765177] </TASK> >>> [ 57.765813] ---[ end trace 0000000000000000 ]--- >>> [ 57.790046] md0: detected capacity change from 107520 to 0 >>> [ 57.792834] md: md0 stopped. 
>>> [ 78.843853] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: >>> [ 78.845334] rcu: 10-...0: (0 ticks this GP) >>> idle=07b/1/0x4000000000000000 softirq=1140/1140 fqs=4805 >>> [ 78.847246] (detected by 13, t=21005 jiffies, g=9013, q=1419) >>> [ 78.848619] Sending NMI from CPU 13 to CPUs 10: >>> [ 78.849810] NMI backtrace for cpu 10 >>> [ 78.849813] CPU: 10 PID: 1081 Comm: mdadm Tainted: G D >>> W 5.18.0-rc3.mx64.425-00108-g6ad84d559b8c #77 >>> [ 78.849816] Hardware name: QEMU Standard PC (i440FX + PIIX, >>> 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 >>> [ 78.849817] RIP: 0010:queued_spin_lock_slowpath+0x4c/0x1d0 >>> [ 78.849832] Code: 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 >>> a9 00 01 ff ff 75 1b 85 c0 75 0f b8 01 00 00 00 66 89 07 5b 5d 41 5c >>> c3 f3 90 <8b> 07 84 c0 75 f8 eb e7 >>> [ 78.849834] RSP: 0018:ffffc90001c9f9e0 EFLAGS: 00000002 >>> [ 78.849837] RAX: 0000000000040101 RBX: ffff88810c914fc8 RCX: >>> 0000000000000000 >>> [ 78.849838] RDX: 0000000000000000 RSI: 0000000000000000 RDI: >>> ffff888102177c30 >>> [ 78.849840] RBP: 0000000000000000 R08: ffff88810c914fc8 R09: >>> ffff888106a4ed10 >>> [ 78.849841] R10: ffffc90001c9fae8 R11: ffff888101b048d8 R12: >>> ffff888103833000 >>> [ 78.849842] R13: ffff888102177800 R14: ffffc90001c9fb20 R15: >>> ffffc90001c9fa78 >>> [ 78.849844] FS: 00007fd3d66c4340(0000) GS:ffff8882b5c80000(0000) >>> knlGS:0000000000000000 >>> [ 78.849847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 78.849848] CR2: 00000000004a5b58 CR3: 000000010d438001 CR4: >>> 0000000000170ee0 >>> [ 78.849850] Call Trace: >>> [ 78.849853] <TASK> >>> [ 78.849855] bfq_insert_requests+0xae/0x2340 >>> [ 78.849862] ? 
submit_bio_noacct_nocheck+0x225/0x2b0 >>> [ 78.849868] blk_mq_sched_insert_requests+0x5c/0x150 >>> [ 78.849872] blk_mq_flush_plug_list+0xe1/0x2a0 >>> [ 78.849876] __blk_flush_plug+0xdf/0x120 >>> [ 78.849879] blk_finish_plug+0x27/0x40 >>> [ 78.849882] read_pages+0x15b/0x360 >>> [ 78.849891] page_cache_ra_unbounded+0x120/0x170 >>> [ 78.849894] filemap_get_pages+0xdd/0x5f0 >>> [ 78.849899] filemap_read+0xbf/0x350 >>> [ 78.849902] ? __mod_memcg_lruvec_state+0x72/0xc0 >>> [ 78.849907] ? __mod_lruvec_page_state+0xb4/0x160 >>> [ 78.849909] ? folio_add_lru+0x51/0x80 >>> [ 78.849912] ? _raw_spin_unlock+0x12/0x30 >>> [ 78.849916] ? __handle_mm_fault+0xdee/0x14d0 >>> [ 78.849921] blkdev_read_iter+0xa9/0x180 >>> [ 78.849924] new_sync_read+0x109/0x180 >>> [ 78.849929] vfs_read+0x187/0x1b0 >>> [ 78.849932] ksys_read+0xa1/0xe0 >>> [ 78.849935] do_syscall_64+0x42/0x90 >>> [ 78.849938] entry_SYSCALL_64_after_hwframe+0x44/0xae >>> [ 78.849941] RIP: 0033:0x7fd3d6322f8e >>> [ 78.849944] Code: c0 e9 c6 fe ff ff 48 8d 3d a7 07 0a 00 48 83 ec >>> 08 e8 b6 e1 01 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 >>> 14 0f 05 <48> 3d 00 f0 ff ff 77 59 >>> [ 78.849945] RSP: 002b:00007ffe92d46ea8 EFLAGS: 00000246 ORIG_RAX: >>> 0000000000000000 >>> [ 78.849948] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: >>> 00007fd3d6322f8e >>> [ 78.849949] RDX: 0000000000001000 RSI: 00000000004a3000 RDI: >>> 0000000000000003 >>> [ 78.849950] RBP: 0000000000000003 R08: 00000000004a3000 R09: >>> 0000000000000003 >>> [ 78.849951] R10: 00007fd3d623d0a8 R11: 0000000000000246 R12: >>> 00000000004a2a60 >>> [ 78.849952] R13: 0000000000000000 R14: 00000000004a3000 R15: >>> 000000000048a4a0 >>> [ 78.849954] </TASK> >> >> Looks like bfq or block issue, will try it from my side. 
Hmm, it could be an md-specific issue, because I find the stack below
after a similar call trace happened:

vm79:~> ps aux | grep "\[md" | grep D | awk '{print $2}'
2087
vm79:~> sudo cat /proc/2087/stack
[<0>] raid1_sync_request+0x65e/0xb60 [raid1]
[<0>] md_do_sync+0xa13/0xf50 [md_mod]
[<0>] md_thread+0x131/0x180 [md_mod]
[<0>] kthread+0xe8/0x110
[<0>] ret_from_fork+0x22/0x30

Donald, could you share the md process stack when you see the call trace?

Thanks,
Guoqing
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-21 18:23 ` Donald Buczek
  2022-05-23  1:08   ` Guoqing Jiang
@ 2022-05-24 15:51   ` Logan Gunthorpe
  1 sibling, 0 replies; 45+ messages in thread
From: Logan Gunthorpe @ 2022-05-24 15:51 UTC (permalink / raw)
To: Donald Buczek, Song Liu, Guoqing Jiang; +Cc: linux-raid

On 2022-05-21 12:23, Donald Buczek wrote:
>> I noticed a clear regression with mdadm tests with this patch in md-next
>> (7e6ba434cc6080).
>>
>> Before the patch, tests 07reshape5intr and 07revert-grow would fail
>> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs
>> for the latter).
>>
>> After this patch, both tests always fail.
>>
>> I don't have time to dig into why this is, but it would be nice if
>> someone can at least fix the regression. It is hard to make any progress
>> on these tests if we are continuing to further break them.
>
> Hmmm. I wanted to try to help a bit by reproducing and digging into this.
>
> But it seems that more or less ALL tests hang my system one way or
> another.
>
> I've used a qemu/kvm machine with md-next and mdraid master.
>
> Is this supposed to work?

I know those tests are full of bugs; I've been trying to remedy that
somewhat. You can run with --tests=07reshape5intr to run a specific test,
and I have a branch[1] with some fixes which also adds a --loop option to
run tests multiple times.

I expect 00raid0 to fail with the master branch; however, I have not seen
that specific traceback. Might be worth debugging that one if you can.

Logan

[1] https://github.com/lsgunth/mdadm bugfixes2
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-20 18:27 ` Logan Gunthorpe
  2022-05-21 18:23 ` Donald Buczek
@ 2022-06-02  8:12 ` Xiao Ni
  1 sibling, 0 replies; 45+ messages in thread
From: Xiao Ni @ 2022-06-02 8:12 UTC (permalink / raw)
To: Logan Gunthorpe; +Cc: Song Liu, Guoqing Jiang, Donald Buczek, linux-raid

Hi all,

I've tried the test and got the same result. 8b48ec23cc51 ("md: don't
unregister sync_thread with reconfig_mutex held") introduces this problem.

My environment:

kernel: latest md-next branch
mdadm: Logan's tree, bugfixes2 branch (https://github.com/lsgunth/mdadm/),
       because Logan fixed some errors in the test cases. I'm also trying
       to fix other failures based on that tree.

Case 1:

mdadm -CR /dev/md0 -amd -l5 -c 1024 -n2 --assume-clean /dev/loop1 /dev/loop2
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/
mdadm /dev/md0 -a /dev/loop0
mdadm /dev/md0 --grow -n3
cat /proc/mdstat
umount /mnt/
mdadm --stop /dev/md0
mdadm -A /dev/md0 /dev/loop[0-2]
cat /proc/mdstat       (reshape doesn't happen)
mount /dev/md0 /mnt/   (fails here)
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/md0,
missing codepage or helper program, or other error.

Case 2: I commented out some code, built the md module and loaded the new
module:

//if (reconfig_mutex_held)
//	mddev_unlock(mddev);
/* resync has finished, collect result */
md_unregister_thread(&mddev->sync_thread);
//if (reconfig_mutex_held)
//	mddev_lock_nointr(mddev);

With this change, the test from Case 1 passes.

Need to look more to find the root cause. It looks like some information
in the superblock doesn't sync to the member devices, so after assembly it
can't do the reshape again. Need more time to look at this.

Best Regards
Xiao

On Sat, May 21, 2022 at 2:27 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> Hi,
>
> On 2022-05-10 00:44, Song Liu wrote:
> > On Mon, May 9, 2022 at 1:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
> >> On 5/9/22 2:37 PM, Song Liu wrote:
> >>> On Fri, May 6, 2022 at 4:37 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
> >>>> From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
> >>>>
> >>>> Unregister sync_thread doesn't need to hold reconfig_mutex since it
> >>>> doesn't reconfigure array.
> >>>>
> >>>> And it could cause deadlock problem for raid5 as follows:
> >>>>
> >>>> 1. process A tried to reap sync thread with reconfig_mutex held after echo
> >>>>    idle to sync_action.
> >>>> 2. raid5 sync thread was blocked if there were too many active stripes.
> >>>> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
> >>>>    which causes the number of active stripes can't be decreased.
> >>>> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
> >>>>    to hold reconfig_mutex.
> >>>>
> >>>> More details in the link:
> >>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
> >>>>
> >>>> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
> >>>> (thanks for the report from kernel test robot <lkp@intel.com>) if the
> >>>> reconfig_mutex is held, and mddev_is_locked is introduced accordingly.
> >>> mddev_is_locked() feels really hacky to me. It cannot tell whether
> >>> mddev is locked by the current thread. So technically, we can unlock
> >>> reconfigure_mutex for another thread by accident, no?
> >>
> >> I can switch back to V2 if you think that is the correct way to do it,
> >> though no one commented about the change in dm-raid.
> >
> > I guess v2 is the best at the moment. I pushed a slightly modified v2 to
> > md-next.
>
> I noticed a clear regression with mdadm tests with this patch in md-next
> (7e6ba434cc6080).
>
> Before the patch, tests 07reshape5intr and 07revert-grow would fail
> fairly infrequently (about 1 in 4 runs for the former and 1 in 30 runs
> for the latter).
>
> After this patch, both tests always fail.
>
> I don't have time to dig into why this is, but it would be nice if
> someone can at least fix the regression. It is hard to make any progress
> on these tests if we are continuing to further break them.
>
> Logan
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-06 11:36 ` [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held Guoqing Jiang
  2022-05-09  6:37   ` Song Liu
@ 2022-05-09  8:18   ` Donald Buczek
  2022-05-09  8:48     ` Guoqing Jiang
  1 sibling, 1 reply; 45+ messages in thread
From: Donald Buczek @ 2022-05-09 8:18 UTC (permalink / raw)
To: Guoqing Jiang, song; +Cc: linux-raid

On 5/6/22 1:36 PM, Guoqing Jiang wrote:
> From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
>
> Unregister sync_thread doesn't need to hold reconfig_mutex since it
> doesn't reconfigure array.
>
> And it could cause deadlock problem for raid5 as follows:
>
> 1. process A tried to reap sync thread with reconfig_mutex held after echo
>    idle to sync_action.
> 2. raid5 sync thread was blocked if there were too many active stripes.
> 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
>    which causes the number of active stripes can't be decreased.
> 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
>    to hold reconfig_mutex.
>
> More details in the link:
> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
>
> Let's call unregister thread between mddev_unlock and mddev_lock_nointr
> (thanks for the report from kernel test robot <lkp@intel.com>) if the
> reconfig_mutex is held, and mddev_is_locked is introduced accordingly.
>
> Reported-by: Donald Buczek <buczek@molgen.mpg.de>
> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
> ---
>  drivers/md/md.c | 9 ++++++++-
>  drivers/md/md.h | 5 +++++
>  2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 309b3af906ad..525f65682356 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9432,10 +9432,17 @@ void md_reap_sync_thread(struct mddev *mddev)
>  {
>  	struct md_rdev *rdev;
>  	sector_t old_dev_sectors = mddev->dev_sectors;
> -	bool is_reshaped = false;
> +	bool is_reshaped = false, is_locked = false;
>
> +	if (mddev_is_locked(mddev)) {
> +		is_locked = true;
> +		mddev_unlock(mddev);

Hmmm. Can it be excluded that another task is holding the mutex?

Best
  Donald

> +	}
>  	/* resync has finished, collect result */
>  	md_unregister_thread(&mddev->sync_thread);
> +	if (is_locked)
> +		mddev_lock_nointr(mddev);
> +
>  	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>  	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
>  	    mddev->degraded != mddev->raid_disks) {
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 6ac283864533..af6f3978b62b 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -549,6 +549,11 @@ static inline int mddev_trylock(struct mddev *mddev)
>  }
>  extern void mddev_unlock(struct mddev *mddev);
>
> +static inline int mddev_is_locked(struct mddev *mddev)
> +{
> +	return mutex_is_locked(&mddev->reconfig_mutex);
> +}
> +
>  static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
>  {
>  	atomic_add(nr_sectors, &bdev->bd_disk->sync_io);

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433
* Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held
  2022-05-09  8:18   ` Donald Buczek
@ 2022-05-09  8:48     ` Guoqing Jiang
  0 siblings, 0 replies; 45+ messages in thread
From: Guoqing Jiang @ 2022-05-09 8:48 UTC (permalink / raw)
To: Donald Buczek, song; +Cc: linux-raid

On 5/9/22 4:18 PM, Donald Buczek wrote:
>
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -9432,10 +9432,17 @@ void md_reap_sync_thread(struct mddev *mddev)
>>  {
>>  	struct md_rdev *rdev;
>>  	sector_t old_dev_sectors = mddev->dev_sectors;
>> -	bool is_reshaped = false;
>> +	bool is_reshaped = false, is_locked = false;
>> +	if (mddev_is_locked(mddev)) {
>> +		is_locked = true;
>> +		mddev_unlock(mddev);
>
> Hmmm. Can it be excluded that another task is holding the mutex?

Yes, it is hacky as Song said, so we will probably go back to v2.

Thanks,
Guoqing