* [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
@ 2021-08-13  6:05 Guoqing Jiang
  2021-08-13  7:49 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Guoqing Jiang @ 2021-08-13  6:05 UTC (permalink / raw)
  To: song; +Cc: linux-raid, jens

From: Guoqing Jiang <jiangguoqing@kylinos.cn>

We can't split a bio that has more than BIO_MAX_VECS sectors; otherwise the
call trace below is triggered, because we can end up allocating an
oversized write-behind bio later.

[ 8.097936] bvec_alloc+0x90/0xc0
[ 8.098934] bio_alloc_bioset+0x1b3/0x260
[ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1]
[ 8.100988] ? __bio_clone_fast+0xa8/0xe0
[ 8.102008] md_handle_request+0x158/0x1d0 [md_mod]
[ 8.103050] md_submit_bio+0xcd/0x110 [md_mod]
[ 8.104084] submit_bio_noacct+0x139/0x530
[ 8.105127] submit_bio+0x78/0x1d0
[ 8.106163] ext4_io_submit+0x48/0x60 [ext4]
[ 8.107242] ext4_writepages+0x652/0x1170 [ext4]
[ 8.108300] ? do_writepages+0x41/0x100
[ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4]
[ 8.110406] do_writepages+0x41/0x100
[ 8.111450] __filemap_fdatawrite_range+0xc5/0x100
[ 8.112513] file_write_and_wait_range+0x61/0xb0
[ 8.113564] ext4_sync_file+0x73/0x370 [ext4]
[ 8.114607] __x64_sys_fsync+0x33/0x60
[ 8.115635] do_syscall_64+0x33/0x40
[ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae

[1]: https://bugs.archlinux.org/task/70992

Reported-by: Jens Stutte <jens@chianterastutte.eu>
Tested-by: Jens Stutte <jens@chianterastutte.eu>
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
---
Note: this depends on commit 018eca456c4b4dca56aaf1ec27f309c74d0fe246
in linux-block for-next branch.
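
For reference, with 4K pages this caps the split at BIO_MAX_VECS (256) *
PAGE_SECTORS (8) = 2048 sectors, i.e. 1 MiB of payload per write-behind bio.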

 drivers/md/raid1.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3c44c4bb40fc..ab21abc056b8 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1454,6 +1454,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		goto retry_write;
 	}
 
+	max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
 	if (max_sectors < bio_sectors(bio)) {
 		struct bio *split = bio_split(bio, max_sectors,
 					      GFP_NOIO, &conf->bio_split);
-- 
2.25.1



* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-13  6:05 [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors Guoqing Jiang
@ 2021-08-13  7:49 ` Christoph Hellwig
  2021-08-13  8:38   ` Guoqing Jiang
  2021-08-13  9:27   ` kernel test robot
  2021-08-13 10:12   ` kernel test robot
  2 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-08-13  7:49 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: song, linux-raid, jens, linux-block

On Fri, Aug 13, 2021 at 02:05:10PM +0800, Guoqing Jiang wrote:
> From: Guoqing Jiang <jiangguoqing@kylinos.cn>
> 
> We can't split a bio that has more than BIO_MAX_VECS sectors; otherwise the
> call trace below is triggered, because we can end up allocating an
> oversized write-behind bio later.
> 
> [ 8.097936] bvec_alloc+0x90/0xc0
> [ 8.098934] bio_alloc_bioset+0x1b3/0x260
> [ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1]

Which bio_alloc_bioset is this?  The one in alloc_behind_master_bio?

In which case I think you want to limit the reduction of max_sectors
to just the write behind case, and clearly document what is going on.

In general the size of a bio only depends on the number of vectors, not
the total I/O size.  But alloc_behind_master_bio allocates new backing
pages using order 0 allocations, so in this exceptional case the total
size does actually matter.
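
For reference, the allocation pattern in question looks roughly like this
(paraphrased from drivers/md/raid1.c, error handling and cleanup trimmed):

	static void alloc_behind_master_bio(struct r1bio *r1_bio, struct bio *bio)
	{
		int size = bio->bi_iter.bi_size;
		/* one vector per page of copied payload */
		unsigned vcnt = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
		struct bio *behind_bio;
		int i = 0;

		/* bvec_alloc fails here when vcnt > BIO_MAX_VECS */
		behind_bio = bio_alloc_bioset(GFP_NOIO, vcnt, &r1_bio->mddev->bio_set);
		if (!behind_bio)
			return;

		while (i < vcnt && size) {
			int len = min_t(int, PAGE_SIZE, size);
			/* order 0 allocation for each page of the copy */
			struct page *page = alloc_page(GFP_NOIO);

			bio_add_page(behind_bio, page, len, 0);
			size -= len;
			i++;
		}
	}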

While we're at it: this huge memory allocation looks really deadlock
prone.


* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-13  7:49 ` Christoph Hellwig
@ 2021-08-13  8:38   ` Guoqing Jiang
  2021-08-14  7:55     ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Guoqing Jiang @ 2021-08-13  8:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: song, linux-raid, jens, linux-block



On 8/13/21 3:49 PM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2021 at 02:05:10PM +0800, Guoqing Jiang wrote:
>> From: Guoqing Jiang <jiangguoqing@kylinos.cn>
>>
>> We can't split a bio that has more than BIO_MAX_VECS sectors; otherwise the
>> call trace below is triggered, because we can end up allocating an
>> oversized write-behind bio later.
>>
>> [ 8.097936] bvec_alloc+0x90/0xc0
>> [ 8.098934] bio_alloc_bioset+0x1b3/0x260
>> [ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1]
> Which bio_alloc_bioset is this?  The one in alloc_behind_master_bio?

Yes, it should be the one since bio_clone_fast calls bio_alloc_bioset 
with 0 iovecs.
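
From memory, the relevant shape (simplified, not verbatim):

	struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
	{
		struct bio *b;

		/* nr_iovecs == 0: the clone shares the parent's bi_io_vec */
		b = bio_alloc_bioset(gfp_mask, 0, bs);
		if (!b)
			return NULL;

		__bio_clone_fast(b, bio);
		return b;
	}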

> In which case I think you want to limit the reduction of max_sectors
> to just the write behind case, and clearly document what is going on.

Ok, thanks.

> In general the size of a bio only depends on the number of vectors, not
> the total I/O size.  But alloc_behind_master_bio allocates new backing
> pages using order 0 allocations, so in this exceptional case the total
> size does actually matter.
>
> While we're at it: this huge memory allocation looks really deadlock
> prone.

Hmm, let me think more about it, or could you share your thought? 😉

Thanks,
Guoqing


* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-13  6:05 [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors Guoqing Jiang
@ 2021-08-13  9:27   ` kernel test robot
  2021-08-13  9:27   ` kernel test robot
  2021-08-13 10:12   ` kernel test robot
  2 siblings, 0 replies; 17+ messages in thread
From: kernel test robot @ 2021-08-13  9:27 UTC (permalink / raw)
  To: Guoqing Jiang, song; +Cc: clang-built-linux, kbuild-all, linux-raid, jens

[-- Attachment #1: Type: text/plain, Size: 10658 bytes --]

Hi Guoqing,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on song-md/md-next]
[also build test ERROR on v5.14-rc5 next-20210812]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Guoqing-Jiang/raid1-ensure-bio-doesn-t-have-more-than-BIO_MAX_VECS-sectors/20210813-140810
base:   git://git.kernel.org/pub/scm/linux/kernel/git/song/md.git md-next
config: hexagon-randconfig-r001-20210813 (attached as .config)
compiler: clang version 12.0.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/29b7720a83de1deea0d8ecfafe0db46146636b15
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Guoqing-Jiang/raid1-ensure-bio-doesn-t-have-more-than-BIO_MAX_VECS-sectors/20210813-140810
        git checkout 29b7720a83de1deea0d8ecfafe0db46146636b15
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross O=build_dir ARCH=hexagon SHELL=/bin/bash drivers/md/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> drivers/md/raid1.c:1459:55: error: use of undeclared identifier 'PAGE_SECTORS'
           max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
                                                                ^
>> drivers/md/raid1.c:1459:55: error: use of undeclared identifier 'PAGE_SECTORS'
   2 errors generated.


vim +/PAGE_SECTORS +1459 drivers/md/raid1.c

  1320	
  1321	static void raid1_write_request(struct mddev *mddev, struct bio *bio,
  1322					int max_write_sectors)
  1323	{
  1324		struct r1conf *conf = mddev->private;
  1325		struct r1bio *r1_bio;
  1326		int i, disks;
  1327		struct bitmap *bitmap = mddev->bitmap;
  1328		unsigned long flags;
  1329		struct md_rdev *blocked_rdev;
  1330		struct blk_plug_cb *cb;
  1331		struct raid1_plug_cb *plug = NULL;
  1332		int first_clone;
  1333		int max_sectors;
  1334	
  1335		if (mddev_is_clustered(mddev) &&
  1336		     md_cluster_ops->area_resyncing(mddev, WRITE,
  1337			     bio->bi_iter.bi_sector, bio_end_sector(bio))) {
  1338	
  1339			DEFINE_WAIT(w);
  1340			for (;;) {
  1341				prepare_to_wait(&conf->wait_barrier,
  1342						&w, TASK_IDLE);
  1343				if (!md_cluster_ops->area_resyncing(mddev, WRITE,
  1344								bio->bi_iter.bi_sector,
  1345								bio_end_sector(bio)))
  1346					break;
  1347				schedule();
  1348			}
  1349			finish_wait(&conf->wait_barrier, &w);
  1350		}
  1351	
  1352		/*
  1353		 * Register the new request and wait if the reconstruction
  1354		 * thread has put up a bar for new requests.
  1355		 * Continue immediately if no resync is active currently.
  1356		 */
  1357		wait_barrier(conf, bio->bi_iter.bi_sector);
  1358	
  1359		r1_bio = alloc_r1bio(mddev, bio);
  1360		r1_bio->sectors = max_write_sectors;
  1361	
  1362		if (conf->pending_count >= max_queued_requests) {
  1363			md_wakeup_thread(mddev->thread);
  1364			raid1_log(mddev, "wait queued");
  1365			wait_event(conf->wait_barrier,
  1366				   conf->pending_count < max_queued_requests);
  1367		}
  1368		/* first select target devices under rcu_lock and
  1369		 * inc refcount on their rdev.  Record them by setting
  1370		 * bios[x] to bio
  1371		 * If there are known/acknowledged bad blocks on any device on
  1372		 * which we have seen a write error, we want to avoid writing those
  1373		 * blocks.
  1374		 * This potentially requires several writes to write around
  1375		 * the bad blocks.  Each set of writes gets it's own r1bio
  1376		 * with a set of bios attached.
  1377		 */
  1378	
  1379		disks = conf->raid_disks * 2;
  1380	 retry_write:
  1381		blocked_rdev = NULL;
  1382		rcu_read_lock();
  1383		max_sectors = r1_bio->sectors;
  1384		for (i = 0;  i < disks; i++) {
  1385			struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
  1386			if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
  1387				atomic_inc(&rdev->nr_pending);
  1388				blocked_rdev = rdev;
  1389				break;
  1390			}
  1391			r1_bio->bios[i] = NULL;
  1392			if (!rdev || test_bit(Faulty, &rdev->flags)) {
  1393				if (i < conf->raid_disks)
  1394					set_bit(R1BIO_Degraded, &r1_bio->state);
  1395				continue;
  1396			}
  1397	
  1398			atomic_inc(&rdev->nr_pending);
  1399			if (test_bit(WriteErrorSeen, &rdev->flags)) {
  1400				sector_t first_bad;
  1401				int bad_sectors;
  1402				int is_bad;
  1403	
  1404				is_bad = is_badblock(rdev, r1_bio->sector, max_sectors,
  1405						     &first_bad, &bad_sectors);
  1406				if (is_bad < 0) {
  1407					/* mustn't write here until the bad block is
  1408					 * acknowledged*/
  1409					set_bit(BlockedBadBlocks, &rdev->flags);
  1410					blocked_rdev = rdev;
  1411					break;
  1412				}
  1413				if (is_bad && first_bad <= r1_bio->sector) {
  1414					/* Cannot write here at all */
  1415					bad_sectors -= (r1_bio->sector - first_bad);
  1416					if (bad_sectors < max_sectors)
  1417						/* mustn't write more than bad_sectors
  1418						 * to other devices yet
  1419						 */
  1420						max_sectors = bad_sectors;
  1421					rdev_dec_pending(rdev, mddev);
  1422					/* We don't set R1BIO_Degraded as that
  1423					 * only applies if the disk is
  1424					 * missing, so it might be re-added,
  1425					 * and we want to know to recover this
  1426					 * chunk.
  1427					 * In this case the device is here,
  1428					 * and the fact that this chunk is not
  1429					 * in-sync is recorded in the bad
  1430					 * block log
  1431					 */
  1432					continue;
  1433				}
  1434				if (is_bad) {
  1435					int good_sectors = first_bad - r1_bio->sector;
  1436					if (good_sectors < max_sectors)
  1437						max_sectors = good_sectors;
  1438				}
  1439			}
  1440			r1_bio->bios[i] = bio;
  1441		}
  1442		rcu_read_unlock();
  1443	
  1444		if (unlikely(blocked_rdev)) {
  1445			/* Wait for this device to become unblocked */
  1446			int j;
  1447	
  1448			for (j = 0; j < i; j++)
  1449				if (r1_bio->bios[j])
  1450					rdev_dec_pending(conf->mirrors[j].rdev, mddev);
  1451			r1_bio->state = 0;
  1452			allow_barrier(conf, bio->bi_iter.bi_sector);
  1453			raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
  1454			md_wait_for_blocked_rdev(blocked_rdev, mddev);
  1455			wait_barrier(conf, bio->bi_iter.bi_sector);
  1456			goto retry_write;
  1457		}
  1458	
> 1459		max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
  1460		if (max_sectors < bio_sectors(bio)) {
  1461			struct bio *split = bio_split(bio, max_sectors,
  1462						      GFP_NOIO, &conf->bio_split);
  1463			bio_chain(split, bio);
  1464			submit_bio_noacct(bio);
  1465			bio = split;
  1466			r1_bio->master_bio = bio;
  1467			r1_bio->sectors = max_sectors;
  1468		}
  1469	
  1470		if (blk_queue_io_stat(bio->bi_bdev->bd_disk->queue))
  1471			r1_bio->start_time = bio_start_io_acct(bio);
  1472		atomic_set(&r1_bio->remaining, 1);
  1473		atomic_set(&r1_bio->behind_remaining, 0);
  1474	
  1475		first_clone = 1;
  1476	
  1477		for (i = 0; i < disks; i++) {
  1478			struct bio *mbio = NULL;
  1479			struct md_rdev *rdev = conf->mirrors[i].rdev;
  1480			if (!r1_bio->bios[i])
  1481				continue;
  1482	
  1483			if (first_clone) {
  1484				/* do behind I/O ?
  1485				 * Not if there are too many, or cannot
  1486				 * allocate memory, or a reader on WriteMostly
  1487				 * is waiting for behind writes to flush */
  1488				if (bitmap &&
  1489				    (atomic_read(&bitmap->behind_writes)
  1490				     < mddev->bitmap_info.max_write_behind) &&
  1491				    !waitqueue_active(&bitmap->behind_wait)) {
  1492					alloc_behind_master_bio(r1_bio, bio);
  1493				}
  1494	
  1495				md_bitmap_startwrite(bitmap, r1_bio->sector, r1_bio->sectors,
  1496						     test_bit(R1BIO_BehindIO, &r1_bio->state));
  1497				first_clone = 0;
  1498			}
  1499	
  1500			if (r1_bio->behind_master_bio)
  1501				mbio = bio_clone_fast(r1_bio->behind_master_bio,
  1502						      GFP_NOIO, &mddev->bio_set);
  1503			else
  1504				mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
  1505	
  1506			if (r1_bio->behind_master_bio) {
  1507				if (test_bit(CollisionCheck, &rdev->flags))
  1508					wait_for_serialization(rdev, r1_bio);
  1509				if (test_bit(WriteMostly, &rdev->flags))
  1510					atomic_inc(&r1_bio->behind_remaining);
  1511			} else if (mddev->serialize_policy)
  1512				wait_for_serialization(rdev, r1_bio);
  1513	
  1514			r1_bio->bios[i] = mbio;
  1515	
  1516			mbio->bi_iter.bi_sector	= (r1_bio->sector +
  1517					   conf->mirrors[i].rdev->data_offset);
  1518			bio_set_dev(mbio, conf->mirrors[i].rdev->bdev);
  1519			mbio->bi_end_io	= raid1_end_write_request;
  1520			mbio->bi_opf = bio_op(bio) | (bio->bi_opf & (REQ_SYNC | REQ_FUA));
  1521			if (test_bit(FailFast, &conf->mirrors[i].rdev->flags) &&
  1522			    !test_bit(WriteMostly, &conf->mirrors[i].rdev->flags) &&
  1523			    conf->raid_disks - mddev->degraded > 1)
  1524				mbio->bi_opf |= MD_FAILFAST;
  1525			mbio->bi_private = r1_bio;
  1526	
  1527			atomic_inc(&r1_bio->remaining);
  1528	
  1529			if (mddev->gendisk)
  1530				trace_block_bio_remap(mbio, disk_devt(mddev->gendisk),
  1531						      r1_bio->sector);
  1532			/* flush_pending_writes() needs access to the rdev so...*/
  1533			mbio->bi_bdev = (void *)conf->mirrors[i].rdev;
  1534	
  1535			cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
  1536			if (cb)
  1537				plug = container_of(cb, struct raid1_plug_cb, cb);
  1538			else
  1539				plug = NULL;
  1540			if (plug) {
  1541				bio_list_add(&plug->pending, mbio);
  1542				plug->pending_cnt++;
  1543			} else {
  1544				spin_lock_irqsave(&conf->device_lock, flags);
  1545				bio_list_add(&conf->pending_bio_list, mbio);
  1546				conf->pending_count++;
  1547				spin_unlock_irqrestore(&conf->device_lock, flags);
  1548				md_wakeup_thread(mddev->thread);
  1549			}
  1550		}
  1551	
  1552		r1_bio_write_done(r1_bio);
  1553	
  1554		/* In case raid1d snuck in to freeze_array */
  1555		wake_up(&conf->wait_barrier);
  1556	}
  1557	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28124 bytes --]

* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-13  6:05 [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors Guoqing Jiang
@ 2021-08-13 10:12   ` kernel test robot
  2021-08-13  9:27   ` kernel test robot
  2021-08-13 10:12   ` kernel test robot
  2 siblings, 0 replies; 17+ messages in thread
From: kernel test robot @ 2021-08-13 10:12 UTC (permalink / raw)
  To: Guoqing Jiang, song; +Cc: kbuild-all, linux-raid, jens

[-- Attachment #1: Type: text/plain, Size: 13412 bytes --]

Hi Guoqing,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on song-md/md-next]
[also build test ERROR on v5.14-rc5 next-20210812]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Guoqing-Jiang/raid1-ensure-bio-doesn-t-have-more-than-BIO_MAX_VECS-sectors/20210813-140810
base:   git://git.kernel.org/pub/scm/linux/kernel/git/song/md.git md-next
config: nds32-randconfig-r035-20210813 (attached as .config)
compiler: nds32le-linux-gcc (GCC) 10.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/29b7720a83de1deea0d8ecfafe0db46146636b15
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Guoqing-Jiang/raid1-ensure-bio-doesn-t-have-more-than-BIO_MAX_VECS-sectors/20210813-140810
        git checkout 29b7720a83de1deea0d8ecfafe0db46146636b15
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-10.3.0 make.cross O=build_dir ARCH=nds32 SHELL=/bin/bash drivers/md/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from include/linux/kernel.h:15,
                    from include/asm-generic/bug.h:20,
                    from ./arch/nds32/include/generated/asm/bug.h:1,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:15,
                    from drivers/md/raid1.c:26:
   drivers/md/raid1.c: In function 'raid1_write_request':
>> drivers/md/raid1.c:1459:55: error: 'PAGE_SECTORS' undeclared (first use in this function); did you mean 'PAGE_MEMORY'?
    1459 |  max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
         |                                                       ^~~~~~~~~~~~
   include/linux/minmax.h:20:39: note: in definition of macro '__typecheck'
      20 |  (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                                       ^
   include/linux/minmax.h:36:24: note: in expansion of macro '__safe_cmp'
      36 |  __builtin_choose_expr(__safe_cmp(x, y), \
         |                        ^~~~~~~~~~
   include/linux/minmax.h:104:27: note: in expansion of macro '__careful_cmp'
     104 | #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
         |                           ^~~~~~~~~~~~~
   drivers/md/raid1.c:1459:16: note: in expansion of macro 'min_t'
    1459 |  max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
         |                ^~~~~
   drivers/md/raid1.c:1459:55: note: each undeclared identifier is reported only once for each function it appears in
    1459 |  max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
         |                                                       ^~~~~~~~~~~~
   include/linux/minmax.h:20:39: note: in definition of macro '__typecheck'
      20 |  (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                                       ^
   include/linux/minmax.h:36:24: note: in expansion of macro '__safe_cmp'
      36 |  __builtin_choose_expr(__safe_cmp(x, y), \
         |                        ^~~~~~~~~~
   include/linux/minmax.h:104:27: note: in expansion of macro '__careful_cmp'
     104 | #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
         |                           ^~~~~~~~~~~~~
   drivers/md/raid1.c:1459:16: note: in expansion of macro 'min_t'
    1459 |  max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
         |                ^~~~~
>> include/linux/minmax.h:36:2: error: first argument to '__builtin_choose_expr' not a constant
      36 |  __builtin_choose_expr(__safe_cmp(x, y), \
         |  ^~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:104:27: note: in expansion of macro '__careful_cmp'
     104 | #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
         |                           ^~~~~~~~~~~~~
   drivers/md/raid1.c:1459:16: note: in expansion of macro 'min_t'
    1459 |  max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
         |                ^~~~~


vim +1459 drivers/md/raid1.c

  1320	
  1321	static void raid1_write_request(struct mddev *mddev, struct bio *bio,
  1322					int max_write_sectors)
  1323	{
  1324		struct r1conf *conf = mddev->private;
  1325		struct r1bio *r1_bio;
  1326		int i, disks;
  1327		struct bitmap *bitmap = mddev->bitmap;
  1328		unsigned long flags;
  1329		struct md_rdev *blocked_rdev;
  1330		struct blk_plug_cb *cb;
  1331		struct raid1_plug_cb *plug = NULL;
  1332		int first_clone;
  1333		int max_sectors;
  1334	
  1335		if (mddev_is_clustered(mddev) &&
  1336		     md_cluster_ops->area_resyncing(mddev, WRITE,
  1337			     bio->bi_iter.bi_sector, bio_end_sector(bio))) {
  1338	
  1339			DEFINE_WAIT(w);
  1340			for (;;) {
  1341				prepare_to_wait(&conf->wait_barrier,
  1342						&w, TASK_IDLE);
  1343				if (!md_cluster_ops->area_resyncing(mddev, WRITE,
  1344								bio->bi_iter.bi_sector,
  1345								bio_end_sector(bio)))
  1346					break;
  1347				schedule();
  1348			}
  1349			finish_wait(&conf->wait_barrier, &w);
  1350		}
  1351	
  1352		/*
  1353		 * Register the new request and wait if the reconstruction
  1354		 * thread has put up a bar for new requests.
  1355		 * Continue immediately if no resync is active currently.
  1356		 */
  1357		wait_barrier(conf, bio->bi_iter.bi_sector);
  1358	
  1359		r1_bio = alloc_r1bio(mddev, bio);
  1360		r1_bio->sectors = max_write_sectors;
  1361	
  1362		if (conf->pending_count >= max_queued_requests) {
  1363			md_wakeup_thread(mddev->thread);
  1364			raid1_log(mddev, "wait queued");
  1365			wait_event(conf->wait_barrier,
  1366				   conf->pending_count < max_queued_requests);
  1367		}
  1368		/* first select target devices under rcu_lock and
  1369		 * inc refcount on their rdev.  Record them by setting
  1370		 * bios[x] to bio
  1371		 * If there are known/acknowledged bad blocks on any device on
  1372		 * which we have seen a write error, we want to avoid writing those
  1373		 * blocks.
  1374		 * This potentially requires several writes to write around
  1375		 * the bad blocks.  Each set of writes gets it's own r1bio
  1376		 * with a set of bios attached.
  1377		 */
  1378	
  1379		disks = conf->raid_disks * 2;
  1380	 retry_write:
  1381		blocked_rdev = NULL;
  1382		rcu_read_lock();
  1383		max_sectors = r1_bio->sectors;
  1384		for (i = 0;  i < disks; i++) {
  1385			struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
  1386			if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
  1387				atomic_inc(&rdev->nr_pending);
  1388				blocked_rdev = rdev;
  1389				break;
  1390			}
  1391			r1_bio->bios[i] = NULL;
  1392			if (!rdev || test_bit(Faulty, &rdev->flags)) {
  1393				if (i < conf->raid_disks)
  1394					set_bit(R1BIO_Degraded, &r1_bio->state);
  1395				continue;
  1396			}
  1397	
  1398			atomic_inc(&rdev->nr_pending);
  1399			if (test_bit(WriteErrorSeen, &rdev->flags)) {
  1400				sector_t first_bad;
  1401				int bad_sectors;
  1402				int is_bad;
  1403	
  1404				is_bad = is_badblock(rdev, r1_bio->sector, max_sectors,
  1405						     &first_bad, &bad_sectors);
  1406				if (is_bad < 0) {
  1407					/* mustn't write here until the bad block is
  1408					 * acknowledged*/
  1409					set_bit(BlockedBadBlocks, &rdev->flags);
  1410					blocked_rdev = rdev;
  1411					break;
  1412				}
  1413				if (is_bad && first_bad <= r1_bio->sector) {
  1414					/* Cannot write here at all */
  1415					bad_sectors -= (r1_bio->sector - first_bad);
  1416					if (bad_sectors < max_sectors)
  1417						/* mustn't write more than bad_sectors
  1418						 * to other devices yet
  1419						 */
  1420						max_sectors = bad_sectors;
  1421					rdev_dec_pending(rdev, mddev);
  1422					/* We don't set R1BIO_Degraded as that
  1423					 * only applies if the disk is
  1424					 * missing, so it might be re-added,
  1425					 * and we want to know to recover this
  1426					 * chunk.
  1427					 * In this case the device is here,
  1428					 * and the fact that this chunk is not
  1429					 * in-sync is recorded in the bad
  1430					 * block log
  1431					 */
  1432					continue;
  1433				}
  1434				if (is_bad) {
  1435					int good_sectors = first_bad - r1_bio->sector;
  1436					if (good_sectors < max_sectors)
  1437						max_sectors = good_sectors;
  1438				}
  1439			}
  1440			r1_bio->bios[i] = bio;
  1441		}
  1442		rcu_read_unlock();
  1443	
  1444		if (unlikely(blocked_rdev)) {
  1445			/* Wait for this device to become unblocked */
  1446			int j;
  1447	
  1448			for (j = 0; j < i; j++)
  1449				if (r1_bio->bios[j])
  1450					rdev_dec_pending(conf->mirrors[j].rdev, mddev);
  1451			r1_bio->state = 0;
  1452			allow_barrier(conf, bio->bi_iter.bi_sector);
  1453			raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
  1454			md_wait_for_blocked_rdev(blocked_rdev, mddev);
  1455			wait_barrier(conf, bio->bi_iter.bi_sector);
  1456			goto retry_write;
  1457		}
  1458	
> 1459		max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
  1460		if (max_sectors < bio_sectors(bio)) {
  1461			struct bio *split = bio_split(bio, max_sectors,
  1462						      GFP_NOIO, &conf->bio_split);
  1463			bio_chain(split, bio);
  1464			submit_bio_noacct(bio);
  1465			bio = split;
  1466			r1_bio->master_bio = bio;
  1467			r1_bio->sectors = max_sectors;
  1468		}
  1469	
  1470		if (blk_queue_io_stat(bio->bi_bdev->bd_disk->queue))
  1471			r1_bio->start_time = bio_start_io_acct(bio);
  1472		atomic_set(&r1_bio->remaining, 1);
  1473		atomic_set(&r1_bio->behind_remaining, 0);
  1474	
  1475		first_clone = 1;
  1476	
  1477		for (i = 0; i < disks; i++) {
  1478			struct bio *mbio = NULL;
  1479			struct md_rdev *rdev = conf->mirrors[i].rdev;
  1480			if (!r1_bio->bios[i])
  1481				continue;
  1482	
  1483			if (first_clone) {
  1484				/* do behind I/O ?
  1485				 * Not if there are too many, or cannot
  1486				 * allocate memory, or a reader on WriteMostly
  1487				 * is waiting for behind writes to flush */
  1488				if (bitmap &&
  1489				    (atomic_read(&bitmap->behind_writes)
  1490				     < mddev->bitmap_info.max_write_behind) &&
  1491				    !waitqueue_active(&bitmap->behind_wait)) {
  1492					alloc_behind_master_bio(r1_bio, bio);
  1493				}
  1494	
  1495				md_bitmap_startwrite(bitmap, r1_bio->sector, r1_bio->sectors,
  1496						     test_bit(R1BIO_BehindIO, &r1_bio->state));
  1497				first_clone = 0;
  1498			}
  1499	
  1500			if (r1_bio->behind_master_bio)
  1501				mbio = bio_clone_fast(r1_bio->behind_master_bio,
  1502						      GFP_NOIO, &mddev->bio_set);
  1503			else
  1504				mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
  1505	
  1506			if (r1_bio->behind_master_bio) {
  1507				if (test_bit(CollisionCheck, &rdev->flags))
  1508					wait_for_serialization(rdev, r1_bio);
  1509				if (test_bit(WriteMostly, &rdev->flags))
  1510					atomic_inc(&r1_bio->behind_remaining);
  1511			} else if (mddev->serialize_policy)
  1512				wait_for_serialization(rdev, r1_bio);
  1513	
  1514			r1_bio->bios[i] = mbio;
  1515	
  1516			mbio->bi_iter.bi_sector	= (r1_bio->sector +
  1517					   conf->mirrors[i].rdev->data_offset);
  1518			bio_set_dev(mbio, conf->mirrors[i].rdev->bdev);
  1519			mbio->bi_end_io	= raid1_end_write_request;
  1520			mbio->bi_opf = bio_op(bio) | (bio->bi_opf & (REQ_SYNC | REQ_FUA));
  1521			if (test_bit(FailFast, &conf->mirrors[i].rdev->flags) &&
  1522			    !test_bit(WriteMostly, &conf->mirrors[i].rdev->flags) &&
  1523			    conf->raid_disks - mddev->degraded > 1)
  1524				mbio->bi_opf |= MD_FAILFAST;
  1525			mbio->bi_private = r1_bio;
  1526	
  1527			atomic_inc(&r1_bio->remaining);
  1528	
  1529			if (mddev->gendisk)
  1530				trace_block_bio_remap(mbio, disk_devt(mddev->gendisk),
  1531						      r1_bio->sector);
  1532			/* flush_pending_writes() needs access to the rdev so...*/
  1533			mbio->bi_bdev = (void *)conf->mirrors[i].rdev;
  1534	
  1535			cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
  1536			if (cb)
  1537				plug = container_of(cb, struct raid1_plug_cb, cb);
  1538			else
  1539				plug = NULL;
  1540			if (plug) {
  1541				bio_list_add(&plug->pending, mbio);
  1542				plug->pending_cnt++;
  1543			} else {
  1544				spin_lock_irqsave(&conf->device_lock, flags);
  1545				bio_list_add(&conf->pending_bio_list, mbio);
  1546				conf->pending_count++;
  1547				spin_unlock_irqrestore(&conf->device_lock, flags);
  1548				md_wakeup_thread(mddev->thread);
  1549			}
  1550		}
  1551	
  1552		r1_bio_write_done(r1_bio);
  1553	
  1554		/* In case raid1d snuck in to freeze_array */
  1555		wake_up(&conf->wait_barrier);
  1556	}
  1557	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30071 bytes --]

* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-13  8:38   ` Guoqing Jiang
@ 2021-08-14  7:55     ` Christoph Hellwig
  2021-08-14  8:57       ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-08-14  7:55 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: Christoph Hellwig, song, linux-raid, jens, linux-block

On Fri, Aug 13, 2021 at 04:38:59PM +0800, Guoqing Jiang wrote:
> 
> Ok, thanks.
> 
> > In general the size of a bio only depends on the number of vectors, not
> > the total I/O size.  But alloc_behind_master_bio allocates new backing
> > pages using order 0 allocations, so in this exceptional case the total
> > size does actually matter.
> > 
> > While we're at it: this huge memory allocation looks really deadlock
> > prone.
> 
> Hmm, let me think more about it, or could you share your thought? 😉

Well, you'd need a mempool which can fit the max payload of a bio,
that is BIO_MAX_VECS pages.
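
Untested sketch of what I mean (the pool name is made up):

	/* hypothetical: preallocate enough order-0 pages for one max-size
	 * behind bio so alloc_behind_master_bio can always make progress */
	conf->behind_page_pool = mempool_create_page_pool(BIO_MAX_VECS, 0);

	/* ... and in alloc_behind_master_bio(), instead of alloc_page(): */
	page = mempool_alloc(conf->behind_page_pool, GFP_NOIO);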

FYI, this is what I'd do instead of this patch for now.  We don't really
need a vector per sector, just per page.  So this limits the I/O
size a little less.

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3c44c4bb40fc..5b27d995302e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1454,6 +1454,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		goto retry_write;
 	}
 
+	/*
+	 * When using a bitmap, we may call alloc_behind_master_bio below.
+	 * alloc_behind_master_bio allocates a copy of the data payload a page
+	 * at a time and thus needs a new bio that can fit the whole payload
+	 * this bio in page sized chunks.
+	 */
+	if (bitmap)
+		max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SIZE);
+
 	if (max_sectors < bio_sectors(bio)) {
 		struct bio *split = bio_split(bio, max_sectors,
 					      GFP_NOIO, &conf->bio_split);
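
As a rough sketch of the mempool idea above (illustration only, not part
of this patch; the behind_page_pool field and its setup site are
hypothetical), reserving enough order-0 pages to back one worst-case
write-behind bio could look like:

	/* hypothetical field in struct r1conf: mempool_t *behind_page_pool;
	 * guarantees forward progress for one BIO_MAX_VECS-page copy */
	conf->behind_page_pool = mempool_create_page_pool(BIO_MAX_VECS, 0);
	if (!conf->behind_page_pool)
		return -ENOMEM;

alloc_behind_master_bio would then take pages from that pool (sleeping
on it under memory pressure) instead of doing bare order-0 allocations.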

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-14  7:55     ` Christoph Hellwig
@ 2021-08-14  8:57       ` Ming Lei
  2021-08-16  6:27         ` Guoqing Jiang
  2021-08-16  9:37         ` Christoph Hellwig
  0 siblings, 2 replies; 17+ messages in thread
From: Ming Lei @ 2021-08-14  8:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Guoqing Jiang, song, linux-raid, jens, linux-block

On Sat, Aug 14, 2021 at 08:55:21AM +0100, Christoph Hellwig wrote:
> On Fri, Aug 13, 2021 at 04:38:59PM +0800, Guoqing Jiang wrote:
> > 
> > Ok, thanks.
> > 
> > > In general the size of a bio only depends on the number of vectors, not
> > > the total I/O size.  But alloc_behind_master_bio allocates new backing
> > > pages using order 0 allocations, so in this exceptional case the total
> > > size does actually matter.
> > > 
> > > While we're at it: this huge memory allocation looks really deadlock
> > > prone.
> > 
> > Hmm, let me think more about it, or could you share your thoughts?
> 
> Well, you'd need a mempool which can fit the max payload of a bio,
> that is BIO_MAX_VECS pages.
> 
> FYI, this is what I'd do instead of this patch for now.  We don't really
> need a vector per sector, just per page.  So this limits the I/O
> size a little less.
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 3c44c4bb40fc..5b27d995302e 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1454,6 +1454,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>  		goto retry_write;
>  	}
>  
> +	/*
> +	 * When using a bitmap, we may call alloc_behind_master_bio below.
> +	 * alloc_behind_master_bio allocates a copy of the data payload a page
> +	 * at a time and thus needs a new bio that can fit the whole payload
> +	 * this bio in page sized chunks.
> +	 */
> +	if (bitmap)
> +		max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SIZE);

s/PAGE_SIZE/PAGE_SECTORS
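
With that substitution the clamp would read roughly as below
(PAGE_SECTORS being the usual PAGE_SIZE >> 9):

	max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);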

> +
>  	if (max_sectors < bio_sectors(bio)) {
>  		struct bio *split = bio_split(bio, max_sectors,
>  					      GFP_NOIO, &conf->bio_split);
> 

Here the limit is max single-page vectors, and the above way may not work,
such as:

0 ~ 254: each bvec's length is 512
255: bvec's length is 8192

the total length is just 512*255 + 8192 = 138752 bytes = 271 sectors, but it
still may need 257 bvecs, which can't be allocated via bio_alloc_bioset().

One solution is to add queue limit of max_single_page_bvec, and let
blk_queue_split() handle it.
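
To spell out where 257 comes from (back-of-the-envelope, assuming 4 KiB
pages and page-aligned data):

	255 bvecs of 512 bytes -> 255 single-page bvecs (one page each)
	1 bvec of 8192 bytes   -> 8192 / 4096 = 2 single-page bvecs
	total                  -> 257 bvecs > BIO_MAX_VECS (256)

so a clamp expressed purely in sectors cannot rule this layout out.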



Thanks,
Ming


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-14  8:57       ` Ming Lei
@ 2021-08-16  6:27         ` Guoqing Jiang
  2021-08-16  7:13           ` Ming Lei
  2021-08-16  9:37         ` Christoph Hellwig
  1 sibling, 1 reply; 17+ messages in thread
From: Guoqing Jiang @ 2021-08-16  6:27 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig; +Cc: song, linux-raid, jens, linux-block

Hi Ming and Christoph,

On 8/14/21 4:57 PM, Ming Lei wrote:
> On Sat, Aug 14, 2021 at 08:55:21AM +0100, Christoph Hellwig wrote:
>> On Fri, Aug 13, 2021 at 04:38:59PM +0800, Guoqing Jiang wrote:
>>> Ok, thanks.
>>>
>>>> In general the size of a bio only depends on the number of vectors, not
>>>> the total I/O size.  But alloc_behind_master_bio allocates new backing
>>>> pages using order 0 allocations, so in this exceptional case the total
>>>> size does actually matter.
>>>>
>>>> While we're at it: this huge memory allocation looks really deadlock
>>>> prone.
>>> Hmm, let me think more about it, or could you share your thoughts?
>> Well, you'd need a mempool which can fit the max payload of a bio,
>> that is BIO_MAX_VECS pages.

IIUC, the behind bio is allocated from a bio_set (mddev->bio_set) which is
initialized in md_run by calling bioset_init, so the mempool (bvec_pool) of
this bio_set is created by biovec_init_pool, which uses the global biovec
slabs. Do we really need another mempool? Or is there no potential deadlock
in this case?
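
For reference, this is roughly the initialization in question, as done
in md_run() (quoted from memory, so treat the exact flags as
approximate):

	if (!bioset_initialized(&mddev->bio_set)) {
		err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0,
				  BIOSET_NEED_BVECS);
		if (err)
			return err;
	}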

>> FYI, this is what I'd do instead of this patch for now.  We don't really
>> need a vector per sector, just per page.  So this limits the I/O
>> size a little less.
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 3c44c4bb40fc..5b27d995302e 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1454,6 +1454,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>>   		goto retry_write;
>>   	}
>>   
>> +	/*
>> +	 * When using a bitmap, we may call alloc_behind_master_bio below.
>> +	 * alloc_behind_master_bio allocates a copy of the data payload a page
>> +	 * at a time and thus needs a new bio that can fit the whole payload
>> +	 * this bio in page sized chunks.
>> +	 */

Thanks for the above, I will copy it accordingly. I will check whether
WriteMostly is set beforehand, then check both the flag and the bitmap.

>> +	if (bitmap)
>> +		max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SIZE);
> s/PAGE_SIZE/PAGE_SECTORS

Agree.

>> +
>>   	if (max_sectors < bio_sectors(bio)) {
>>   		struct bio *split = bio_split(bio, max_sectors,
>>   					      GFP_NOIO, &conf->bio_split);
> Here the limit is max single-page vectors, and the above way may not work,
> such as:
>
> 0 ~ 254: each bvec's length is 512
> 255: bvec's length is 8192
>
> the total length is just 512*255 + 8192 = 138752 bytes = 271 sectors, but it
> still may need 257 bvecs, which can't be allocated via bio_alloc_bioset().

Thanks for looking deeper! I guess it is because of how vcnt is calculated.

> One solution is to add queue limit of max_single_page_bvec, and let
> blk_queue_split() handle it.

The path (blk_queue_split -> blk_bio_segment_split -> bvec_split_segs)
already respects the max_segments limit. Do you mean introducing a
max_single_page_bvec limit, then performing a similar check as for
max_segments?

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-16  6:27         ` Guoqing Jiang
@ 2021-08-16  7:13           ` Ming Lei
  0 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2021-08-16  7:13 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: Christoph Hellwig, song, linux-raid, jens, linux-block

On Mon, Aug 16, 2021 at 02:27:48PM +0800, Guoqing Jiang wrote:
> Hi Ming and Christoph,
> 
> On 8/14/21 4:57 PM, Ming Lei wrote:
> > On Sat, Aug 14, 2021 at 08:55:21AM +0100, Christoph Hellwig wrote:
> > > On Fri, Aug 13, 2021 at 04:38:59PM +0800, Guoqing Jiang wrote:
> > > > Ok, thanks.
> > > > 
> > > > > In general the size of a bio only depends on the number of vectors, not
> > > > > the total I/O size.  But alloc_behind_master_bio allocates new backing
> > > > > pages using order 0 allocations, so in this exceptional case the total
> > > > > size does actually matter.
> > > > > 
> > > > > While we're at it: this huge memory allocation looks really deadlock
> > > > > prone.
> > > > Hmm, let me think more about it, or could you share your thoughts?
> > > Well, you'd need a mempool which can fit the max payload of a bio,
> > > that is BIO_MAX_VECS pages.
> 
> IIUC, the behind bio is allocated from a bio_set (mddev->bio_set) which is
> initialized in md_run by calling bioset_init, so the mempool (bvec_pool) of
> this bio_set is created by biovec_init_pool, which uses the global biovec
> slabs. Do we really need another mempool? Or is there no potential deadlock
> in this case?
> 
> > > FYI, this is what I'd do instead of this patch for now.  We don't really
> > > need a vector per sector, just per page.  So this limits the I/O
> > > size a little less.
> > > 
> > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > > index 3c44c4bb40fc..5b27d995302e 100644
> > > --- a/drivers/md/raid1.c
> > > +++ b/drivers/md/raid1.c
> > > @@ -1454,6 +1454,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> > >   		goto retry_write;
> > >   	}
> > > +	/*
> > > +	 * When using a bitmap, we may call alloc_behind_master_bio below.
> > > +	 * alloc_behind_master_bio allocates a copy of the data payload a page
> > > +	 * at a time and thus needs a new bio that can fit the whole payload
> > > +	 * this bio in page sized chunks.
> > > +	 */
> 
> Thanks for the above, I will copy it accordingly. I will check whether
> WriteMostly is set beforehand, then check both the flag and the bitmap.
> 
> > > +	if (bitmap)
> > > +		max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SIZE);
> > s/PAGE_SIZE/PAGE_SECTORS
> 
> Agree.
> 
> > > +
> > >   	if (max_sectors < bio_sectors(bio)) {
> > >   		struct bio *split = bio_split(bio, max_sectors,
> > >   					      GFP_NOIO, &conf->bio_split);
> > Here the limit is max single-page vectors, and the above way may not work,
> > such as:
> > 
> > 0 ~ 254: each bvec's length is 512
> > 255: bvec's length is 8192
> > 
> > the total length is just 512*255 + 8192 = 138752 bytes = 271 sectors, but it
> > still may need 257 bvecs, which can't be allocated via bio_alloc_bioset().
> 
> Thanks for looking deeper! I guess it is because of how vcnt is calculated.
> 
> > One solution is to add queue limit of max_single_page_bvec, and let
> > blk_queue_split() handle it.
> 
> The path (blk_queue_split -> blk_bio_segment_split -> bvec_split_segs)
> already respects the max_segments limit. Do you mean introducing a
> max_single_page_bvec limit, then performing a similar check as for
> max_segments?

Yes, then the bio is guaranteed not to reach the max single-page bvec limit,
just like what __blk_queue_bounce() does.


thanks,
Ming


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-14  8:57       ` Ming Lei
  2021-08-16  6:27         ` Guoqing Jiang
@ 2021-08-16  9:37         ` Christoph Hellwig
  2021-08-16 11:40           ` Ming Lei
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-08-16  9:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Guoqing Jiang, song, linux-raid, jens, linux-block

On Sat, Aug 14, 2021 at 04:57:06PM +0800, Ming Lei wrote:
> > +	if (bitmap)
> > +		max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SIZE);
> 
> s/PAGE_SIZE/PAGE_SECTORS

Yeah, max_sectors is in sector units, not bytes; I messed that up.

> 
> > +
> >  	if (max_sectors < bio_sectors(bio)) {
> >  		struct bio *split = bio_split(bio, max_sectors,
> >  					      GFP_NOIO, &conf->bio_split);
> > 
> 
> Here the limit is max single-page vectors, and the above way may not work,
> such as:
> 
> 0 ~ 254: each bvec's length is 512
> 255: bvec's length is 8192
> 
> the total length is just 512*255 + 8192 = 138752 bytes = 271 sectors, but it
> still may need 257 bvecs, which can't be allocated via bio_alloc_bioset().

Yes, we still need the rounding magic that alloc_behind_master_bio uses
here.
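
That rounding is, roughly (paraphrased, so treat it as a sketch rather
than the exact source), one bvec per page-sized chunk of the payload,
with a trailing partial page rounded up:

	int vcnt = (size >> PAGE_SHIFT) + ((size & ~PAGE_MASK) != 0);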

> One solution is to add queue limit of max_single_page_bvec, and let
> blk_queue_split() handle it.

I'd rather not bloat the core with this.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-16  9:37         ` Christoph Hellwig
@ 2021-08-16 11:40           ` Ming Lei
  2021-08-17  5:06             ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2021-08-16 11:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Guoqing Jiang, song, linux-raid, jens, linux-block

On Mon, Aug 16, 2021 at 10:37:54AM +0100, Christoph Hellwig wrote:
> On Sat, Aug 14, 2021 at 04:57:06PM +0800, Ming Lei wrote:
> > > +	if (bitmap)
> > > +		max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * PAGE_SIZE);
> > 
> > s/PAGE_SIZE/PAGE_SECTORS
> 
> Yeah, max_sectors is in sector units, not bytes; I messed that up.
> 
> > 
> > > +
> > >  	if (max_sectors < bio_sectors(bio)) {
> > >  		struct bio *split = bio_split(bio, max_sectors,
> > >  					      GFP_NOIO, &conf->bio_split);
> > > 
> > 
> > Here the limit is max single-page vectors, and the above way may not work,
> > such as:
> > 
> > 0 ~ 254: each bvec's length is 512
> > 255: bvec's length is 8192
> > 
> > the total length is just 512*255 + 8192 = 138752 bytes = 271 sectors, but it
> > still may need 257 bvecs, which can't be allocated via bio_alloc_bioset().
> 
> Yes, we still need the rounding magic that alloc_behind_master_bio uses
> here.

But it is wrong to use max sectors to limit the number of bvecs (segments), isn't it?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-16 11:40           ` Ming Lei
@ 2021-08-17  5:06             ` Christoph Hellwig
  2021-08-17 12:32               ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-08-17  5:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Guoqing Jiang, song, linux-raid, jens, linux-block

On Mon, Aug 16, 2021 at 07:40:48PM +0800, Ming Lei wrote:
> > > 
> > > 0 ~ 254: each bvec's length is 512
> > > 255: bvec's length is 8192
> > > 
> > > the total length is just 512*255 + 8192 = 138752 bytes = 271 sectors, but it
> > > still may need 257 bvecs, which can't be allocated via bio_alloc_bioset().
> > 
> > Yes, we still need the rounding magic that alloc_behind_master_bio uses
> > here.
> 
> But it is wrong to use max sectors to limit the number of bvecs (segments), isn't it?

The raid1 write behind code cares about the size of a bio it can reach by
adding order 0 pages to it.  The bvecs are part of that, and I think the
calculation in the patch documents that as well.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-17  5:06             ` Christoph Hellwig
@ 2021-08-17 12:32               ` Ming Lei
  2021-09-24 15:34                 ` Jens Stutte (Archiv)
  0 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2021-08-17 12:32 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Guoqing Jiang, song, linux-raid, jens, linux-block

On Tue, Aug 17, 2021 at 06:06:12AM +0100, Christoph Hellwig wrote:
> On Mon, Aug 16, 2021 at 07:40:48PM +0800, Ming Lei wrote:
> > > > 
> > > > 0 ~ 254: each bvec's length is 512
> > > > 255: bvec's length is 8192
> > > > 
> > > > the total length is just 512*255 + 8192 = 138752 bytes = 271 sectors, but it
> > > > still may need 257 bvecs, which can't be allocated via bio_alloc_bioset().
> > > 
> > > Yes, we still need the rounding magic that alloc_behind_master_bio uses
> > > here.
> > 
> > But it is wrong to use max sectors to limit the number of bvecs (segments), isn't it?
> 
> The raid1 write behind code cares about the size of a bio it can reach by
> adding order 0 pages to it.  The bvecs are part of that, and I think the
> calculation in the patch documents that as well.

Thinking about it further, your patch and Guoqing's are correct & sufficient,
since bio_copy_data() just copies the byte (sector) stream from the fs bio to
the write-behind bio.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-08-17 12:32               ` Ming Lei
@ 2021-09-24 15:34                 ` Jens Stutte (Archiv)
  2021-09-25 23:02                   ` Guoqing Jiang
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Stutte (Archiv) @ 2021-09-24 15:34 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig; +Cc: Guoqing Jiang, song, linux-raid, linux-block

[-- Attachment #1: Type: text/plain, Size: 3628 bytes --]

Hi all,

I just had the occasion to test the new patch as it landed in Arch Linux
5.14.7. Unfortunately it does not work for me. Attached you can find a
modification that works for me, though I am not really sure why
write_behind does not seem to be set to true in my configuration. If there
is any more data I can provide to help you investigate, please let me
know.

Thanks for any clues,

Jens

My configuration:

[root@vdr jens]# mdadm --detail -v /dev/md0
/dev/md0:
            Version : 1.2
      Creation Time : Fri Dec 26 09:50:53 2014
         Raid Level : raid1
         Array Size : 1953381440 (1862.89 GiB 2000.26 GB)
      Used Dev Size : 1953381440 (1862.89 GiB 2000.26 GB)
       Raid Devices : 2
      Total Devices : 2
        Persistence : Superblock is persistent

      Intent Bitmap : Internal

        Update Time : Fri Sep 24 17:30:51 2021
              State : active
     Active Devices : 2
    Working Devices : 2
     Failed Devices : 0
      Spare Devices : 0

Consistency Policy : bitmap

               Name : vdr:0  (local to host vdr)
               UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
             Events : 32805

     Number   Major   Minor   RaidDevice State
        2       8       33        0      active sync   /dev/sdc1
        3       8       17        1      active sync   /dev/sdb1

[root@vdr jens]# mdadm -X /dev/sdb1
         Filename : /dev/sdb1
            Magic : 6d746962
          Version : 4
             UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
           Events : 32804
   Events Cleared : 32804
            State : OK
        Chunksize : 64 MB
           Daemon : 5s flush period
       Write Mode : Allow write behind, max 4096
        Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
           Bitmap : 29807 bits (chunks), 3 dirty (0.0%)

[root@vdr jens]# mdadm -X /dev/sdc1
         Filename : /dev/sdc1
            Magic : 6d746962
          Version : 4
             UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
           Events : 32804
   Events Cleared : 32804
            State : OK
        Chunksize : 64 MB
           Daemon : 5s flush period
       Write Mode : Allow write behind, max 4096
        Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
           Bitmap : 29807 bits (chunks), 3 dirty (0.0%)

On 17.08.21 at 14:32, Ming Lei wrote:
> On Tue, Aug 17, 2021 at 06:06:12AM +0100, Christoph Hellwig wrote:
>> On Mon, Aug 16, 2021 at 07:40:48PM +0800, Ming Lei wrote:
>>>>> 0 ~ 254: each bvec's length is 512
>>>>> 255: bvec's length is 8192
>>>>>
>>>>> the total length is just 512*255 + 8192 = 138752 bytes = 271 sectors, but it
>>>>> still may need 257 bvecs, which can't be allocated via bio_alloc_bioset().
>>>> Yes, we still need the rounding magic that alloc_behind_master_bio uses
>>>> here.
>>> But it is wrong to use max sectors to limit the number of bvecs (segments), isn't it?
>> The raid1 write behind code cares about the size of a bio it can reach by
>> adding order 0 pages to it.  The bvecs are part of that, and I think the
>> calculation in the patch documents that as well.
> Thinking about it further, your patch and Guoqing's are correct & sufficient,
> since bio_copy_data() just copies the byte (sector) stream from the fs bio to
> the write-behind bio.
>
>
> Thanks,
> Ming
>

[-- Attachment #2: raid1.patch --]
[-- Type: text/x-patch, Size: 2245 bytes --]

diff --unified --recursive --text archlinux-linux/drivers/md/raid1.c archlinux-linux-diff/drivers/md/raid1.c
--- archlinux-linux/drivers/md/raid1.c	2021-09-24 14:37:15.347771866 +0200
+++ archlinux-linux-diff/drivers/md/raid1.c	2021-09-24 14:40:02.443978319 +0200
@@ -1501,7 +1501,7 @@
 			 * Not if there are too many, or cannot
 			 * allocate memory, or a reader on WriteMostly
 			 * is waiting for behind writes to flush */
-			if (bitmap &&
+			if (bitmap && write_behind &&
 			    (atomic_read(&bitmap->behind_writes)
 			     < mddev->bitmap_info.max_write_behind) &&
 			    !waitqueue_active(&bitmap->behind_wait)) {
diff --unified --recursive --text archlinux-linux/drivers/md/raid1.c archlinux-linux-changed/drivers/md/raid1.c
--- archlinux-linux/drivers/md/raid1.c	2021-09-24 15:43:22.842680119 +0200
+++ archlinux-linux-changed/drivers/md/raid1.c	2021-09-24 15:43:59.426142955 +0200
@@ -1329,7 +1329,6 @@
 	struct raid1_plug_cb *plug = NULL;
 	int first_clone;
 	int max_sectors;
-	bool write_behind = false;
 
 	if (mddev_is_clustered(mddev) &&
 	     md_cluster_ops->area_resyncing(mddev, WRITE,
@@ -1383,14 +1382,6 @@
 	for (i = 0;  i < disks; i++) {
 		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
 
-		/*
-		 * The write-behind io is only attempted on drives marked as
-		 * write-mostly, which means we could allocate write behind
-		 * bio later.
-		 */
-		if (rdev && test_bit(WriteMostly, &rdev->flags))
-			write_behind = true;
-
 		if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
 			atomic_inc(&rdev->nr_pending);
 			blocked_rdev = rdev;
@@ -1470,7 +1461,7 @@
 	 * at a time and thus needs a new bio that can fit the whole payload
 	 * this bio in page sized chunks.
 	 */
-	if (write_behind && bitmap)
+	if (bitmap)
 		max_sectors = min_t(int, max_sectors,
 				    BIO_MAX_VECS * (PAGE_SIZE >> 9));
 	if (max_sectors < bio_sectors(bio)) {
@@ -1501,7 +1492,7 @@
 			 * Not if there are too many, or cannot
 			 * allocate memory, or a reader on WriteMostly
 			 * is waiting for behind writes to flush */
-			if (bitmap &&
+			if (bitmap && 
 			    (atomic_read(&bitmap->behind_writes)
 			     < mddev->bitmap_info.max_write_behind) &&
 			    !waitqueue_active(&bitmap->behind_wait)) {

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors
  2021-09-24 15:34                 ` Jens Stutte (Archiv)
@ 2021-09-25 23:02                   ` Guoqing Jiang
  0 siblings, 0 replies; 17+ messages in thread
From: Guoqing Jiang @ 2021-09-25 23:02 UTC (permalink / raw)
  To: Jens Stutte (Archiv), Ming Lei, Christoph Hellwig
  Cc: song, linux-raid, linux-block



On 9/24/21 11:34 PM, Jens Stutte (Archiv) wrote:
> Hi all,
>
> I just had the occasion to test the new patch as it landed in Arch Linux
> 5.14.7. Unfortunately it does not work for me. Attached you can find a
> modification that works for me, though I am not really sure why
> write_behind does not seem to be set to true in my configuration. If there
> is any more data I can provide to help you investigate, please let me
> know.

Thanks for the report!  As commented in bugzilla [1], this is because
write-behind IO can still happen even without a write-mostly device. I will
send a new patch after you confirm it works.
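
Purely as a sketch of that direction (not necessarily the patch that
will be sent; the exact condition may differ), clamping whenever the
bitmap permits write-behind at all, instead of keying off a WriteMostly
rdev, would look something like:

	/* write-behind can kick in whenever the bitmap allows it,
	 * independent of any WriteMostly device */
	if (bitmap && mddev->bitmap_info.max_write_behind)
		max_sectors = min_t(int, max_sectors,
				    BIO_MAX_VECS * (PAGE_SIZE >> 9));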


1. https://bugzilla.kernel.org/show_bug.cgi?id=213181

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-09-25 23:02 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-13  6:05 [PATCH] raid1: ensure bio doesn't have more than BIO_MAX_VECS sectors Guoqing Jiang
2021-08-13  7:49 ` Christoph Hellwig
2021-08-13  8:38   ` Guoqing Jiang
2021-08-14  7:55     ` Christoph Hellwig
2021-08-14  8:57       ` Ming Lei
2021-08-16  6:27         ` Guoqing Jiang
2021-08-16  7:13           ` Ming Lei
2021-08-16  9:37         ` Christoph Hellwig
2021-08-16 11:40           ` Ming Lei
2021-08-17  5:06             ` Christoph Hellwig
2021-08-17 12:32               ` Ming Lei
2021-09-24 15:34                 ` Jens Stutte (Archiv)
2021-09-25 23:02                   ` Guoqing Jiang
2021-08-13  9:27 ` kernel test robot
2021-08-13  9:27   ` kernel test robot
2021-08-13 10:12 ` kernel test robot
2021-08-13 10:12   ` kernel test robot
