* [PATCH RFC v2 0/8] md/raid5: set STRIPE_SIZE as a configurable value
@ 2020-04-21 12:39 Yufen Yu
From: Yufen Yu @ 2020-04-21 12:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen

Hi, all

 Currently, STRIPE_SIZE is equal to PAGE_SIZE. That means RAID5 will
 issue each bio to disk in units of at least 64KB when PAGE_SIZE is 64KB
 on arm64. However, filesystems usually issue bios in units of 4KB, so
 RAID5 wastes disk bandwidth.

 To solve this problem, this patchset provides a new config option,
 CONFIG_MD_RAID456_STRIPE_SIZE, to let users configure STRIPE_SIZE.
 The default value is 4096.

 Normally, the default STRIPE_SIZE gives better performance, and
 NeilBrown has suggested simply fixing STRIPE_SIZE at 4096. However, our
 test results show that a larger STRIPE_SIZE can perform better when the
 issued IOs are mostly bigger than 4096 bytes. Thus, this patchset still
 makes STRIPE_SIZE a configurable value.

 In the current implementation, grow_buffers() uses alloc_page() to
 allocate the buffers for each stripe_head. With this change, that would
 mean allocating a 64K buffer but using only 4K of it. To save memory, we
 try to 'compress' multiple buffers of a stripe_head into one real page.
 Details are in patch #2.
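
 A minimal sketch of the mapping (illustrative only; the real helpers
 arrive in patch #2, and 'disk_idx' stands for the index of sh->dev[i]):

     page_index  = (disk_idx * STRIPE_SIZE) / PAGE_SIZE;
     page_offset = (disk_idx * STRIPE_SIZE) % PAGE_SIZE;
     sh->dev[disk_idx].page   = sh->pages.page[page_index];
     sh->dev[disk_idx].offset = page_offset;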

 To evaluate the new feature, we create a raid5 device '/dev/md5' with
 4 SSD disks and test it on an arm64 machine with 64KB PAGE_SIZE.
 
 1) We format /dev/md5 with mkfs.ext4 and mount it with the default
    options on /mnt. Then we test it with dbench using the command:
    dbench -D /mnt -t 1000 10. The results:
 
    'CONFIG_MD_RAID456_STRIPE_SIZE = 64KB'
   
      Operation      Count    AvgLat    MaxLat
      ----------------------------------------
      NTCreateX    9805011     0.021    64.728
      Close        7202525     0.001     0.120
      Rename        415213     0.051    44.681
      Unlink       1980066     0.079    93.147
      Deltree          240     1.793     6.516
      Mkdir            120     0.004     0.007
      Qpathinfo    8887512     0.007    37.114
      Qfileinfo    1557262     0.001     0.030
      Qfsinfo      1629582     0.012     0.152
      Sfileinfo     798756     0.040    57.641
      Find         3436004     0.019    57.782
      WriteX       4887239     0.021    57.638
      ReadX        15370483     0.005    37.818
      LockX          31934     0.003     0.022
      UnlockX        31933     0.001     0.021
      Flush         687205    13.302   530.088
     
     Throughput 307.799 MB/sec  10 clients  10 procs  max_latency=530.091 ms
     -------------------------------------------------------
     
    'CONFIG_MD_RAID456_STRIPE_SIZE = 4KB'
     
      Operation      Count    AvgLat    MaxLat
      ----------------------------------------
      NTCreateX    11999166     0.021    36.380
      Close        8814128     0.001     0.122
      Rename        508113     0.051    29.169
      Unlink       2423242     0.070    38.141
      Deltree          300     1.885     7.155
      Mkdir            150     0.004     0.006
      Qpathinfo    10875921     0.007    35.485
      Qfileinfo    1905837     0.001     0.032
      Qfsinfo      1994304     0.012     0.125
      Sfileinfo     977450     0.029    26.489
      Find         4204952     0.019     9.361
      WriteX       5981890     0.019    27.804
      ReadX        18809742     0.004    33.491
      LockX          39074     0.003     0.025
      UnlockX        39074     0.001     0.014
      Flush         841022    10.712   458.848
     
     Throughput 376.777 MB/sec  10 clients  10 procs  max_latency=458.852 ms
     -------------------------------------------------------
 
    It shows that setting STRIPE_SIZE to 4KB gives higher throughput
    (376.777 vs 307.799 MB/sec) and lower max latency (458.852 vs
    530.091 ms) than setting it to 64KB.
 
 2) We evaluate the IO throughput of /dev/md5 with fio, using these
    job files:
 
     [4KB randwrite]
     direct=1
     numjobs=2
     iodepth=64
     ioengine=libaio
     filename=/dev/md5
     bs=4k
     rw=randwrite

     [1MB write]
     direct=1
     numjobs=2
     iodepth=64
     ioengine=libaio
     filename=/dev/md5
     bs=1M
     rw=write
     
    The fio results are as follows:
     
                    | STRIPE_SIZE(64KB) | STRIPE_SIZE(4KB)
      --------------+-------------------+-----------------
      4KB randwrite |       15MB/s      |      100MB/s
      1MB write     |     1000MB/s      |      700MB/s
 
    The results show that when the issued IOs are bigger than 4KB
    (here, 1MB sequential writes), 64KB STRIPE_SIZE gives much higher
    throughput. But for 4KB random writes, where the IOs issued to the
    device are smaller, 4KB STRIPE_SIZE performs better.

V1:
 https://www.spinics.net/lists/raid/msg63111.html
 Just add CONFIG_MD_RAID456_STRIPE_SIZE to set STRIPE_SIZE

Yufen Yu (8):
  md/raid5: add CONFIG_MD_RAID456_STRIPE_SIZE to set STRIPE_SIZE
  md/raid5: add a member of r5pages for struct stripe_head
  md/raid5: allocate and free pages of r5pages
  md/raid5: set correct page offset for bi_io_vec in ops_run_io()
  md/raid5: set correct page offset for async_copy_data()
  md/raid5: add new xor function to support different page offset
  md/raid5: add offset array in scribble buffer
  md/raid5: compute xor with correct page offset

 crypto/async_tx/async_xor.c | 119 +++++++++++++++---
 drivers/md/Kconfig          |  16 +++
 drivers/md/raid5.c          | 233 ++++++++++++++++++++++++++++++------
 drivers/md/raid5.h          |  59 ++++++++-
 include/linux/async_tx.h    |  11 ++
 5 files changed, 384 insertions(+), 54 deletions(-)

-- 
2.21.1


* [PATCH RFC v2 1/8] md/raid5: add CONFIG_MD_RAID456_STRIPE_SIZE to set STRIPE_SIZE
@ 2020-04-21 12:39 ` Yufen Yu
From: Yufen Yu @ 2020-04-21 12:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen

In RAID5, if the issued bio is bigger than STRIPE_SIZE, it will be split
into STRIPE_SIZE units and processed one by one. Even for sizes less
than STRIPE_SIZE, RAID5 still requests data from disk in units of at
least STRIPE_SIZE.

Currently, STRIPE_SIZE is equal to PAGE_SIZE. Since filesystems usually
issue bios in units of 4KB, there is no problem when PAGE_SIZE is 4KB.
But with a 64KB PAGE_SIZE, a bio from the filesystem requests 4KB of
data while RAID5 issues IO of at least STRIPE_SIZE (64KB) each time.
That wastes both disk bandwidth and xor computation.

To avoid this waste, we add a new CONFIG option to adjust STRIPE_SIZE.
The default value is 4096. Users can also set a value bigger than 4KB
for special requirements, such as when the issued IO size is known to
be more than 4KB.
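
For example (an illustrative value, not a recommendation from this
series), a user whose workload mostly issues 16KB IOs could set, in
.config:

    CONFIG_MD_RAID456_STRIPE_SIZE=16384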

To evaluate the new feature, we create a raid5 device '/dev/md5' with
4 SSD disks and test it on an arm64 machine with 64KB PAGE_SIZE.

1) We format /dev/md5 with mkfs.ext4 and mount it with the default
 options on /mnt. Then we test it with dbench using the command:
 dbench -D /mnt -t 1000 10. The results:

 'CONFIG_MD_RAID456_STRIPE_SIZE = 64K'

  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    9805011     0.021    64.728
  Close        7202525     0.001     0.120
  Rename        415213     0.051    44.681
  Unlink       1980066     0.079    93.147
  Deltree          240     1.793     6.516
  Mkdir            120     0.004     0.007
  Qpathinfo    8887512     0.007    37.114
  Qfileinfo    1557262     0.001     0.030
  Qfsinfo      1629582     0.012     0.152
  Sfileinfo     798756     0.040    57.641
  Find         3436004     0.019    57.782
  WriteX       4887239     0.021    57.638
  ReadX        15370483     0.005    37.818
  LockX          31934     0.003     0.022
  UnlockX        31933     0.001     0.021
  Flush         687205    13.302   530.088

 Throughput 307.799 MB/sec  10 clients  10 procs  max_latency=530.091 ms
 -------------------------------------------------------

 'CONFIG_MD_RAID456_STRIPE_SIZE = 4KB'

  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    11999166     0.021    36.380
  Close        8814128     0.001     0.122
  Rename        508113     0.051    29.169
  Unlink       2423242     0.070    38.141
  Deltree          300     1.885     7.155
  Mkdir            150     0.004     0.006
  Qpathinfo    10875921     0.007    35.485
  Qfileinfo    1905837     0.001     0.032
  Qfsinfo      1994304     0.012     0.125
  Sfileinfo     977450     0.029    26.489
  Find         4204952     0.019     9.361
  WriteX       5981890     0.019    27.804
  ReadX        18809742     0.004    33.491
  LockX          39074     0.003     0.025
  UnlockX        39074     0.001     0.014
  Flush         841022    10.712   458.848

 Throughput 376.777 MB/sec  10 clients  10 procs  max_latency=458.852 ms
 -------------------------------------------------------

 It shows that setting STRIPE_SIZE to 4KB gives higher throughput
 (376.777 vs 307.799 MB/sec) and lower max latency (458.852 vs
 530.091 ms) than setting it to 64KB.

 2) We evaluate the IO throughput of /dev/md5 with fio, using these
 job files:

 [4KB randwrite]
 direct=1
 numjobs=2
 iodepth=64
 ioengine=libaio
 filename=/dev/md5
 bs=4k
 rw=randwrite

 [1MB write]
 direct=1
 numjobs=2
 iodepth=64
 ioengine=libaio
 filename=/dev/md5
 bs=1M
 rw=write

 The results are as follows:

                | STRIPE_SIZE(64KB) | STRIPE_SIZE(4KB)
  --------------+-------------------+-----------------
  4KB randwrite |       15MB/s      |      100MB/s
  1MB write     |     1000MB/s      |      700MB/s

 The results show that when the issued IOs are bigger than 4KB (here,
 1MB sequential writes), 64KB STRIPE_SIZE gives much higher throughput.
 But for 4KB random writes, where the IOs issued to the device are
 smaller, 4KB STRIPE_SIZE performs better.

Thus, we provide a config option to set STRIPE_SIZE when PAGE_SIZE is
bigger than 4096. Normally, the default value (4096) gives relatively
good performance. But if each issued IO is bigger than 4096, setting a
larger value may give better performance.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
 drivers/md/Kconfig | 18 ++++++++++++++++++
 drivers/md/raid5.h |  4 +++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index d6d5ab23c088..05c5a315358b 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -157,6 +157,24 @@ config MD_RAID456
 
 	  If unsure, say Y.
 
+config MD_RAID456_STRIPE_SIZE
+	int "RAID4/RAID5/RAID6 stripe size"
+	default "4096"
+	depends on MD_RAID456
+	help
+	  The default value is 4096. Only set a different value when
+	  PAGE_SIZE is bigger than 4096. In that case, you can set it
+	  to a value larger than 4096, such as 8K, 16K or 64K, which
+	  must be a multiple of 4K.
+
+	  Setting a large value, such as 64K on arm64, only makes sense
+	  when you know the size of each IO issued to the raid device
+	  is more than 4096. Otherwise, just use the default value.
+
+	  Normally, the default STRIPE_SIZE gives better performance.
+	  Only change this value if you know what you are doing.
+
+
 config MD_MULTIPATH
 	tristate "Multipath I/O support"
 	depends on BLK_DEV_MD
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index f90e0704bed9..b0010991bdc8 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -472,7 +472,9 @@ struct disk_info {
  */
 
 #define NR_STRIPES		256
-#define STRIPE_SIZE		PAGE_SIZE
+#define STRIPE_SIZE	\
+	((PAGE_SIZE > 4096 && (CONFIG_MD_RAID456_STRIPE_SIZE % 4096 == 0)) ? \
+	 CONFIG_MD_RAID456_STRIPE_SIZE : PAGE_SIZE)
 #define STRIPE_SHIFT		(PAGE_SHIFT - 9)
 #define STRIPE_SECTORS		(STRIPE_SIZE>>9)
 #define	IO_THRESHOLD		1
-- 
2.21.1


* [PATCH RFC v2 2/8] md/raid5: add a member of r5pages for struct stripe_head
@ 2020-04-21 12:39 ` Yufen Yu
From: Yufen Yu @ 2020-04-21 12:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen

grow_buffers() uses alloc_page() to allocate the buffers for each
stripe_head. That means it will allocate a 64K buffer but use only 4K
of it once STRIPE_SIZE is set to 4096.

To avoid wasting memory, we pack the multiple 'pages' of sh->dev into
one real page. That is, multiple sh->dev[i].page pointers will point to
the same page at different offsets. An example with 64K PAGE_SIZE and
4K STRIPE_SIZE follows:

                    64K PAGE_SIZE
          +---+---+---+---+------------------------------+
          |   |   |   |   |
          |   |   |   |   |
          +-+-+-+-+-+-+-+-+------------------------------+
            ^   ^   ^   ^
            |   |   |   +----------------------------+
            |   |   |                                |
            |   |   +-------------------+            |
            |   |                       |            |
            |   +----------+            |            |
            |              |            |            |
            +-+            |            |            |
              |            |            |            |
        +-----+-----+------+-----+------+-----+------+------+
sh      | offset(0) | offset(4K) | offset(8K) | offset(12K) |
 +      +-----------+------------+------------+-------------+
 +----> dev[0].page  dev[1].page  dev[2].page  dev[3].page

After compressing the pages, users of sh->dev[i].page need to take care
of two things:

  1) When issuing a bio for a stripe_head, bi_io_vec.bv_page will point
     to the page directly. So we must make sure bv_offset is set to the
     correct offset.

  2) When computing xor, the page is passed to the compute function. So
     we also need to pass the page offset along, so that it can find the
     correct location within each sh->dev[i].page.

This patch adds a new member, r5pages, to stripe_head to manage all the
pages needed by each sh->dev[i]. We also add an 'offset' field to each
r5dev so that users can easily get the related page offset, and helper
functions to get a page and its index in the r5pages array by disk
index. This patch prepares for the following changes.
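
For illustration, a minimal sketch of how a caller obtains the buffer
of sh->dev[i] once r5pages is in use, using the helpers added below
('i' is the disk index):

    struct page *p = raid5_get_dev_page(sh, i);
    unsigned int off = raid5_get_page_offset(sh, i);
    /* start of dev[i]'s STRIPE_SIZE region within the shared page */
    void *buf = page_address(p) + off;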

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
 drivers/md/raid5.h | 56 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index b0010991bdc8..6296578e9585 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -246,6 +246,13 @@ struct stripe_head {
 		int 		     target, target2;
 		enum sum_check_flags zero_sum_result;
 	} ops;
+
+	/* These pages will be used by bios in dev[i] */
+	struct r5pages {
+		struct page	**page;
+		int	size; /* page array size */
+	} pages;
+
 	struct r5dev {
 		/* rreq and rvec are used for the replacement device when
 		 * writing data to both devices.
@@ -253,6 +260,7 @@ struct stripe_head {
 		struct bio	req, rreq;
 		struct bio_vec	vec, rvec;
 		struct page	*page, *orig_page;
+		unsigned int	offset;		/* offset of this page */
 		struct bio	*toread, *read, *towrite, *written;
 		sector_t	sector;			/* sector of this page */
 		unsigned long	flags;
@@ -754,6 +762,54 @@ static inline int algorithm_is_DDF(int layout)
 	return layout >= 8 && layout <= 10;
 }
 
+/*
+ * Return corresponding page index of r5pages array.
+ */
+static inline int raid5_get_page_index(struct stripe_head *sh, int disk_idx)
+{
+	WARN_ON(!sh->pages.page);
+	if (disk_idx >= sh->raid_conf->pool_size)
+		return -ENOENT;
+
+	return (disk_idx * STRIPE_SIZE) / PAGE_SIZE;
+}
+
+/*
+ * Return offset of the corresponding page for r5dev.
+ */
+static inline int raid5_get_page_offset(struct stripe_head *sh, int disk_idx)
+{
+	WARN_ON(!sh->pages.page);
+	if (disk_idx >= sh->raid_conf->pool_size)
+		return -ENOENT;
+
+	return (disk_idx * STRIPE_SIZE) % PAGE_SIZE;
+}
+
+/*
+ * Return corresponding page address for r5dev.
+ */
+static inline struct page *
+raid5_get_dev_page(struct stripe_head *sh, int disk_idx)
+{
+	int idx;
+
+	WARN_ON(!sh->pages.page);
+	idx = raid5_get_page_index(sh, disk_idx);
+	return sh->pages.page[idx];
+}
+
+/*
+ * We want to 'compress' multiple buffers into one real page for a
+ * stripe_head when PAGE_SIZE is bigger than STRIPE_SIZE. If the
+ * values are equal, there is no need to use this strategy. For now,
+ * only raid levels below 6 are supported.
+ */
+static inline int raid5_compress_stripe_pages(struct r5conf *conf)
+{
+	return (PAGE_SIZE > STRIPE_SIZE) && (conf->level < 6);
+}
+
 extern void md_raid5_kick_device(struct r5conf *conf);
 extern int raid5_set_cache_size(struct mddev *mddev, int size);
 extern sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous);
-- 
2.21.1


* [PATCH RFC v2 3/8] md/raid5: allocate and free pages of r5pages
@ 2020-04-21 12:39 ` Yufen Yu
From: Yufen Yu @ 2020-04-21 12:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen

When PAGE_SIZE is bigger than STRIPE_SIZE, allocate the pages of
r5pages in grow_buffers() and free them in shrink_buffers(). Then set
sh->dev[i].page and sh->dev[i].offset to the corresponding page in the
array and its offset. When compression is not enabled, we simply set
the offset to '0'.

When reshaping the raid array, newly allocated stripes can reuse the
old stripes' pages. Note that resize_stripes() is called only when
growing the number of disks in the array, so we need not worry about
leaking the old r5pages.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
 drivers/md/raid5.c | 142 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 128 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3f96b4406902..0d1e5bf4e5f2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -448,20 +448,72 @@ static struct stripe_head *get_free_stripe(struct r5conf *conf, int hash)
 	return sh;
 }
 
+/*
+ * Try to free all pages in r5pages array.
+ */
+static void free_stripe_pages(struct stripe_head *sh)
+{
+	int i;
+	struct page *p;
+
+	/* The page pool has not been allocated */
+	if (!sh->pages.page)
+		return;
+
+	for (i = 0; i < sh->pages.size; i++) {
+		p = sh->pages.page[i];
+		if (p)
+			put_page(p);
+		sh->pages.page[i] = NULL;
+	}
+}
+
+/*
+ * Allocate pages for r5pages.
+ */
+static int alloc_stripe_pages(struct stripe_head *sh, gfp_t gfp)
+{
+	int i;
+	struct page *p;
+
+	for (i = 0; i < sh->pages.size; i++) {
+		/* The page has already been allocated. */
+		if (sh->pages.page[i])
+			continue;
+
+		p = alloc_page(gfp);
+		if (!p) {
+			free_stripe_pages(sh);
+			return -ENOMEM;
+		}
+		sh->pages.page[i] = p;
+	}
+	return 0;
+}
+
 static void shrink_buffers(struct stripe_head *sh)
 {
 	struct page *p;
 	int i;
 	int num = sh->raid_conf->pool_size;
 
-	for (i = 0; i < num ; i++) {
+	if (raid5_compress_stripe_pages(sh->raid_conf))
+		free_stripe_pages(sh); /* Free pages in r5pages */
+
+	for (i = 0; i < num; i++) {
 		WARN_ON(sh->dev[i].page != sh->dev[i].orig_page);
 		p = sh->dev[i].page;
-		if (!p)
+
+		/*
+		 * If we use pages in r5pages, these pages have been
+		 * freed in free_stripe_pages().
+		 */
+		if (raid5_compress_stripe_pages(sh->raid_conf) || !p)
 			continue;
 		sh->dev[i].page = NULL;
 		put_page(p);
 	}
+
 }
 
 static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
@@ -469,14 +521,26 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
 	int i;
 	int num = sh->raid_conf->pool_size;
 
+	if (raid5_compress_stripe_pages(sh->raid_conf) &&
+			alloc_stripe_pages(sh, gfp))
+		return -ENOMEM;
+
 	for (i = 0; i < num; i++) {
 		struct page *page;
+		unsigned int offset;
 
-		if (!(page = alloc_page(gfp))) {
-			return 1;
+		if (raid5_compress_stripe_pages(sh->raid_conf)) {
+			page = raid5_get_dev_page(sh, i);
+			offset = raid5_get_page_offset(sh, i);
+		} else {
+			page = alloc_page(gfp);
+			if (!page)
+				return -ENOMEM;
+			offset = 0;
 		}
 		sh->dev[i].page = page;
 		sh->dev[i].orig_page = page;
+		sh->dev[i].offset = offset;
 	}
 
 	return 0;
@@ -2123,6 +2187,9 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 
 static void free_stripe(struct kmem_cache *sc, struct stripe_head *sh)
 {
+	if (sh->pages.page)
+		kfree(sh->pages.page);
+
 	if (sh->ppl_page)
 		__free_page(sh->ppl_page);
 	kmem_cache_free(sc, sh);
@@ -2154,14 +2221,28 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 
 		if (raid5_has_ppl(conf)) {
 			sh->ppl_page = alloc_page(gfp);
-			if (!sh->ppl_page) {
-				free_stripe(sc, sh);
-				sh = NULL;
-			}
+			if (!sh->ppl_page)
+				goto fail;
+		}
+
+		if (raid5_compress_stripe_pages(conf)) {
+			int nr_page;
+
+			/* Each of the sh->dev[i] needs one STRIPE_SIZE buffer */
+			nr_page = (disks * STRIPE_SIZE + PAGE_SIZE - 1) / PAGE_SIZE;
+			sh->pages.page = kzalloc(sizeof(struct page *) * nr_page, gfp);
+			if (!sh->pages.page)
+				goto fail;
+			sh->pages.size = nr_page;
 		}
 	}
 	return sh;
+
+fail:
+	free_stripe(sc, sh);
+	return NULL;
 }
+
 static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
 {
 	struct stripe_head *sh;
@@ -2360,10 +2441,18 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 		osh = get_free_stripe(conf, hash);
 		unlock_device_hash_lock(conf, hash);
 
-		for(i=0; i<conf->pool_size; i++) {
+		if (raid5_compress_stripe_pages(conf)) {
+			/* We reuse pages in r5pages of old stripe head */
+			for (i = 0; i < osh->pages.size; i++)
+				nsh->pages.page[i] = osh->pages.page[i];
+		}
+
+		for (i = 0; i < conf->pool_size; i++) {
 			nsh->dev[i].page = osh->dev[i].page;
 			nsh->dev[i].orig_page = osh->dev[i].page;
+			nsh->dev[i].offset = osh->dev[i].offset;
 		}
+
 		nsh->hash_lock_index = hash;
 		free_stripe(conf->slab_cache, osh);
 		cnt++;
@@ -2410,17 +2499,42 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 
 	/* Step 4, return new stripes to service */
 	while(!list_empty(&newstripes)) {
+		struct page *p;
+		unsigned int offset;
 		nsh = list_entry(newstripes.next, struct stripe_head, lru);
 		list_del_init(&nsh->lru);
 
-		for (i=conf->raid_disks; i < newsize; i++)
-			if (nsh->dev[i].page == NULL) {
-				struct page *p = alloc_page(GFP_NOIO);
-				nsh->dev[i].page = p;
-				nsh->dev[i].orig_page = p;
+		/*
+		 * If we use r5pages (i.e. pages.size is not zero),
+		 * allocate the pages needed by the new stripe_head.
+		 */
+		for (i = 0; i < nsh->pages.size; i++) {
+			if (nsh->pages.page[i] == NULL) {
+				p = alloc_page(GFP_NOIO);
 				if (!p)
 					err = -ENOMEM;
+				nsh->pages.page[i] = p;
 			}
+		}
+
+		for (i = conf->raid_disks; i < newsize; i++) {
+			if (nsh->dev[i].page)
+				continue;
+
+			if (raid5_compress_stripe_pages(conf)) {
+				p = raid5_get_dev_page(nsh, i);
+				offset = raid5_get_page_offset(nsh, i);
+			} else {
+				p = alloc_page(GFP_NOIO);
+				if (!p)
+					err = -ENOMEM;
+				offset = 0;
+			}
+
+			nsh->dev[i].page = p;
+			nsh->dev[i].orig_page = p;
+			nsh->dev[i].offset = offset;
+		}
 		raid5_release_stripe(nsh);
 	}
 	/* critical section pass, GFP_NOIO no longer needed */
-- 
2.21.1


* [PATCH RFC v2 4/8] md/raid5: set correct page offset for bi_io_vec in ops_run_io()
@ 2020-04-21 12:39 ` Yufen Yu
From: Yufen Yu @ 2020-04-21 12:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen

After using r5pages for each sh->dev[i], we need to set the correct
page offset in bi_io_vec when issuing a bio. The offset is zero when
r5pages is not used.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
 drivers/md/raid5.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0d1e5bf4e5f2..87c3bbfadf54 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1194,7 +1194,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 				sh->dev[i].vec.bv_page = sh->dev[i].page;
 			bi->bi_vcnt = 1;
 			bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
-			bi->bi_io_vec[0].bv_offset = 0;
+			bi->bi_io_vec[0].bv_offset = sh->dev[i].offset;
 			bi->bi_iter.bi_size = STRIPE_SIZE;
 			bi->bi_write_hint = sh->dev[i].write_hint;
 			if (!rrdev)
@@ -1248,7 +1248,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 			sh->dev[i].rvec.bv_page = sh->dev[i].page;
 			rbi->bi_vcnt = 1;
 			rbi->bi_io_vec[0].bv_len = STRIPE_SIZE;
-			rbi->bi_io_vec[0].bv_offset = 0;
+			rbi->bi_io_vec[0].bv_offset = sh->dev[i].offset;
 			rbi->bi_iter.bi_size = STRIPE_SIZE;
 			rbi->bi_write_hint = sh->dev[i].write_hint;
 			sh->dev[i].write_hint = RWH_WRITE_LIFE_NOT_SET;
-- 
2.21.1


* [PATCH RFC v2 5/8] md/raid5: set correct page offset for async_copy_data()
@ 2020-04-21 12:39 ` Yufen Yu
From: Yufen Yu @ 2020-04-21 12:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen

ops_run_biofill() and ops_run_biodrain() call async_copy_data() to copy
sh->dev[i].page from or to a bio. They also need to use the correct
page offset for dev->page when r5pages is used.

Without modifying the original code logic, we simply replace
'page_offset' with 'page_offset + poff'. When r5pages is not used,
poff is zero.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
 drivers/md/raid5.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 87c3bbfadf54..52efacd486ab 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1290,7 +1290,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 static struct dma_async_tx_descriptor *
 async_copy_data(int frombio, struct bio *bio, struct page **page,
-	sector_t sector, struct dma_async_tx_descriptor *tx,
+	unsigned int poff, sector_t sector, struct dma_async_tx_descriptor *tx,
 	struct stripe_head *sh, int no_skipcopy)
 {
 	struct bio_vec bvl;
@@ -1325,6 +1325,7 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
 		else
 			clen = len;
 
+
 		if (clen > 0) {
 			b_offset += bvl.bv_offset;
 			bio_page = bvl.bv_page;
@@ -1335,11 +1336,12 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
 				    !no_skipcopy)
 					*page = bio_page;
 				else
-					tx = async_memcpy(*page, bio_page, page_offset,
-						  b_offset, clen, &submit);
+					tx = async_memcpy(*page, bio_page,
+						  page_offset + poff, b_offset,
+						  clen, &submit);
 			} else
 				tx = async_memcpy(bio_page, *page, b_offset,
-						  page_offset, clen, &submit);
+						  page_offset + poff, clen, &submit);
 		}
 		/* chain the operations */
 		submit.depend_tx = tx;
@@ -1410,7 +1412,7 @@ static void ops_run_biofill(struct stripe_head *sh)
 			while (rbi && rbi->bi_iter.bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				tx = async_copy_data(0, rbi, &dev->page,
-						     dev->sector, tx, sh, 0);
+						dev->offset, dev->sector, tx, sh, 0);
 				rbi = r5_next_bio(rbi, dev->sector);
 			}
 		}
@@ -1825,7 +1827,8 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 					set_bit(R5_Discard, &dev->flags);
 				else {
 					tx = async_copy_data(1, wbi, &dev->page,
-							     dev->sector, tx, sh,
+							     dev->offset, dev->sector,
+							     tx, sh,
 							     r5c_is_writeback(conf->log));
 					if (dev->page != dev->orig_page &&
 					    !r5c_is_writeback(conf->log)) {
-- 
2.21.1


* [PATCH RFC v2 6/8] md/raid5: add new xor function to support different page offset
@ 2020-04-21 12:39 ` Yufen Yu
From: Yufen Yu @ 2020-04-21 12:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen

RAID5 calls async_xor() and async_xor_val() to compute xor. However,
both of them require a common src/dst page offset. After introducing
the r5pages array, we want these xor functions to support different
src/dst page offsets.

For now, only raid levels below 6 use the pages array, so the old xor
functions need to be kept for the other raid levels. Here, we add two
new functions, async_xor_offsets() and async_xor_val_offsets(),
corresponding to async_xor() and async_xor_val().
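
For illustration, a hypothetical call with per-source offsets, in the
style of the call sites added later in this series:

    tx = async_xor_offsets(xor_dest, dest_offset, xor_srcs, offs,
                           count, STRIPE_SIZE, &submit);

Passing src_offset == NULL keeps the old common-offset behaviour, and
the original async_xor() becomes a thin wrapper on top of it.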

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
 crypto/async_tx/async_xor.c | 120 +++++++++++++++++++++++++++++++-----
 include/linux/async_tx.h    |  11 ++++
 2 files changed, 114 insertions(+), 17 deletions(-)

diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
index 4e5eebe52e6a..29a979358332 100644
--- a/crypto/async_tx/async_xor.c
+++ b/crypto/async_tx/async_xor.c
@@ -97,7 +97,8 @@ do_async_xor(struct dma_chan *chan, struct dmaengine_unmap_data *unmap,
 }
 
 static void
-do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
+do_sync_xor_offsets(struct page *dest, unsigned int offset,
+		struct page **src_list, unsigned int *src_offset,
 	    int src_cnt, size_t len, struct async_submit_ctl *submit)
 {
 	int i;
@@ -114,7 +115,8 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
 	/* convert to buffer pointers */
 	for (i = 0; i < src_cnt; i++)
 		if (src_list[i])
-			srcs[xor_src_cnt++] = page_address(src_list[i]) + offset;
+			srcs[xor_src_cnt++] = page_address(src_list[i]) +
+				(src_offset ? src_offset[i] : offset);
 	src_cnt = xor_src_cnt;
 	/* set destination address */
 	dest_buf = page_address(dest) + offset;
@@ -135,11 +137,31 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
 	async_tx_sync_epilog(submit);
 }
 
+static inline bool
+dma_xor_aligned_offsets(struct dma_device *device, unsigned int offset,
+		unsigned int *src_offset, int src_cnt, int len)
+{
+	int i;
+
+	if (!is_dma_xor_aligned(device, offset, 0, len))
+		return false;
+
+	if (!src_offset)
+		return true;
+
+	for (i = 0; i < src_cnt; i++) {
+		if (!is_dma_xor_aligned(device, src_offset[i], 0, len))
+			return false;
+	}
+	return true;
+}
+
 /**
- * async_xor - attempt to xor a set of blocks with a dma engine.
+ * async_xor_offsets - attempt to xor a set of blocks with a dma engine.
  * @dest: destination page
+ * @offset: dst offset to start transaction
  * @src_list: array of source pages
- * @offset: common src/dst offset to start transaction
+ * @src_offset: array of source page offsets; NULL means common src/dst offset
  * @src_cnt: number of source pages
  * @len: length in bytes
  * @submit: submission / completion modifiers
@@ -157,8 +179,9 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
  * is not specified.
  */
 struct dma_async_tx_descriptor *
-async_xor(struct page *dest, struct page **src_list, unsigned int offset,
-	  int src_cnt, size_t len, struct async_submit_ctl *submit)
+async_xor_offsets(struct page *dest, unsigned int offset,
+		struct page **src_list, unsigned int *src_offset,
+		int src_cnt, size_t len, struct async_submit_ctl *submit)
 {
 	struct dma_chan *chan = async_tx_find_channel(submit, DMA_XOR,
 						      &dest, 1, src_list,
@@ -171,7 +194,8 @@ async_xor(struct page *dest, struct page **src_list, unsigned int offset,
 	if (device)
 		unmap = dmaengine_get_unmap_data(device->dev, src_cnt+1, GFP_NOWAIT);
 
-	if (unmap && is_dma_xor_aligned(device, offset, 0, len)) {
+	if (unmap && dma_xor_aligned_offsets(device, offset,
+				src_offset, src_cnt, len)) {
 		struct dma_async_tx_descriptor *tx;
 		int i, j;
 
@@ -184,7 +208,8 @@ async_xor(struct page *dest, struct page **src_list, unsigned int offset,
 				continue;
 			unmap->to_cnt++;
 			unmap->addr[j++] = dma_map_page(device->dev, src_list[i],
-							offset, len, DMA_TO_DEVICE);
+							(src_offset ? src_offset[i] : offset),
+							len, DMA_TO_DEVICE);
 		}
 
 		/* map it bidirectional as it may be re-used as a source */
@@ -213,11 +238,42 @@ async_xor(struct page *dest, struct page **src_list, unsigned int offset,
 		/* wait for any prerequisite operations */
 		async_tx_quiesce(&submit->depend_tx);
 
-		do_sync_xor(dest, src_list, offset, src_cnt, len, submit);
+		do_sync_xor_offsets(dest, offset, src_list, src_offset,
+				src_cnt, len, submit);
 
 		return NULL;
 	}
 }
+EXPORT_SYMBOL_GPL(async_xor_offsets);
+
+/**
+ * async_xor - attempt to xor a set of blocks with a dma engine.
+ * @dest: destination page
+ * @src_list: array of source pages
+ * @offset: common src/dst offset to start transaction
+ * @src_cnt: number of source pages
+ * @len: length in bytes
+ * @submit: submission / completion modifiers
+ *
+ * honored flags: ASYNC_TX_ACK, ASYNC_TX_XOR_ZERO_DST, ASYNC_TX_XOR_DROP_DST
+ *
+ * xor_blocks always uses the dest as a source so the
+ * ASYNC_TX_XOR_ZERO_DST flag must be set to not include dest data in
+ * the calculation.  The assumption with dma engines is that they only
+ * use the destination buffer as a source when it is explicitly specified
+ * in the source list.
+ *
+ * src_list note: if the dest is also a source it must be at index zero.
+ * The contents of this array will be overwritten if a scribble region
+ * is not specified.
+ */
+struct dma_async_tx_descriptor *
+async_xor(struct page *dest, struct page **src_list, unsigned int offset,
+	  int src_cnt, size_t len, struct async_submit_ctl *submit)
+{
+	return async_xor_offsets(dest, offset, src_list, NULL,
+			src_cnt, len, submit);
+}
 EXPORT_SYMBOL_GPL(async_xor);
 
 static int page_is_zero(struct page *p, unsigned int offset, size_t len)
@@ -237,10 +293,11 @@ xor_val_chan(struct async_submit_ctl *submit, struct page *dest,
 }
 
 /**
- * async_xor_val - attempt a xor parity check with a dma engine.
+ * async_xor_val_offsets - attempt a xor parity check with a dma engine.
  * @dest: destination page used if the xor is performed synchronously
+ * @offset: dst offset in pages to start transaction
  * @src_list: array of source pages
- * @offset: offset in pages to start transaction
+ * @src_offset: array of source page offsets; NULL means common src/dst offset
  * @src_cnt: number of source pages
  * @len: length in bytes
  * @result: 0 if sum == 0 else non-zero
@@ -253,9 +310,10 @@ xor_val_chan(struct async_submit_ctl *submit, struct page *dest,
  * is not specified.
  */
 struct dma_async_tx_descriptor *
-async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
-	      int src_cnt, size_t len, enum sum_check_flags *result,
-	      struct async_submit_ctl *submit)
+async_xor_val_offsets(struct page *dest, unsigned int offset,
+		struct page **src_list, unsigned int *src_offset,
+		int src_cnt, size_t len, enum sum_check_flags *result,
+		struct async_submit_ctl *submit)
 {
 	struct dma_chan *chan = xor_val_chan(submit, dest, src_list, src_cnt, len);
 	struct dma_device *device = chan ? chan->device : NULL;
@@ -268,7 +326,7 @@ async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
 		unmap = dmaengine_get_unmap_data(device->dev, src_cnt, GFP_NOWAIT);
 
 	if (unmap && src_cnt <= device->max_xor &&
-	    is_dma_xor_aligned(device, offset, 0, len)) {
+	    dma_xor_aligned_offsets(device, offset, src_offset, src_cnt, len)) {
 		unsigned long dma_prep_flags = 0;
 		int i;
 
@@ -281,7 +339,8 @@ async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
 
 		for (i = 0; i < src_cnt; i++) {
 			unmap->addr[i] = dma_map_page(device->dev, src_list[i],
-						      offset, len, DMA_TO_DEVICE);
+						(src_offset ? src_offset[i] : offset),
+						len, DMA_TO_DEVICE);
 			unmap->to_cnt++;
 		}
 		unmap->len = len;
@@ -312,7 +371,8 @@ async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
 		submit->flags |= ASYNC_TX_XOR_DROP_DST;
 		submit->flags &= ~ASYNC_TX_ACK;
 
-		tx = async_xor(dest, src_list, offset, src_cnt, len, submit);
+		tx = async_xor_offsets(dest, offset, src_list, src_offset,
+				src_cnt, len, submit);
 
 		async_tx_quiesce(&tx);
 
@@ -325,6 +385,32 @@ async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
 
 	return tx;
 }
+EXPORT_SYMBOL_GPL(async_xor_val_offsets);
+
+/**
+ * async_xor_val - attempt a xor parity check with a dma engine.
+ * @dest: destination page used if the xor is performed synchronously
+ * @src_list: array of source pages
+ * @offset: offset in pages to start transaction
+ * @src_cnt: number of source pages
+ * @len: length in bytes
+ * @result: 0 if sum == 0 else non-zero
+ * @submit: submission / completion modifiers
+ *
+ * honored flags: ASYNC_TX_ACK
+ *
+ * src_list note: if the dest is also a source it must be at index zero.
+ * The contents of this array will be overwritten if a scribble region
+ * is not specified.
+ */
+struct dma_async_tx_descriptor *
+async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
+	      int src_cnt, size_t len, enum sum_check_flags *result,
+	      struct async_submit_ctl *submit)
+{
+	return async_xor_val_offsets(dest, offset, src_list, NULL, src_cnt,
+			len, result, submit);
+}
 EXPORT_SYMBOL_GPL(async_xor_val);
 
 MODULE_AUTHOR("Intel Corporation");
diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h
index 75e582b8d2d9..8d79e2de06bd 100644
--- a/include/linux/async_tx.h
+++ b/include/linux/async_tx.h
@@ -162,11 +162,22 @@ struct dma_async_tx_descriptor *
 async_xor(struct page *dest, struct page **src_list, unsigned int offset,
 	  int src_cnt, size_t len, struct async_submit_ctl *submit);
 
+struct dma_async_tx_descriptor *
+async_xor_offsets(struct page *dest, unsigned int offset,
+		struct page **src_list, unsigned int *src_offset,
+		int src_cnt, size_t len, struct async_submit_ctl *submit);
+
 struct dma_async_tx_descriptor *
 async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
 	      int src_cnt, size_t len, enum sum_check_flags *result,
 	      struct async_submit_ctl *submit);
 
+struct dma_async_tx_descriptor *
+async_xor_val_offsets(struct page *dest, unsigned int offset,
+		struct page **src_list, unsigned int *src_offset,
+		int src_cnt, size_t len, enum sum_check_flags *result,
+		struct async_submit_ctl *submit);
+
 struct dma_async_tx_descriptor *
 async_memcpy(struct page *dest, struct page *src, unsigned int dest_offset,
 	     unsigned int src_offset, size_t len,
-- 
2.21.1


* [PATCH RFC v2 7/8] md/raid5: add offset array in scribble buffer
@ 2020-04-21 12:39 ` Yufen Yu
From: Yufen Yu @ 2020-04-21 12:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen

When buffer compression is enabled for a stripe_head, computing xor
needs an offset array recording each page's offset. To avoid repeatedly
allocating a new array each time, we add a region to the scribble
buffer to record the offsets; the resulting layout is sketched below.
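
For reference, a sketch of the per-stripe scribble object layout after
this change (sizes follow scribble_alloc() below):

    +-----------------------+---------------------+----------------------+
    | struct page * [num+2] | addr_conv_t [num+2] | unsigned int [num+2] |
    +-----------------------+---------------------+----------------------+
      to_addr_page()          to_addr_conv()        to_addr_offs()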

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
 drivers/md/raid5.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 52efacd486ab..801491748043 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1467,6 +1467,15 @@ static addr_conv_t *to_addr_conv(struct stripe_head *sh,
 	return (void *) (to_addr_page(percpu, i) + sh->disks + 2);
 }
 
+/*
+ * Return a pointer to record offset address.
+ */
+static unsigned int *
+to_addr_offs(struct stripe_head *sh, struct raid5_percpu *percpu)
+{
+	return (unsigned int *) (to_addr_conv(sh, percpu, 0) + sh->disks + 2);
+}
+
 static struct dma_async_tx_descriptor *
 ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
 {
@@ -2315,8 +2324,9 @@ static int scribble_alloc(struct raid5_percpu *percpu,
 			  int num, int cnt, gfp_t flags)
 {
 	size_t obj_size =
-		sizeof(struct page *) * (num+2) +
-		sizeof(addr_conv_t) * (num+2);
+		sizeof(struct page *) * (num + 2) +
+		sizeof(addr_conv_t) * (num + 2) +
+		sizeof(unsigned int) * (num + 2);
 	void *scribble;
 
 	scribble = kvmalloc_array(cnt, obj_size, flags);
-- 
2.21.1


* [PATCH RFC v2 8/8] md/raid5: compute xor with correct page offset
@ 2020-04-21 12:39 ` Yufen Yu
From: Yufen Yu @ 2020-04-21 12:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen

When computing xor, the page addresses are passed to the compute
function. Now that pages may come from r5pages, we also need to pass
each page's offset so that the function knows the correct location
within each page.

For now, raid5-cache and raid5-ppl are supported only when PAGE_SIZE
is 4096. In that case r5pages is not used, so we can simply set the
offset to zero.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
 drivers/md/raid5.c | 64 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 51 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 801491748043..0c7056a46ba6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1481,6 +1481,7 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
 {
 	int disks = sh->disks;
 	struct page **xor_srcs = to_addr_page(percpu, 0);
+	unsigned int *offs = to_addr_offs(sh, percpu);
 	int target = sh->ops.target;
 	struct r5dev *tgt = &sh->dev[target];
 	struct page *xor_dest = tgt->page;
@@ -1488,6 +1489,7 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
 	struct dma_async_tx_descriptor *tx;
 	struct async_submit_ctl submit;
 	int i;
+	unsigned int des_offset = tgt->offset;
 
 	BUG_ON(sh->batch_head);
 
@@ -1495,18 +1497,23 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
 		__func__, (unsigned long long)sh->sector, target);
 	BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
 
-	for (i = disks; i--; )
-		if (i != target)
+	for (i = disks; i--; ) {
+		if (i != target) {
+			offs[count] = sh->dev[i].offset;
 			xor_srcs[count++] = sh->dev[i].page;
+		}
+	}
 
 	atomic_inc(&sh->count);
 
 	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, NULL,
 			  ops_complete_compute, sh, to_addr_conv(sh, percpu, 0));
 	if (unlikely(count == 1))
-		tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
+		tx = async_memcpy(xor_dest, xor_srcs[0], des_offset, offs[0],
+				STRIPE_SIZE, &submit);
 	else
-		tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
+		tx = async_xor_offsets(xor_dest, des_offset, xor_srcs, offs,
+				count, STRIPE_SIZE, &submit);
 
 	return tx;
 }
@@ -1745,10 +1752,12 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
 {
 	int disks = sh->disks;
 	struct page **xor_srcs = to_addr_page(percpu, 0);
+	unsigned int *offs = to_addr_offs(sh, percpu);
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct async_submit_ctl submit;
 
 	/* existing parity data subtracted */
+	unsigned int des_offset = offs[count] = sh->dev[pd_idx].offset;
 	struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
 
 	BUG_ON(sh->batch_head);
@@ -1758,15 +1767,23 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 		/* Only process blocks that are known to be uptodate */
-		if (test_bit(R5_InJournal, &dev->flags))
+		if (test_bit(R5_InJournal, &dev->flags)) {
+			/*
+			 * In this case, PAGE_SIZE must be 4KB and r5pages
+			 * is not used, so we can just set the offset to zero.
+			 */
+			offs[count] = 0;
 			xor_srcs[count++] = dev->orig_page;
-		else if (test_bit(R5_Wantdrain, &dev->flags))
+		} else if (test_bit(R5_Wantdrain, &dev->flags)) {
+			offs[count] = dev->offset;
 			xor_srcs[count++] = dev->page;
+		}
 	}
 
 	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
 			  ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0));
-	tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
+	tx = async_xor_offsets(xor_dest, des_offset, xor_srcs, offs,
+			count, STRIPE_SIZE, &submit);
 
 	return tx;
 }
@@ -1916,6 +1933,7 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
 {
 	int disks = sh->disks;
 	struct page **xor_srcs;
+	unsigned int *offs;
 	struct async_submit_ctl submit;
 	int count, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
@@ -1924,6 +1942,7 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
 	int j = 0;
 	struct stripe_head *head_sh = sh;
 	int last_stripe;
+	unsigned int des_offset;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
@@ -1940,27 +1959,37 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
 		ops_complete_reconstruct(sh);
 		return;
 	}
+
 again:
 	count = 0;
 	xor_srcs = to_addr_page(percpu, j);
+	offs = to_addr_offs(sh, percpu);
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (written)
 	 */
 	if (head_sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
 		prexor = 1;
+		des_offset = offs[count] = sh->dev[pd_idx].offset;
 		xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
+
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 			if (head_sh->dev[i].written ||
-			    test_bit(R5_InJournal, &head_sh->dev[i].flags))
+			    test_bit(R5_InJournal, &head_sh->dev[i].flags)) {
+				offs[count] = dev->offset;
 				xor_srcs[count++] = dev->page;
+			}
 		}
 	} else {
 		xor_dest = sh->dev[pd_idx].page;
+		des_offset = sh->dev[pd_idx].offset;
+
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if (i != pd_idx)
+			if (i != pd_idx) {
+				offs[count] = dev->offset;
 				xor_srcs[count++] = dev->page;
+			}
 		}
 	}
 
@@ -1986,9 +2015,12 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
 	}
 
 	if (unlikely(count == 1))
-		tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
+		tx = async_memcpy(xor_dest, xor_srcs[0], des_offset,
+				offs[0], STRIPE_SIZE, &submit);
 	else
-		tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
+		tx = async_xor_offsets(xor_dest, des_offset, xor_srcs,
+				offs, count, STRIPE_SIZE, &submit);
+
 	if (!last_stripe) {
 		j++;
 		sh = list_first_entry(&sh->batch_list, struct stripe_head,
@@ -2076,10 +2108,12 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
 	int qd_idx = sh->qd_idx;
 	struct page *xor_dest;
 	struct page **xor_srcs = to_addr_page(percpu, 0);
+	unsigned int *offs = to_addr_offs(sh, percpu);
 	struct dma_async_tx_descriptor *tx;
 	struct async_submit_ctl submit;
 	int count;
 	int i;
+	unsigned int dest_offset;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
@@ -2087,17 +2121,21 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
 	BUG_ON(sh->batch_head);
 	count = 0;
 	xor_dest = sh->dev[pd_idx].page;
+	dest_offset = sh->dev[pd_idx].offset;
+	offs[count] = dest_offset;
 	xor_srcs[count++] = xor_dest;
+
 	for (i = disks; i--; ) {
 		if (i == pd_idx || i == qd_idx)
 			continue;
+		offs[count] = sh->dev[i].offset;
 		xor_srcs[count++] = sh->dev[i].page;
 	}
 
 	init_async_submit(&submit, 0, NULL, NULL, NULL,
 			  to_addr_conv(sh, percpu, 0));
-	tx = async_xor_val(xor_dest, xor_srcs, 0, count, STRIPE_SIZE,
-			   &sh->ops.zero_sum_result, &submit);
+	tx = async_xor_val_offsets(xor_dest, dest_offset, xor_srcs, offs,
+			count, STRIPE_SIZE, &sh->ops.zero_sum_result, &submit);
 
 	atomic_inc(&sh->count);
 	init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_check, sh, NULL);
-- 
2.21.1


* Re: [PATCH RFC v2 0/8] md/raid5: set STRIPE_SIZE as a configurable value
@ 2020-04-21 12:56 ` Paul Menzel
From: Paul Menzel @ 2020-04-21 12:56 UTC (permalink / raw)
  To: Yufen Yu, song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1

Dear Yufen,


Thank you for your patch set.

On 21.04.20 at 14:39, Yufen Yu wrote:

>   Currently, STRIPE_SIZE is equal to PAGE_SIZE. That means RAID5 will
>   issue each bio to disk in units of at least 64KB when PAGE_SIZE is
>   64KB on arm64. However, filesystems usually issue bios in units of
>   4KB, so RAID5 wastes disk bandwidth.
> 
>   To solve this problem, this patchset provides a new config option,
>   CONFIG_MD_RAID456_STRIPE_SIZE, to let users configure STRIPE_SIZE.
>   The default value is 4096.
> 
>   Normally, the default STRIPE_SIZE gives better performance, and
>   NeilBrown has suggested simply fixing STRIPE_SIZE at 4096. However,
>   our test results show that a larger STRIPE_SIZE can perform better
>   when the issued IOs are mostly bigger than 4096 bytes. Thus, this
>   patchset still makes STRIPE_SIZE a configurable value.

>   In the current implementation, grow_buffers() uses alloc_page() to
>   allocate the buffers for each stripe_head. With this change, that
>   would mean allocating a 64K buffer but using only 4K of it. To save
>   memory, we try to 'compress' multiple buffers of a stripe_head into
>   one real page. Details are in patch #2.
> 
>   To evaluate the new feature, we create a raid5 device '/dev/md5'
>   with 4 SSD disks and test it on an arm64 machine with 64KB PAGE_SIZE.

[…]

So, what is affecting the performance? The size of the bio in the used
file system? Shouldn’t it then be a run-time option (a Linux CLI
parameter and /proc) so the Linux kernel doesn’t need to be recompiled
for different servers? Should the option even be per RAID, as each
RAID5 device might be using another filesystem?


Kind regards,

Paul


* Re: [PATCH RFC v2 0/8] md/raid5: set STRIPE_SIZE as a configurable value
@ 2020-04-23  3:01   ` Yufen Yu
From: Yufen Yu @ 2020-04-23  3:01 UTC (permalink / raw)
  To: Paul Menzel, song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1

Hi

On 2020/4/21 20:56, Paul Menzel wrote:
> Dear Yufen,
> 
> 
> Thank you for your patch set.
> 
> On 21.04.20 at 14:39, Yufen Yu wrote:
> 
>>   Currently, STRIPE_SIZE is equal to PAGE_SIZE. That means RAID5 will
>>   issue each bio to disk in units of at least 64KB when PAGE_SIZE is
>>   64KB on arm64. However, filesystems usually issue bios in units of
>>   4KB, so RAID5 wastes disk bandwidth.
>>
>>   To solve this problem, this patchset provides a new config option,
>>   CONFIG_MD_RAID456_STRIPE_SIZE, to let users configure STRIPE_SIZE.
>>   The default value is 4096.
>>
>>   Normally, the default STRIPE_SIZE gives better performance, and
>>   NeilBrown has suggested simply fixing STRIPE_SIZE at 4096. However,
>>   our test results show that a larger STRIPE_SIZE can perform better
>>   when the issued IOs are mostly bigger than 4096 bytes. Thus, this
>>   patchset still makes STRIPE_SIZE a configurable value.
>>
>>   In the current implementation, grow_buffers() uses alloc_page() to
>>   allocate the buffers for each stripe_head. With this change, that
>>   would mean allocating a 64K buffer but using only 4K of it. To save
>>   memory, we try to 'compress' multiple buffers of a stripe_head into
>>   one real page. Details are in patch #2.
>>
>>   To evaluate the new feature, we create a raid5 device '/dev/md5'
>>   with 4 SSD disks and test it on an arm64 machine with 64KB PAGE_SIZE.
> 
> […]
> 
> So, what is affecting the performance? The size of the bio in the used file system? Shouldn’t it then be a run-time option (a Linux CLI parameter and /proc) so the Linux kernel doesn’t need to be recompiled for different servers? Should the option even be per RAID, as each RAID5 device might be using another filesystem?


We used perf and blktrace to dig into the factors that affect the
performance. We ran the '4KB randwrite' test with 'STRIPE_SIZE = 4K'
and with 'STRIPE_SIZE = 64K' on our arm64 machine with 64KB PAGE_SIZE.
It shows that 'STRIPE_SIZE = 64K' needs more time to compute xor and
to issue IO. Details:

1) perf and a Flame Graph show that the xor compute function (i.e.
    async_xor) accounts for about 1.45% vs 14.87% of the time. For each
    4KB bio, raid5 computes a 64KB xor with 'STRIPE_SIZE = 64K', but
    only a 4KB xor with 'STRIPE_SIZE = 4K'.

2) blktrace traces each issued bio and shows a D2C latency (from
    issuing the IO to the driver until it completes) of about 114.7us
    vs 1325.6us. That means a big bio consumes more time in the driver.
    A sketch of this kind of measurement is shown below.

We know RAID5 can be reshaped by user. If STRIPE_SIZE is not equal to that before doing
reshape, it can cause more complicated problems. But I think we can add a modules parameter
for raid5 to set STRIPE_SIZE.
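
For illustration, a minimal sketch of such a module parameter (the name and
the bounds are illustrative; nothing like this is in v2):

#include <linux/moduleparam.h>

/* Hypothetical raid456 module parameter; v2 only has the Kconfig option. */
static unsigned int stripe_size = 4096;
module_param(stripe_size, uint, 0444);
MODULE_PARM_DESC(stripe_size,
		 "stripe size in bytes, a multiple of 4096 and at most PAGE_SIZE");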

Thanks,
Yufen


* Re: [PATCH RFC v2 1/8] md/raid5: add CONFIG_MD_RAID456_STRIPE_SIZE to set STRIPE_SIZE
  2020-04-21 12:39 ` [PATCH RFC v2 1/8] md/raid5: add CONFIG_MD_RAID456_STRIPE_SIZE to set STRIPE_SIZE Yufen Yu
@ 2020-04-23  6:23   ` kbuild test robot
  2020-05-08  0:56   ` Song Liu
  1 sibling, 0 replies; 16+ messages in thread
From: kbuild test robot @ 2020-04-23  6:23 UTC (permalink / raw)
  To: kbuild-all


Hi Yufen,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on dm/for-next]
[also build test ERROR on cryptodev/master crypto/master linus/master v5.7-rc2 next-20200422]
[cannot apply to md/for-next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Yufen-Yu/md-raid5-set-STRIPE_SIZE-as-a-configurable-value/20200423-063441
base:   https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git for-next
config: ia64-defconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 9.3.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day GCC_VERSION=9.3.0 make.cross ARCH=ia64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   In file included from drivers/md/raid0.c:20:
   drivers/md/raid5.h: In function 'r5_next_bio':
>> drivers/md/raid5.h:476:25: error: 'CONFIG_MD_RAID456_STRIPE_SIZE' undeclared (first use in this function)
     476 |  ((PAGE_SIZE > 4096 && (CONFIG_MD_RAID456_STRIPE_SIZE % 4096 == 0)) ? \
         |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/md/raid5.h:479:26: note: in expansion of macro 'STRIPE_SIZE'
     479 | #define STRIPE_SECTORS  (STRIPE_SIZE>>9)
         |                          ^~~~~~~~~~~
>> drivers/md/raid5.h:497:37: note: in expansion of macro 'STRIPE_SECTORS'
     497 |  if (bio_end_sector(bio) < sector + STRIPE_SECTORS)
         |                                     ^~~~~~~~~~~~~~
   drivers/md/raid5.h:476:25: note: each undeclared identifier is reported only once for each function it appears in
     476 |  ((PAGE_SIZE > 4096 && (CONFIG_MD_RAID456_STRIPE_SIZE % 4096 == 0)) ? \
         |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/md/raid5.h:479:26: note: in expansion of macro 'STRIPE_SIZE'
     479 | #define STRIPE_SECTORS  (STRIPE_SIZE>>9)
         |                          ^~~~~~~~~~~
>> drivers/md/raid5.h:497:37: note: in expansion of macro 'STRIPE_SECTORS'
     497 |  if (bio_end_sector(bio) < sector + STRIPE_SECTORS)
         |                                     ^~~~~~~~~~~~~~

vim +/CONFIG_MD_RAID456_STRIPE_SIZE +476 drivers/md/raid5.h

   469	
   470	/*
   471	 * Stripe cache
   472	 */
   473	
   474	#define NR_STRIPES		256
   475	#define STRIPE_SIZE	\
 > 476		((PAGE_SIZE > 4096 && (CONFIG_MD_RAID456_STRIPE_SIZE % 4096 == 0)) ? \
   477		 CONFIG_MD_RAID456_STRIPE_SIZE : PAGE_SIZE)
   478	#define STRIPE_SHIFT		(PAGE_SHIFT - 9)
 > 479	#define STRIPE_SECTORS		(STRIPE_SIZE>>9)
   480	#define	IO_THRESHOLD		1
   481	#define BYPASS_THRESHOLD	1
   482	#define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
   483	#define HASH_MASK		(NR_HASH - 1)
   484	#define MAX_STRIPE_BATCH	8
   485	
   486	/* bio's attached to a stripe+device for I/O are linked together in bi_sector
   487	 * order without overlap.  There may be several bio's per stripe+device, and
   488	 * a bio could span several devices.
   489	 * When walking this list for a particular stripe+device, we must never proceed
   490	 * beyond a bio that extends past this device, as the next bio might no longer
   491	 * be valid.
   492	 * This function is used to determine the 'next' bio in the list, given the
   493	 * sector of the current stripe+device
   494	 */
   495	static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
   496	{
 > 497		if (bio_end_sector(bio) < sector + STRIPE_SECTORS)
   498			return bio->bi_next;
   499		else
   500			return NULL;
   501	}
   502	
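
A plausible reading of this report (an assumption; no fix was posted in this
subthread): CONFIG_MD_RAID456_STRIPE_SIZE depends on MD_RAID456, so the symbol
is undefined in configs like this ia64 defconfig that build raid0.c (which
includes raid5.h) without RAID456. A guard along these lines would avoid the
error:

#ifdef CONFIG_MD_RAID456_STRIPE_SIZE
/* Kconfig symbol present: honour it when it is a valid 4K multiple. */
#define STRIPE_SIZE	\
	((PAGE_SIZE > 4096 && (CONFIG_MD_RAID456_STRIPE_SIZE % 4096 == 0)) ? \
	 CONFIG_MD_RAID456_STRIPE_SIZE : PAGE_SIZE)
#else
/* Symbol absent (MD_RAID456 not enabled): fall back to the old behaviour. */
#define STRIPE_SIZE	PAGE_SIZE
#endif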

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 19938 bytes --]


* Re: [PATCH RFC v2 1/8] md/raid5: add CONFIG_MD_RAID456_STRIPE_SIZE to set STRIPE_SIZE
  2020-04-21 12:39 ` [PATCH RFC v2 1/8] md/raid5: add CONFIG_MD_RAID456_STRIPE_SIZE to set STRIPE_SIZE Yufen Yu
  2020-04-23  6:23   ` kbuild test robot
@ 2020-05-08  0:56   ` Song Liu
  1 sibling, 0 replies; 16+ messages in thread
From: Song Liu @ 2020-05-08  0:56 UTC (permalink / raw)
  To: Yufen Yu; +Cc: linux-raid, NeilBrown, Guoqing Jiang, Coly Li, Xiao Ni, Hou Tao

On Tue, Apr 21, 2020 at 5:40 AM Yufen Yu <yuyufen@huawei.com> wrote:
>
[...]
> Thus, we provide a config option to set STRIPE_SIZE for when PAGE_SIZE is
> bigger than 4096. Normally, the default value (4096) gives relatively good
> performance. But if most issued IOs are bigger than 4096, setting a value
> larger than 4096 may give better performance.
>
> Signed-off-by: Yufen Yu <yuyufen@huawei.com>

I am sorry for the delayed response.

> ---
>  drivers/md/Kconfig | 18 ++++++++++++++++++
>  drivers/md/raid5.h |  4 +++-
>  2 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index d6d5ab23c088..05c5a315358b 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -157,6 +157,24 @@ config MD_RAID456
>
>           If unsure, say Y.
>
> +config MD_RAID456_STRIPE_SIZE
> +       int "RAID4/RAID5/RAID6 stripe size"
> +       default "4096"
> +       depends on MD_RAID456
> +       help
> +         The default value is 4096. Only set a new value when PAGE_SIZE
> +         is bigger than 4096.  In that case, you can set it to more than
> +         4096, such as 8K, 16K or 64K, as long as it is a multiple of
> +         4K.
> +
> +         Only set a big value, such as 64KB on arm64, when you know the
> +         size of each IO issued to the raid device is mostly bigger than
> +         4096. Otherwise just use the default value.
> +
> +         Normally, the default STRIPE_SIZE gives better performance.
> +         Only change this value if you know what you are doing.
> +
> +
>  config MD_MULTIPATH
>         tristate "Multipath I/O support"
>         depends on BLK_DEV_MD
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index f90e0704bed9..b0010991bdc8 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -472,7 +472,9 @@ struct disk_info {
>   */
>
>  #define NR_STRIPES             256
> -#define STRIPE_SIZE            PAGE_SIZE
> +#define STRIPE_SIZE    \
> +       ((PAGE_SIZE > 4096 && (CONFIG_MD_RAID456_STRIPE_SIZE % 4096 == 0)) ? \
> +        CONFIG_MD_RAID456_STRIPE_SIZE : PAGE_SIZE)

How about we make the config value count multiples of 4kB? So 4096 would be 1, and 8192 would be 2.
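
I mean something like this (just a sketch, not a posted patch), where the
Kconfig value counts 4kB units and the "% 4096" validity check goes away:

#define STRIPE_SIZE	\
	((PAGE_SIZE > 4096) ? \
	 (CONFIG_MD_RAID456_STRIPE_SIZE * 4096) : PAGE_SIZE)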

Thanks,
Song


* Re: [PATCH RFC v2 2/8] md/raid5: add a member of r5pages for struct stripe_head
  2020-04-21 12:39 ` [PATCH RFC v2 2/8] md/raid5: add a member of r5pages for struct stripe_head Yufen Yu
@ 2020-05-08  1:25   ` Song Liu
  2020-05-08 23:58   ` Song Liu
  1 sibling, 0 replies; 16+ messages in thread
From: Song Liu @ 2020-05-08  1:25 UTC (permalink / raw)
  To: Yufen Yu; +Cc: linux-raid, NeilBrown, Guoqing Jiang, Coly Li, Xiao Ni, Hou Tao

On Tue, Apr 21, 2020 at 5:40 AM Yufen Yu <yuyufen@huawei.com> wrote:
>
> Since grow_buffers() uses alloc_page() to allocate the buffers for each
> stripe_head, it will allocate 64K buffers but just use 4K of them after
> STRIPE_SIZE is set to 4096.
>
> To avoid wasting memory, we try to contain multiple 'pages' of sh->dev
> in one real page. That means, multiple sh->dev[i].page will point to
> the same page with different offsets. An example with 64K PAGE_SIZE and
> 4K STRIPE_SIZE follows:
>
>                     64K PAGE_SIZE
>           +---+---+---+---+------------------------------+
>           |   |   |   |   |
>           |   |   |   |   |
>           +-+-+-+-+-+-+-+-+------------------------------+
>             ^   ^   ^   ^
>             |   |   |   +----------------------------+
>             |   |   |                                |
>             |   |   +-------------------+            |
>             |   |                       |            |
>             |   +----------+            |            |
>             |              |            |            |
>             +-+            |            |            |
>               |            |            |            |
>         +-----+-----+------+-----+------+-----+------+------+
> sh      | offset(0) | offset(4K) | offset(8K) | offset(12K) |
>  +      +-----------+------------+------------+-------------+
>  +----> dev[0].page  dev[1].page  dev[2].page  dev[3].page
>
> After compressing pages, the users of sh->dev[i].page need to take care:
>
>   1) When issuing a bio for a stripe_head, bi_io_vec.bv_page will point to
>      the page directly. So, we should make sure bv_offset is set to the
>      correct offset.
>
>   2) When computing xor, the page will be passed to the compute function.
>      So, we also need to pass the offset within that page to the compute
>      function, so that it works on the correct location in each
>      sh->dev[i].page.
>
> This patch adds a new member, r5pages, to stripe_head to manage all
> pages needed by each sh->dev[i]. We also add 'offset' to each r5dev so
> that users can get the related page offset easily. And we add helper
> functions to get a page and its index in the r5pages array by disk
> index. This patch is in preparation for the following changes.
>
> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
> ---
>  drivers/md/raid5.h | 56 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
>
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index b0010991bdc8..6296578e9585 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -246,6 +246,13 @@ struct stripe_head {
>                 int                  target, target2;
>                 enum sum_check_flags zero_sum_result;
>         } ops;
> +
> +       /* These pages will be used by bios in dev[i] */
> +       struct r5pages {
> +               struct page     **page;
> +               int     size; /* page array size */

Do we really need a variable size for each stripe?

Thanks,
Song

PS: I like the overall idea of 2/8 and 3/8. Will continue reviewing later.


> +       } pages;
> +
>         struct r5dev {
>                 /* rreq and rvec are used for the replacement device when
>                  * writing data to both devices.
> @@ -253,6 +260,7 @@ struct stripe_head {
>                 struct bio      req, rreq;
>                 struct bio_vec  vec, rvec;
>                 struct page     *page, *orig_page;
> +               unsigned int    offset;         /* offset of this page */
>                 struct bio      *toread, *read, *towrite, *written;
>                 sector_t        sector;                 /* sector of this page */
>                 unsigned long   flags;
> @@ -754,6 +762,54 @@ static inline int algorithm_is_DDF(int layout)
>         return layout >= 8 && layout <= 10;
>  }
>
> +/*
> + * Return corresponding page index of r5pages array.
> + */
> +static inline int raid5_get_page_index(struct stripe_head *sh, int disk_idx)
> +{
> +       WARN_ON(!sh->pages.page);
> +       if (disk_idx >= sh->raid_conf->pool_size)
> +               return -ENOENT;
> +
> +       return (disk_idx * STRIPE_SIZE) / PAGE_SIZE;
> +}
> +
> +/*
> + * Return offset of the corresponding page for r5dev.
> + */
> +static inline int raid5_get_page_offset(struct stripe_head *sh, int disk_idx)
> +{
> +       WARN_ON(!sh->pages.page);
> +       if (disk_idx >= sh->raid_conf->pool_size)
> +               return -ENOENT;
> +
> +       return (disk_idx * STRIPE_SIZE) % PAGE_SIZE;
> +}
> +
> +/*
> + * Return corresponding page address for r5dev.
> + */
> +static inline struct page *
> +raid5_get_dev_page(struct stripe_head *sh, int disk_idx)
> +{
> +       int idx;
> +
> +       WARN_ON(!sh->pages.page);
> +       idx = raid5_get_page_index(sh, disk_idx);
> +       return sh->pages.page[idx];
> +}
> +
> +/*
> + * We want to 'compress' multiple buffers into one real page for
> + * stripe_head when PAGE_SIZE is bigger than STRIPE_SIZE. If their
> + * values are equal, there is no need to use this strategy. For now,
> + * it only supports raid levels below 6.
> + */
> +static inline int raid5_compress_stripe_pages(struct r5conf *conf)
> +{
> +       return (PAGE_SIZE > STRIPE_SIZE) && (conf->level < 6);
> +}
> +
>  extern void md_raid5_kick_device(struct r5conf *conf);
>  extern int raid5_set_cache_size(struct mddev *mddev, int size);
>  extern sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous);
> --
> 2.21.1
>
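
For reference, here is how I read the intended use of the quoted helpers
(just a sketch, not code from the series): with a shared page, each bio
segment has to carry the per-device offset instead of the default 0.

/* Sketch only: build one bio segment for dev 'disk_idx' of a stripe. */
static void example_add_dev_page(struct bio *bio,
				 struct stripe_head *sh, int disk_idx)
{
	struct page *page = raid5_get_dev_page(sh, disk_idx);
	unsigned int offset = raid5_get_page_offset(sh, disk_idx);

	/* bv_page points at the shared page; bv_offset selects the 4K slice */
	bio_add_page(bio, page, STRIPE_SIZE, offset);
}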


* Re: [PATCH RFC v2 2/8] md/raid5: add a member of r5pages for struct stripe_head
  2020-04-21 12:39 ` [PATCH RFC v2 2/8] md/raid5: add a member of r5pages for struct stripe_head Yufen Yu
  2020-05-08  1:25   ` Song Liu
@ 2020-05-08 23:58   ` Song Liu
  2020-05-09  6:40     ` Yufen Yu
  1 sibling, 1 reply; 16+ messages in thread
From: Song Liu @ 2020-05-08 23:58 UTC (permalink / raw)
  To: Yufen Yu; +Cc: linux-raid, NeilBrown, Guoqing Jiang, Coly Li, Xiao Ni, Hou Tao

On Tue, Apr 21, 2020 at 5:40 AM Yufen Yu <yuyufen@huawei.com> wrote:
>
[...]
> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
> ---
> +
> +/*
> + * We want to 'compress' multiple buffers into one real page for
> + * stripe_head when PAGE_SIZE is bigger than STRIPE_SIZE. If their
> + * values are equal, there is no need to use this strategy. For now,
> + * it only supports raid levels below 6.
> + */

I don't think "compress" is the right terminology here. It is more
like "share" a page.
Why not support raid6 here?

Overall, the set looks reasonable to me. Please revise and send another version.

Thanks,
Song



> +static inline int raid5_compress_stripe_pages(struct r5conf *conf)
> +{
> +       return (PAGE_SIZE > STRIPE_SIZE) && (conf->level < 6);
> +}
> +
>  extern void md_raid5_kick_device(struct r5conf *conf);
>  extern int raid5_set_cache_size(struct mddev *mddev, int size);
>  extern sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous);
> --
> 2.21.1
>


* Re: [PATCH RFC v2 2/8] md/raid5: add a member of r5pages for struct stripe_head
  2020-05-08 23:58   ` Song Liu
@ 2020-05-09  6:40     ` Yufen Yu
  0 siblings, 0 replies; 16+ messages in thread
From: Yufen Yu @ 2020-05-09  6:40 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, NeilBrown, Guoqing Jiang, Coly Li, Xiao Ni, Hou Tao



On 2020/5/9 7:58, Song Liu wrote:
> On Tue, Apr 21, 2020 at 5:40 AM Yufen Yu <yuyufen@huawei.com> wrote:
>>
> [...]
>> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
>> ---
>> +
>> +/*
>> + * We want to 'compress' multiple buffers into one real page for
>> + * stripe_head when PAGE_SIZE is bigger than STRIPE_SIZE. If their
>> + * values are equal, there is no need to use this strategy. For now,
>> + * it only supports raid levels below 6.
>> + */
> 
> I don't think "compress" is the right terminology here. It is more
> like "share" a page.
> Why not support raid6 here?
> 

Thanks a lot for your review and suggestions.
Since supporting raid6 needs more changes to the existing code, this series
does not do it. I will try to support raid6 in the next version.

Thanks,
Yufen


end of thread, other threads:[~2020-05-09  6:40 UTC | newest]

Thread overview: 16+ messages
-- links below jump to the message on this page --
2020-04-21 12:39 [PATCH RFC v2 0/8] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
2020-04-21 12:39 ` [PATCH RFC v2 1/8] md/raid5: add CONFIG_MD_RAID456_STRIPE_SIZE to set STRIPE_SIZE Yufen Yu
2020-04-23  6:23   ` kbuild test robot
2020-05-08  0:56   ` Song Liu
2020-04-21 12:39 ` [PATCH RFC v2 2/8] md/raid5: add a member of r5pages for struct stripe_head Yufen Yu
2020-05-08  1:25   ` Song Liu
2020-05-08 23:58   ` Song Liu
2020-05-09  6:40     ` Yufen Yu
2020-04-21 12:39 ` [PATCH RFC v2 3/8] md/raid5: allocate and free pages of r5pages Yufen Yu
2020-04-21 12:39 ` [PATCH RFC v2 4/8] md/raid5: set correct page offset for bi_io_vec in ops_run_io() Yufen Yu
2020-04-21 12:39 ` [PATCH RFC v2 5/8] md/raid5: set correct page offset for async_copy_data() Yufen Yu
2020-04-21 12:39 ` [PATCH RFC v2 6/8] md/raid5: add new xor function to support different page offset Yufen Yu
2020-04-21 12:39 ` [PATCH RFC v2 7/8] md/raid5: add offset array in scribble buffer Yufen Yu
2020-04-21 12:39 ` [PATCH RFC v2 8/8] md/raid5: compute xor with correct page offset Yufen Yu
2020-04-21 12:56 ` [PATCH RFC v2 0/8] md/raid5: set STRIPE_SIZE as a configurable value Paul Menzel
2020-04-23  3:01   ` Yufen Yu
