* [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value
@ 2020-05-27 13:19 Yufen Yu
2020-05-27 13:19 ` [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE Yufen Yu
` (12 more replies)
0 siblings, 13 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
Hi, all
For now, STRIPE_SIZE is equal to the value of PAGE_SIZE. That means RAID5 will
issue each bio to disk at a granularity of at least 64KB when PAGE_SIZE is 64KB
on arm64. However, filesystems usually issue bios in units of 4KB. RAID5 therefore
wastes disk bandwidth.
To solve the problem, this patchset provides a new config option,
CONFIG_MD_RAID456_STRIPE_SHIFT, to let the user configure STRIPE_SIZE. The default
value is 1, which means a 4096-byte STRIPE_SIZE.
Normally, the default STRIPE_SIZE gives better performance, and NeilBrown has
suggested simply fixing STRIPE_SIZE at 4096. But our test results show that a
bigger STRIPE_SIZE may perform better when the issued IOs are mostly larger than
4096 bytes. Thus, in this patchset, we still want to make STRIPE_SIZE a
configurable value.
In the current implementation, grow_buffers() uses alloc_page() to allocate the
buffers for each stripe_head. With the change, that means we would allocate 64KB
buffers but use only 4KB of them. To save memory, we try to pack multiple buffers
of a stripe_head into one real page. Details are in patch #2.
To evaluate the new feature, we create a raid5 device '/dev/md5' with 4 SSD disks
and test it on an arm64 machine with 64KB PAGE_SIZE.
1) We format /dev/md5 with mkfs.ext4 and mount it with default options on the
/mnt directory. Then we test it with dbench using the command:
dbench -D /mnt -t 1000 10. Results are as follows:
'STRIPE_SIZE = 64KB'
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9805011 0.021 64.728
Close 7202525 0.001 0.120
Rename 415213 0.051 44.681
Unlink 1980066 0.079 93.147
Deltree 240 1.793 6.516
Mkdir 120 0.004 0.007
Qpathinfo 8887512 0.007 37.114
Qfileinfo 1557262 0.001 0.030
Qfsinfo 1629582 0.012 0.152
Sfileinfo 798756 0.040 57.641
Find 3436004 0.019 57.782
WriteX 4887239 0.021 57.638
ReadX 15370483 0.005 37.818
LockX 31934 0.003 0.022
UnlockX 31933 0.001 0.021
Flush 687205 13.302 530.088
Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms
-------------------------------------------------------
'STRIPE_SIZE = 4KB'
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 11999166 0.021 36.380
Close 8814128 0.001 0.122
Rename 508113 0.051 29.169
Unlink 2423242 0.070 38.141
Deltree 300 1.885 7.155
Mkdir 150 0.004 0.006
Qpathinfo 10875921 0.007 35.485
Qfileinfo 1905837 0.001 0.032
Qfsinfo 1994304 0.012 0.125
Sfileinfo 977450 0.029 26.489
Find 4204952 0.019 9.361
WriteX 5981890 0.019 27.804
ReadX 18809742 0.004 33.491
LockX 39074 0.003 0.025
UnlockX 39074 0.001 0.014
Flush 841022 10.712 458.848
Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms
-------------------------------------------------------
It shows that setting STRIPE_SIZE to 4KB gives higher throughput
(376.777 vs 307.799 MB/sec) and lower max latency (458.852 vs 530.091 ms)
than setting it to 64KB.
2) We try to evaluate IO throughput for /dev/md5 by fio with config:
[4KB randwrite]
direct=1
numjobs=2
iodepth=64
ioengine=libaio
filename=/dev/md5
bs=4KB
rw=randwrite
[1MB write]
direct=1
numjobs=2
iodepth=64
ioengine=libaio
filename=/dev/md5
bs=1MB
rw=write
The fio test results are as follows:
              |  STRIPE_SIZE(64KB)  |  STRIPE_SIZE(4KB)
--------------+---------------------+------------------
4KB randwrite |       15MB/s        |      100MB/s
--------------+---------------------+------------------
1MB write     |      1000MB/s       |      700MB/s
The results show that with large IOs (the 1MB sequential writes), 64KB
STRIPE_SIZE gives much higher throughput. But for 4KB random writes, where
the IOs issued to the device are small, 4KB STRIPE_SIZE performs better.
V3:
* RAID6 can support shared pages.
* Rename function raid5_compress_stripe_pages() as
raid5_stripe_pages_shared() and update commit message.
* Rename CONFIG_MD_RAID456_STRIPE_SIZE to CONFIG_MD_RAID456_STRIPE_SHIFT,
and make STRIPE_SIZE a multiple of 4KB.
V2:
https://www.spinics.net/lists/raid/msg64254.html
Introduce the shared-pages strategy to save memory; it only supports RAID4 and RAID5.
V1:
https://www.spinics.net/lists/raid/msg63111.html
Just add CONFIG_MD_RAID456_STRIPE_SIZE to set STRIPE_SIZE
Yufen Yu (11):
md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE
md/raid5: add a member of r5pages for struct stripe_head
md/raid5: allocate and free pages of r5pages
md/raid5: set correct page offset for bi_io_vec in ops_run_io()
md/raid5: set correct page offset for async_copy_data()
md/raid5: add new xor function to support different page offset
md/raid5: add offset array in scribble buffer
md/raid5: compute xor with correct page offset
md/raid6: let syndrome computor support different page offset
md/raid6: compute syndrome with correct page offset
raid6test: adaptation with syndrome function
crypto/async_tx/async_pq.c | 71 ++++---
crypto/async_tx/async_raid6_recov.c | 161 +++++++++++----
crypto/async_tx/async_xor.c | 120 +++++++++--
crypto/async_tx/raid6test.c | 24 ++-
drivers/md/Kconfig | 21 ++
drivers/md/raid5.c | 302 +++++++++++++++++++++++-----
drivers/md/raid5.h | 59 +++++-
include/linux/async_tx.h | 23 ++-
8 files changed, 628 insertions(+), 153 deletions(-)
--
2.21.3
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
@ 2020-05-27 13:19 ` Yufen Yu
2020-05-27 13:54 ` Guoqing Jiang
` (3 more replies)
2020-05-27 13:19 ` [PATCH v3 02/11] md/raid5: add a member of r5pages for struct stripe_head Yufen Yu
` (11 subsequent siblings)
12 siblings, 4 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
In RAID5, if the issued bio size is bigger than STRIPE_SIZE, it will be split
into STRIPE_SIZE units and processed one by one. Even for sizes smaller than
STRIPE_SIZE, RAID5 still requests data from disk in units of at least
STRIPE_SIZE.
Nowadays, STRIPE_SIZE is equal to the value of PAGE_SIZE. Since filesystems
usually issue bios in units of 4KB, there is no problem when PAGE_SIZE is
4KB. But with a 64KB PAGE_SIZE, a bio from the filesystem requests 4KB of data
while RAID5 issues IO of at least STRIPE_SIZE (64KB) each time. That wastes
disk bandwidth and xor computation.
To avoid the waste, we want to add a new CONFIG option to adjust
STRIPE_SIZE. The default value is 4096. Users can also set a value bigger
than 4KB for special requirements, such as when we know the issued IO size
is always more than 4KB.
To evaluate the new feature, we create a raid5 device '/dev/md5' with
4 SSD disks and test it on an arm64 machine with 64KB PAGE_SIZE.
1) We format /dev/md5 with mkfs.ext4 and mount it with default options on
the /mnt directory. Then we test it with dbench using the command:
dbench -D /mnt -t 1000 10. Results are as follows:
'STRIPE_SIZE = 64KB'
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9805011 0.021 64.728
Close 7202525 0.001 0.120
Rename 415213 0.051 44.681
Unlink 1980066 0.079 93.147
Deltree 240 1.793 6.516
Mkdir 120 0.004 0.007
Qpathinfo 8887512 0.007 37.114
Qfileinfo 1557262 0.001 0.030
Qfsinfo 1629582 0.012 0.152
Sfileinfo 798756 0.040 57.641
Find 3436004 0.019 57.782
WriteX 4887239 0.021 57.638
ReadX 15370483 0.005 37.818
LockX 31934 0.003 0.022
UnlockX 31933 0.001 0.021
Flush 687205 13.302 530.088
Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms
-------------------------------------------------------
'STRIPE_SIZE = 4KB'
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 11999166 0.021 36.380
Close 8814128 0.001 0.122
Rename 508113 0.051 29.169
Unlink 2423242 0.070 38.141
Deltree 300 1.885 7.155
Mkdir 150 0.004 0.006
Qpathinfo 10875921 0.007 35.485
Qfileinfo 1905837 0.001 0.032
Qfsinfo 1994304 0.012 0.125
Sfileinfo 977450 0.029 26.489
Find 4204952 0.019 9.361
WriteX 5981890 0.019 27.804
ReadX 18809742 0.004 33.491
LockX 39074 0.003 0.025
UnlockX 39074 0.001 0.014
Flush 841022 10.712 458.848
Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms
-------------------------------------------------------
It shows that setting STRIPE_SIZE to 4KB gives higher throughput
(376.777 vs 307.799 MB/sec) and lower max latency (458.852 vs 530.091 ms)
than setting it to 64KB.
2) We try to evaluate IO throughput for /dev/md5 by fio with config:
[4KB randwrite]
direct=1
numjobs=2
iodepth=64
ioengine=libaio
filename=/dev/md5
bs=4KB
rw=randwrite
[1MB write]
direct=1
numjobs=2
iodepth=64
ioengine=libaio
filename=/dev/md5
bs=1MB
rw=write
The results are as follows:
              |  STRIPE_SIZE(64KB)  |  STRIPE_SIZE(4KB)
--------------+---------------------+------------------
4KB randwrite |       15MB/s        |      100MB/s
--------------+---------------------+------------------
1MB write     |      1000MB/s       |      700MB/s
The results show that with large IOs (the 1MB sequential writes), 64KB
STRIPE_SIZE gives much higher throughput. But for 4KB random writes, where
the IOs issued to the device are small, 4KB STRIPE_SIZE performs better.
Thus, we provide a config option to set STRIPE_SIZE when PAGE_SIZE is bigger
than 4096. Normally, the default value (4096) gives relatively good
performance. But if each issued IO is bigger than 4096, a value larger than
4096 may give better performance.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
drivers/md/Kconfig | 21 +++++++++++++++++++++
drivers/md/raid5.h | 4 +++-
2 files changed, 24 insertions(+), 1 deletion(-)
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index d6d5ab23c088..629324f92c42 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -157,6 +157,27 @@ config MD_RAID456
If unsure, say Y.
+config MD_RAID456_STRIPE_SHIFT
+ int "RAID4/RAID5/RAID6 stripe size shift"
+ default "1"
+ depends on MD_RAID456
+ help
+ When set to 'N', the stripe size will be 'N << 12', which is
+ a multiple of 4KB.
+
+ The default value is 1, which means the default stripe size
+ is 4096 (1 << 12). Only set a bigger value when PAGE_SIZE is
+ bigger than 4096; in that case you can set it to 2 (8KB),
+ 4 (16KB) or 16 (64KB).
+
+ Setting a big value, such as 16 on arm64 with 64KB PAGE_SIZE,
+ means you know the size of each IO issued to the raid device
+ is more than 4096. Otherwise just use the default value.
+
+ Normally, using the default value gives better performance.
+ Only change this value if you know what you are doing.
+
+
config MD_MULTIPATH
tristate "Multipath I/O support"
depends on BLK_DEV_MD
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index f90e0704bed9..b25f107dafc7 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -472,7 +472,9 @@ struct disk_info {
*/
#define NR_STRIPES 256
-#define STRIPE_SIZE PAGE_SIZE
+#define CONFIG_STRIPE_SIZE (CONFIG_MD_RAID456_STRIPE_SHIFT << 12)
+#define STRIPE_SIZE \
+ (CONFIG_STRIPE_SIZE > PAGE_SIZE ? PAGE_SIZE : CONFIG_STRIPE_SIZE)
#define STRIPE_SHIFT (PAGE_SHIFT - 9)
#define STRIPE_SECTORS (STRIPE_SIZE>>9)
#define IO_THRESHOLD 1
--
2.21.3
* [PATCH v3 02/11] md/raid5: add a member of r5pages for struct stripe_head
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
2020-05-27 13:19 ` [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE Yufen Yu
@ 2020-05-27 13:19 ` Yufen Yu
2020-05-27 13:19 ` [PATCH v3 03/11] md/raid5: allocate and free pages of r5pages Yufen Yu
` (10 subsequent siblings)
12 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
Since grow_buffers() uses alloc_page() to allocate the buffers for each
stripe_head, it will allocate a 64KB buffer but use only 4KB of it after
STRIPE_SIZE is set to 4096.
To avoid wasting memory, we try to pack multiple 'pages' of sh->dev into
one real page. That means multiple sh->dev[i].page pointers will point to
the same page at different offsets. An example with 64K PAGE_SIZE and
4K STRIPE_SIZE follows:
64K PAGE_SIZE
+---+---+---+---+------------------------------+
| | | | |
| | | | |
+-+-+-+-+-+-+-+-+------------------------------+
^ ^ ^ ^
| | | +----------------------------+
| | | |
| | +-------------------+ |
| | | |
| +----------+ | |
| | | |
+-+ | | |
| | | |
+-----+-----+------+-----+------+-----+------+------+
sh | offset(0) | offset(4K) | offset(8K) | offset(12K) |
+ +-----------+------------+------------+-------------+
+----> dev[0].page dev[1].page dev[2].page dev[3].page
After sharing pages, users of sh->dev[i].page need to take care:
1) When issuing a bio for a stripe_head, bi_io_vec.bv_page will point to
the page directly, so we must make sure bv_offset is set to the
correct offset.
2) When computing xor, the page is passed to the compute function, so we
also need to pass the offset within that page, letting it compute the
correct location for each sh->dev[i].page.
This patch adds a new member, r5pages, to stripe_head to manage all pages
needed by each sh->dev[i]. We also add 'offset' to each r5dev so that users
can easily get the related page offset, and we add helper functions to get a
page and its index in the r5pages array from the disk index. This patch is
in preparation for the following changes.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
drivers/md/raid5.h | 55 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index b25f107dafc7..edc9bf519d05 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -246,6 +246,13 @@ struct stripe_head {
int target, target2;
enum sum_check_flags zero_sum_result;
} ops;
+
+ /* These pages will be used by bios in dev[i] */
+ struct r5pages {
+ struct page **page;
+ int size; /* page array size */
+ } pages;
+
struct r5dev {
/* rreq and rvec are used for the replacement device when
* writing data to both devices.
@@ -253,6 +260,7 @@ struct stripe_head {
struct bio req, rreq;
struct bio_vec vec, rvec;
struct page *page, *orig_page;
+ unsigned int offset; /* offset of this page */
struct bio *toread, *read, *towrite, *written;
sector_t sector; /* sector of this page */
unsigned long flags;
@@ -754,6 +762,53 @@ static inline int algorithm_is_DDF(int layout)
return layout >= 8 && layout <= 10;
}
+/*
+ * Return corresponding page index of r5pages array.
+ */
+static inline int raid5_get_page_index(struct stripe_head *sh, int disk_idx)
+{
+ WARN_ON(!sh->pages.page);
+ if (disk_idx >= sh->raid_conf->pool_size)
+ return -ENOENT;
+
+ return (disk_idx * STRIPE_SIZE) / PAGE_SIZE;
+}
+
+/*
+ * Return offset of the corresponding page for r5dev.
+ */
+static inline int raid5_get_page_offset(struct stripe_head *sh, int disk_idx)
+{
+ WARN_ON(!sh->pages.page);
+ if (disk_idx >= sh->raid_conf->pool_size)
+ return -ENOENT;
+
+ return (disk_idx * STRIPE_SIZE) % PAGE_SIZE;
+}
+
+/*
+ * Return corresponding page address for r5dev.
+ */
+static inline struct page *
+raid5_get_dev_page(struct stripe_head *sh, int disk_idx)
+{
+ int idx;
+
+ WARN_ON(!sh->pages.page);
+ idx = raid5_get_page_index(sh, disk_idx);
+ return sh->pages.page[idx];
+}
+
+/*
+ * We want to let multiple buffers share one real page for a
+ * stripe_head when PAGE_SIZE is bigger than STRIPE_SIZE. If
+ * they are equal, there is no need to use this strategy.
+ */
+static inline int raid5_stripe_pages_shared(struct r5conf *conf)
+{
+ return PAGE_SIZE > STRIPE_SIZE;
+}
+
extern void md_raid5_kick_device(struct r5conf *conf);
extern int raid5_set_cache_size(struct mddev *mddev, int size);
extern sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous);
--
2.21.3
* [PATCH v3 03/11] md/raid5: allocate and free pages of r5pages
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
2020-05-27 13:19 ` [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE Yufen Yu
2020-05-27 13:19 ` [PATCH v3 02/11] md/raid5: add a member of r5pages for struct stripe_head Yufen Yu
@ 2020-05-27 13:19 ` Yufen Yu
2020-05-27 13:19 ` [PATCH v3 04/11] md/raid5: set correct page offset for bi_io_vec in ops_run_io() Yufen Yu
` (9 subsequent siblings)
12 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
When PAGE_SIZE is bigger than STRIPE_SIZE, try to allocate the pages of
r5pages in grow_buffers() and free them in shrink_buffers().
Then set sh->dev[i].page and sh->dev[i].offset to the page and offset in the
array. Without shared pages enabled, we just set offset to 0.
When reshaping the raid array, newly allocated stripes can reuse the old
stripes' pages. Note that we call resize_stripes() only when growing the
number of disks in the array, so we need not worry about leaking the old r5pages.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
drivers/md/raid5.c | 142 ++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 128 insertions(+), 14 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3f96b4406902..57d140c930bd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -448,20 +448,72 @@ static struct stripe_head *get_free_stripe(struct r5conf *conf, int hash)
return sh;
}
+/*
+ * Try to free all pages in r5pages array.
+ */
+static void free_stripe_pages(struct stripe_head *sh)
+{
+ int i;
+ struct page *p;
+
+ /* The page pool has not been allocated */
+ if (!sh->pages.page)
+ return;
+
+ for (i = 0; i < sh->pages.size; i++) {
+ p = sh->pages.page[i];
+ if (p)
+ put_page(p);
+ sh->pages.page[i] = NULL;
+ }
+}
+
+/*
+ * Allocate pages for r5pages.
+ */
+static int alloc_stripe_pages(struct stripe_head *sh, gfp_t gfp)
+{
+ int i;
+ struct page *p;
+
+ for (i = 0; i < sh->pages.size; i++) {
+ /* The page has already been allocated. */
+ if (sh->pages.page[i])
+ continue;
+
+ p = alloc_page(gfp);
+ if (!p) {
+ free_stripe_pages(sh);
+ return -ENOMEM;
+ }
+ sh->pages.page[i] = p;
+ }
+ return 0;
+}
+
static void shrink_buffers(struct stripe_head *sh)
{
struct page *p;
int i;
int num = sh->raid_conf->pool_size;
- for (i = 0; i < num ; i++) {
+ if (raid5_stripe_pages_shared(sh->raid_conf))
+ free_stripe_pages(sh); /* Free pages in r5pages */
+
+ for (i = 0; i < num; i++) {
WARN_ON(sh->dev[i].page != sh->dev[i].orig_page);
p = sh->dev[i].page;
- if (!p)
+
+ /*
+ * If we use pages in r5pages, these pages have been
+ * freed in free_stripe_pages().
+ */
+ if (raid5_stripe_pages_shared(sh->raid_conf) || !p)
continue;
sh->dev[i].page = NULL;
put_page(p);
}
+
}
static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
@@ -469,14 +521,26 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
int i;
int num = sh->raid_conf->pool_size;
+ if (raid5_stripe_pages_shared(sh->raid_conf) &&
+ alloc_stripe_pages(sh, gfp))
+ return -ENOMEM;
+
for (i = 0; i < num; i++) {
struct page *page;
+ unsigned int offset;
- if (!(page = alloc_page(gfp))) {
- return 1;
+ if (raid5_stripe_pages_shared(sh->raid_conf)) {
+ page = raid5_get_dev_page(sh, i);
+ offset = raid5_get_page_offset(sh, i);
+ } else {
+ page = alloc_page(gfp);
+ if (!page)
+ return -ENOMEM;
+ offset = 0;
}
sh->dev[i].page = page;
sh->dev[i].orig_page = page;
+ sh->dev[i].offset = offset;
}
return 0;
@@ -2123,6 +2187,9 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
static void free_stripe(struct kmem_cache *sc, struct stripe_head *sh)
{
+ if (sh->pages.page)
+ kfree(sh->pages.page);
+
if (sh->ppl_page)
__free_page(sh->ppl_page);
kmem_cache_free(sc, sh);
@@ -2154,14 +2221,28 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
if (raid5_has_ppl(conf)) {
sh->ppl_page = alloc_page(gfp);
- if (!sh->ppl_page) {
- free_stripe(sc, sh);
- sh = NULL;
- }
+ if (!sh->ppl_page)
+ goto fail;
+ }
+
+ if (raid5_stripe_pages_shared(conf)) {
+ int nr_page;
+
+ /* Each of the sh->dev[i] needs one STRIPE_SIZE */
+ nr_page = (disks * STRIPE_SIZE + PAGE_SIZE - 1) / PAGE_SIZE;
+ sh->pages.page = kzalloc(sizeof(struct page *) * nr_page, gfp);
+ if (!sh->pages.page)
+ goto fail;
+ sh->pages.size = nr_page;
}
}
return sh;
+
+fail:
+ free_stripe(sc, sh);
+ return NULL;
}
+
static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
{
struct stripe_head *sh;
@@ -2360,10 +2441,18 @@ static int resize_stripes(struct r5conf *conf, int newsize)
osh = get_free_stripe(conf, hash);
unlock_device_hash_lock(conf, hash);
- for(i=0; i<conf->pool_size; i++) {
+ if (raid5_stripe_pages_shared(conf)) {
+ /* We reuse pages in r5pages of old stripe head */
+ for (i = 0; i < osh->pages.size; i++)
+ nsh->pages.page[i] = osh->pages.page[i];
+ }
+
+ for (i = 0; i < conf->pool_size; i++) {
nsh->dev[i].page = osh->dev[i].page;
nsh->dev[i].orig_page = osh->dev[i].page;
+ nsh->dev[i].offset = osh->dev[i].offset;
}
+
nsh->hash_lock_index = hash;
free_stripe(conf->slab_cache, osh);
cnt++;
@@ -2410,17 +2499,42 @@ static int resize_stripes(struct r5conf *conf, int newsize)
/* Step 4, return new stripes to service */
while(!list_empty(&newstripes)) {
+ struct page *p;
+ unsigned int offset;
nsh = list_entry(newstripes.next, struct stripe_head, lru);
list_del_init(&nsh->lru);
- for (i=conf->raid_disks; i < newsize; i++)
- if (nsh->dev[i].page == NULL) {
- struct page *p = alloc_page(GFP_NOIO);
- nsh->dev[i].page = p;
- nsh->dev[i].orig_page = p;
+ /*
+ * If we use r5pages (i.e. pages.size is not zero), allocate
+ * the pages needed by the new stripe_head.
+ */
+ for (i = 0; i < nsh->pages.size; i++) {
+ if (nsh->pages.page[i] == NULL) {
+ p = alloc_page(GFP_NOIO);
if (!p)
err = -ENOMEM;
+ nsh->pages.page[i] = p;
}
+ }
+
+ for (i = conf->raid_disks; i < newsize; i++) {
+ if (nsh->dev[i].page)
+ continue;
+
+ if (raid5_stripe_pages_shared(conf)) {
+ p = raid5_get_dev_page(nsh, i);
+ offset = raid5_get_page_offset(nsh, i);
+ } else {
+ p = alloc_page(GFP_NOIO);
+ if (!p)
+ err = -ENOMEM;
+ offset = 0;
+ }
+
+ nsh->dev[i].page = p;
+ nsh->dev[i].orig_page = p;
+ nsh->dev[i].offset = offset;
+ }
raid5_release_stripe(nsh);
}
/* critical section pass, GFP_NOIO no longer needed */
--
2.21.3
* [PATCH v3 04/11] md/raid5: set correct page offset for bi_io_vec in ops_run_io()
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
` (2 preceding siblings ...)
2020-05-27 13:19 ` [PATCH v3 03/11] md/raid5: allocate and free pages of r5pages Yufen Yu
@ 2020-05-27 13:19 ` Yufen Yu
2020-05-27 13:19 ` [PATCH v3 05/11] md/raid5: set correct page offset for async_copy_data() Yufen Yu
` (8 subsequent siblings)
12 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
After using r5pages for each sh->dev[i], we need to set the correct offset
of that page in bi_io_vec when issuing a bio. The offset is zero when
r5pages is not used.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
drivers/md/raid5.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 57d140c930bd..9890f21e4f47 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1194,7 +1194,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
sh->dev[i].vec.bv_page = sh->dev[i].page;
bi->bi_vcnt = 1;
bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
- bi->bi_io_vec[0].bv_offset = 0;
+ bi->bi_io_vec[0].bv_offset = sh->dev[i].offset;
bi->bi_iter.bi_size = STRIPE_SIZE;
bi->bi_write_hint = sh->dev[i].write_hint;
if (!rrdev)
@@ -1248,7 +1248,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
sh->dev[i].rvec.bv_page = sh->dev[i].page;
rbi->bi_vcnt = 1;
rbi->bi_io_vec[0].bv_len = STRIPE_SIZE;
- rbi->bi_io_vec[0].bv_offset = 0;
+ rbi->bi_io_vec[0].bv_offset = sh->dev[i].offset;
rbi->bi_iter.bi_size = STRIPE_SIZE;
rbi->bi_write_hint = sh->dev[i].write_hint;
sh->dev[i].write_hint = RWH_WRITE_LIFE_NOT_SET;
--
2.21.3
* [PATCH v3 05/11] md/raid5: set correct page offset for async_copy_data()
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
` (3 preceding siblings ...)
2020-05-27 13:19 ` [PATCH v3 04/11] md/raid5: set correct page offset for bi_io_vec in ops_run_io() Yufen Yu
@ 2020-05-27 13:19 ` Yufen Yu
2020-05-27 13:19 ` [PATCH v3 06/11] md/raid5: add new xor function to support different page offset Yufen Yu
` (7 subsequent siblings)
12 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
ops_run_biofill() and ops_run_biodrain() call async_copy_data() to copy
sh->dev[i].page from or to a bio. They also need to set the correct page
offset for dev->page when r5pages is used.
Without modifying the original code logic, we simply replace 'page_offset'
with 'page_offset + poff'. When r5pages is not used, poff is zero.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
drivers/md/raid5.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9890f21e4f47..4b7b5cc1ba1f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1290,7 +1290,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
static struct dma_async_tx_descriptor *
async_copy_data(int frombio, struct bio *bio, struct page **page,
- sector_t sector, struct dma_async_tx_descriptor *tx,
+ unsigned int poff, sector_t sector, struct dma_async_tx_descriptor *tx,
struct stripe_head *sh, int no_skipcopy)
{
struct bio_vec bvl;
@@ -1325,6 +1325,7 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
else
clen = len;
+
if (clen > 0) {
b_offset += bvl.bv_offset;
bio_page = bvl.bv_page;
@@ -1335,11 +1336,12 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
!no_skipcopy)
*page = bio_page;
else
- tx = async_memcpy(*page, bio_page, page_offset,
- b_offset, clen, &submit);
+ tx = async_memcpy(*page, bio_page,
+ page_offset + poff, b_offset,
+ clen, &submit);
} else
tx = async_memcpy(bio_page, *page, b_offset,
- page_offset, clen, &submit);
+ page_offset + poff, clen, &submit);
}
/* chain the operations */
submit.depend_tx = tx;
@@ -1410,7 +1412,7 @@ static void ops_run_biofill(struct stripe_head *sh)
while (rbi && rbi->bi_iter.bi_sector <
dev->sector + STRIPE_SECTORS) {
tx = async_copy_data(0, rbi, &dev->page,
- dev->sector, tx, sh, 0);
+ dev->offset, dev->sector, tx, sh, 0);
rbi = r5_next_bio(rbi, dev->sector);
}
}
@@ -1825,7 +1827,8 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
set_bit(R5_Discard, &dev->flags);
else {
tx = async_copy_data(1, wbi, &dev->page,
- dev->sector, tx, sh,
+ dev->offset, dev->sector,
+ tx, sh,
r5c_is_writeback(conf->log));
if (dev->page != dev->orig_page &&
!r5c_is_writeback(conf->log)) {
--
2.21.3
* [PATCH v3 06/11] md/raid5: add new xor function to support different page offset
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
` (4 preceding siblings ...)
2020-05-27 13:19 ` [PATCH v3 05/11] md/raid5: set correct page offset for async_copy_data() Yufen Yu
@ 2020-05-27 13:19 ` Yufen Yu
2020-05-27 13:19 ` [PATCH v3 07/11] md/raid5: add offset array in scribble buffer Yufen Yu
` (6 subsequent siblings)
12 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
RAID5 calls async_xor() and async_xor_val() to compute xor.
However, both of them require a common src/dst page offset. After
introducing the shared pages of r5pages, we want these xor compute
functions to support different src/dst page offsets.
Here, we add two new functions, async_xor_offsets() and
async_xor_val_offsets(), corresponding to async_xor() and async_xor_val().
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
crypto/async_tx/async_xor.c | 120 +++++++++++++++++++++++++++++++-----
include/linux/async_tx.h | 11 ++++
2 files changed, 114 insertions(+), 17 deletions(-)
diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
index 4e5eebe52e6a..29a979358332 100644
--- a/crypto/async_tx/async_xor.c
+++ b/crypto/async_tx/async_xor.c
@@ -97,7 +97,8 @@ do_async_xor(struct dma_chan *chan, struct dmaengine_unmap_data *unmap,
}
static void
-do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
+do_sync_xor_offsets(struct page *dest, unsigned int offset,
+ struct page **src_list, unsigned int *src_offset,
int src_cnt, size_t len, struct async_submit_ctl *submit)
{
int i;
@@ -114,7 +115,8 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
/* convert to buffer pointers */
for (i = 0; i < src_cnt; i++)
if (src_list[i])
- srcs[xor_src_cnt++] = page_address(src_list[i]) + offset;
+ srcs[xor_src_cnt++] = page_address(src_list[i]) +
+ (src_offset ? src_offset[i] : offset);
src_cnt = xor_src_cnt;
/* set destination address */
dest_buf = page_address(dest) + offset;
@@ -135,11 +137,31 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
async_tx_sync_epilog(submit);
}
+static inline bool
+dma_xor_aligned_offsets(struct dma_device *device, unsigned int offset,
+ unsigned int *src_offset, int src_cnt, int len)
+{
+ int i;
+
+ if (!is_dma_xor_aligned(device, offset, 0, len))
+ return false;
+
+ if (!src_offset)
+ return true;
+
+ for (i = 0; i < src_cnt; i++) {
+ if (!is_dma_xor_aligned(device, src_offset[i], 0, len))
+ return false;
+ }
+ return true;
+}
+
/**
- * async_xor - attempt to xor a set of blocks with a dma engine.
+ * async_xor_offsets - attempt to xor a set of blocks with a dma engine.
* @dest: destination page
+ * @offset: dst offset to start transaction
* @src_list: array of source pages
- * @offset: common src/dst offset to start transaction
* @src_offset: array of source page offsets; NULL means common src/dst offset
* @src_cnt: number of source pages
* @len: length in bytes
* @submit: submission / completion modifiers
@@ -157,8 +179,9 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
* is not specified.
*/
struct dma_async_tx_descriptor *
-async_xor(struct page *dest, struct page **src_list, unsigned int offset,
- int src_cnt, size_t len, struct async_submit_ctl *submit)
+async_xor_offsets(struct page *dest, unsigned int offset,
+ struct page **src_list, unsigned int *src_offset,
+ int src_cnt, size_t len, struct async_submit_ctl *submit)
{
struct dma_chan *chan = async_tx_find_channel(submit, DMA_XOR,
&dest, 1, src_list,
@@ -171,7 +194,8 @@ async_xor(struct page *dest, struct page **src_list, unsigned int offset,
if (device)
unmap = dmaengine_get_unmap_data(device->dev, src_cnt+1, GFP_NOWAIT);
- if (unmap && is_dma_xor_aligned(device, offset, 0, len)) {
+ if (unmap && dma_xor_aligned_offsets(device, offset,
+ src_offset, src_cnt, len)) {
struct dma_async_tx_descriptor *tx;
int i, j;
@@ -184,7 +208,8 @@ async_xor(struct page *dest, struct page **src_list, unsigned int offset,
continue;
unmap->to_cnt++;
unmap->addr[j++] = dma_map_page(device->dev, src_list[i],
- offset, len, DMA_TO_DEVICE);
+ (src_offset ? src_offset[i] : offset),
+ len, DMA_TO_DEVICE);
}
/* map it bidirectional as it may be re-used as a source */
@@ -213,11 +238,42 @@ async_xor(struct page *dest, struct page **src_list, unsigned int offset,
/* wait for any prerequisite operations */
async_tx_quiesce(&submit->depend_tx);
- do_sync_xor(dest, src_list, offset, src_cnt, len, submit);
+ do_sync_xor_offsets(dest, offset, src_list, src_offset,
+ src_cnt, len, submit);
return NULL;
}
}
+EXPORT_SYMBOL_GPL(async_xor_offsets);
+
+/**
+ * async_xor - attempt to xor a set of blocks with a dma engine.
+ * @dest: destination page
+ * @src_list: array of source pages
+ * @offset: common src/dst offset to start transaction
+ * @src_cnt: number of source pages
+ * @len: length in bytes
+ * @submit: submission / completion modifiers
+ *
+ * honored flags: ASYNC_TX_ACK, ASYNC_TX_XOR_ZERO_DST, ASYNC_TX_XOR_DROP_DST
+ *
+ * xor_blocks always uses the dest as a source so the
+ * ASYNC_TX_XOR_ZERO_DST flag must be set to not include dest data in
+ * the calculation. The assumption with dma engines is that they only
+ * use the destination buffer as a source when it is explicitly specified
+ * in the source list.
+ *
+ * src_list note: if the dest is also a source it must be at index zero.
+ * The contents of this array will be overwritten if a scribble region
+ * is not specified.
+ */
+struct dma_async_tx_descriptor *
+async_xor(struct page *dest, struct page **src_list, unsigned int offset,
+ int src_cnt, size_t len, struct async_submit_ctl *submit)
+{
+ return async_xor_offsets(dest, offset, src_list, NULL,
+ src_cnt, len, submit);
+}
EXPORT_SYMBOL_GPL(async_xor);
static int page_is_zero(struct page *p, unsigned int offset, size_t len)
@@ -237,10 +293,11 @@ xor_val_chan(struct async_submit_ctl *submit, struct page *dest,
}
/**
- * async_xor_val - attempt a xor parity check with a dma engine.
+ * async_xor_val_offsets - attempt a xor parity check with a dma engine.
* @dest: destination page used if the xor is performed synchronously
+ * @offset: dest offset in pages to start transaction
* @src_list: array of source pages
- * @offset: offset in pages to start transaction
+ * @src_offset: array of source page offsets, NULL means common src/dst offset
* @src_cnt: number of source pages
* @len: length in bytes
* @result: 0 if sum == 0 else non-zero
@@ -253,9 +310,10 @@ xor_val_chan(struct async_submit_ctl *submit, struct page *dest,
* is not specified.
*/
struct dma_async_tx_descriptor *
-async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
- int src_cnt, size_t len, enum sum_check_flags *result,
- struct async_submit_ctl *submit)
+async_xor_val_offsets(struct page *dest, unsigned int offset,
+ struct page **src_list, unsigned int *src_offset,
+ int src_cnt, size_t len, enum sum_check_flags *result,
+ struct async_submit_ctl *submit)
{
struct dma_chan *chan = xor_val_chan(submit, dest, src_list, src_cnt, len);
struct dma_device *device = chan ? chan->device : NULL;
@@ -268,7 +326,7 @@ async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
unmap = dmaengine_get_unmap_data(device->dev, src_cnt, GFP_NOWAIT);
if (unmap && src_cnt <= device->max_xor &&
- is_dma_xor_aligned(device, offset, 0, len)) {
+ dma_xor_aligned_offsets(device, offset, src_offset, src_cnt, len)) {
unsigned long dma_prep_flags = 0;
int i;
@@ -281,7 +339,8 @@ async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
for (i = 0; i < src_cnt; i++) {
unmap->addr[i] = dma_map_page(device->dev, src_list[i],
- offset, len, DMA_TO_DEVICE);
+ (src_offset ? src_offset[i] : offset),
+ len, DMA_TO_DEVICE);
unmap->to_cnt++;
}
unmap->len = len;
@@ -312,7 +371,8 @@ async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
submit->flags |= ASYNC_TX_XOR_DROP_DST;
submit->flags &= ~ASYNC_TX_ACK;
- tx = async_xor(dest, src_list, offset, src_cnt, len, submit);
+ tx = async_xor_offsets(dest, offset, src_list, src_offset,
+ src_cnt, len, submit);
async_tx_quiesce(&tx);
@@ -325,6 +385,32 @@ async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
return tx;
}
+EXPORT_SYMBOL_GPL(async_xor_val_offsets);
+
+/**
+ * async_xor_val - attempt a xor parity check with a dma engine.
+ * @dest: destination page used if the xor is performed synchronously
+ * @src_list: array of source pages
+ * @offset: offset in pages to start transaction
+ * @src_cnt: number of source pages
+ * @len: length in bytes
+ * @result: 0 if sum == 0 else non-zero
+ * @submit: submission / completion modifiers
+ *
+ * honored flags: ASYNC_TX_ACK
+ *
+ * src_list note: if the dest is also a source it must be at index zero.
+ * The contents of this array will be overwritten if a scribble region
+ * is not specified.
+ */
+struct dma_async_tx_descriptor *
+async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
+ int src_cnt, size_t len, enum sum_check_flags *result,
+ struct async_submit_ctl *submit)
+{
+ return async_xor_val_offsets(dest, offset, src_list, NULL, src_cnt,
+ len, result, submit);
+}
EXPORT_SYMBOL_GPL(async_xor_val);
MODULE_AUTHOR("Intel Corporation");
diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h
index 75e582b8d2d9..8d79e2de06bd 100644
--- a/include/linux/async_tx.h
+++ b/include/linux/async_tx.h
@@ -162,11 +162,22 @@ struct dma_async_tx_descriptor *
async_xor(struct page *dest, struct page **src_list, unsigned int offset,
int src_cnt, size_t len, struct async_submit_ctl *submit);
+struct dma_async_tx_descriptor *
+async_xor_offsets(struct page *dest, unsigned int offset,
+ struct page **src_list, unsigned int *src_offset,
+ int src_cnt, size_t len, struct async_submit_ctl *submit);
+
struct dma_async_tx_descriptor *
async_xor_val(struct page *dest, struct page **src_list, unsigned int offset,
int src_cnt, size_t len, enum sum_check_flags *result,
struct async_submit_ctl *submit);
+struct dma_async_tx_descriptor *
+async_xor_val_offsets(struct page *dest, unsigned int offset,
+ struct page **src_list, unsigned int *src_offset,
+ int src_cnt, size_t len, enum sum_check_flags *result,
+ struct async_submit_ctl *submit);
+
struct dma_async_tx_descriptor *
async_memcpy(struct page *dest, struct page *src, unsigned int dest_offset,
unsigned int src_offset, size_t len,
--
2.21.3
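The new async_xor_offsets() API above falls back to a synchronous loop in which every source buffer contributes bytes starting at its own offset. A minimal user-space model of that per-source-offset XOR (a sketch: the name xor_offsets and the flat byte arrays are illustrative stand-ins for the kernel's struct page plus offset machinery, and the DMA path is omitted entirely):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of the synchronous fallback behind async_xor_offsets():
 * each of src_cnt sources contributes 'len' bytes starting at its
 * own offset src_offs[j], XORed into dest starting at dest_off.
 */
static void xor_offsets(unsigned char *dest, size_t dest_off,
			unsigned char *const *srcs, const size_t *src_offs,
			int src_cnt, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		unsigned char v = 0;

		for (int j = 0; j < src_cnt; j++)
			v ^= srcs[j][src_offs[j] + i];
		dest[dest_off + i] = v;
	}
}
```

With a common offset for every source this reduces to the old async_xor() behaviour, which is why the patch can implement async_xor() as a wrapper passing a NULL offset array.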
* [PATCH v3 07/11] md/raid5: add offset array in scribble buffer
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
` (5 preceding siblings ...)
2020-05-27 13:19 ` [PATCH v3 06/11] md/raid5: add new xor function to support different page offset Yufen Yu
@ 2020-05-27 13:19 ` Yufen Yu
2020-05-27 13:19 ` [PATCH v3 08/11] md/raid5: compute xor with correct page offset Yufen Yu
` (5 subsequent siblings)
12 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
When shared buffers are enabled for stripe_head, the xor computation
needs an offset array recording each page's offset. To avoid repeatedly
allocating a new array each time, add a memory region to the scribble
buffer to record the offsets.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
drivers/md/raid5.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4b7b5cc1ba1f..b97ebc7b5747 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1467,6 +1467,15 @@ static addr_conv_t *to_addr_conv(struct stripe_head *sh,
return (void *) (to_addr_page(percpu, i) + sh->disks + 2);
}
+/*
+ * Return a pointer to the region that records page offsets.
+ */
+static unsigned int *
+to_addr_offs(struct stripe_head *sh, struct raid5_percpu *percpu)
+{
+ return (unsigned int *) (to_addr_conv(sh, percpu, 0) + sh->disks + 2);
+}
+
static struct dma_async_tx_descriptor *
ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
{
@@ -2315,8 +2324,9 @@ static int scribble_alloc(struct raid5_percpu *percpu,
int num, int cnt, gfp_t flags)
{
size_t obj_size =
- sizeof(struct page *) * (num+2) +
- sizeof(addr_conv_t) * (num+2);
+ sizeof(struct page *) * (num + 2) +
+ sizeof(addr_conv_t) * (num + 2) +
+ sizeof(unsigned int) * (num + 2);
void *scribble;
scribble = kvmalloc_array(cnt, obj_size, flags);
--
2.21.3
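The scribble layout this patch produces can be sketched in user space. scribble_obj_size() and obj_offsets() below are illustrative stand-ins for the patch's obj_size computation and to_addr_offs(), assuming the same packing of each per-stripe object: num+2 page pointers, then num+2 addr_conv_t entries, then num+2 unsigned int offsets. The addr_conv_t here is only a pointer-sized placeholder for the kernel type:

```c
#include <assert.h>
#include <stddef.h>

/* Pointer-sized placeholder; stands in for the kernel's addr_conv_t
 * purely for the layout arithmetic below. */
typedef union { void *p; unsigned long ul; } addr_conv_t;

/* Per-object scribble size after the patch:
 * [num+2 page pointers][num+2 addr_conv_t][num+2 uint offsets] */
static size_t scribble_obj_size(int num)
{
	return sizeof(void *) * (num + 2) +
	       sizeof(addr_conv_t) * (num + 2) +
	       sizeof(unsigned int) * (num + 2);
}

/* The offset array begins right after the addr_conv region,
 * mirroring the pointer arithmetic in to_addr_offs(). */
static unsigned int *obj_offsets(void *obj, int num)
{
	void **pages = (void **)obj;
	addr_conv_t *conv = (addr_conv_t *)(pages + num + 2);

	return (unsigned int *)(conv + num + 2);
}
```

Growing obj_size rather than allocating a second array keeps the offsets in the same kvmalloc'd region that already holds the page pointers and address-conversion scratch space.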
* [PATCH v3 08/11] md/raid5: compute xor with correct page offset
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
` (6 preceding siblings ...)
2020-05-27 13:19 ` [PATCH v3 07/11] md/raid5: add offset array in scribble buffer Yufen Yu
@ 2020-05-27 13:19 ` Yufen Yu
2020-05-27 13:19 ` [PATCH v3 09/11] md/raid6: let syndrome computer support different " Yufen Yu
` (4 subsequent siblings)
12 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
When computing xor, the page addresses are passed to the compute
function. Now that pages may come from r5pages, we also need to pass
each page's offset so the function knows the correct location within
each page.
For now, raid5-cache and raid5-ppl are supported only when PAGE_SIZE is
4096. In that case shared pages are not used and dev->offset is '0', so
we can use that value directly.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
drivers/md/raid5.c | 64 ++++++++++++++++++++++++++++++++++++----------
1 file changed, 51 insertions(+), 13 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b97ebc7b5747..1a59c1db96ff 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1481,6 +1481,7 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
{
int disks = sh->disks;
struct page **xor_srcs = to_addr_page(percpu, 0);
+ unsigned int *offs = to_addr_offs(sh, percpu);
int target = sh->ops.target;
struct r5dev *tgt = &sh->dev[target];
struct page *xor_dest = tgt->page;
@@ -1488,6 +1489,7 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
struct dma_async_tx_descriptor *tx;
struct async_submit_ctl submit;
int i;
+ unsigned int des_offset = tgt->offset;
BUG_ON(sh->batch_head);
@@ -1495,18 +1497,23 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
__func__, (unsigned long long)sh->sector, target);
BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
- for (i = disks; i--; )
- if (i != target)
+ for (i = disks; i--; ) {
+ if (i != target) {
+ offs[count] = sh->dev[i].offset;
xor_srcs[count++] = sh->dev[i].page;
+ }
+ }
atomic_inc(&sh->count);
init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, NULL,
ops_complete_compute, sh, to_addr_conv(sh, percpu, 0));
if (unlikely(count == 1))
- tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
+ tx = async_memcpy(xor_dest, xor_srcs[0], des_offset, offs[0],
+ STRIPE_SIZE, &submit);
else
- tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
+ tx = async_xor_offsets(xor_dest, des_offset, xor_srcs, offs,
+ count, STRIPE_SIZE, &submit);
return tx;
}
@@ -1745,10 +1752,12 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
{
int disks = sh->disks;
struct page **xor_srcs = to_addr_page(percpu, 0);
+ unsigned int *offs = to_addr_offs(sh, percpu);
int count = 0, pd_idx = sh->pd_idx, i;
struct async_submit_ctl submit;
/* existing parity data subtracted */
+ unsigned int des_offset = offs[count] = sh->dev[pd_idx].offset;
struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
BUG_ON(sh->batch_head);
@@ -1758,15 +1767,23 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
/* Only process blocks that are known to be uptodate */
- if (test_bit(R5_InJournal, &dev->flags))
+ if (test_bit(R5_InJournal, &dev->flags)) {
+ /*
+ * In this case PAGE_SIZE must be 4KB, r5pages is not
+ * used, and dev->offset is zero.
+ */
+ offs[count] = dev->offset;
xor_srcs[count++] = dev->orig_page;
- else if (test_bit(R5_Wantdrain, &dev->flags))
+ } else if (test_bit(R5_Wantdrain, &dev->flags)) {
+ offs[count] = dev->offset;
xor_srcs[count++] = dev->page;
+ }
}
init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0));
- tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
+ tx = async_xor_offsets(xor_dest, des_offset, xor_srcs, offs,
+ count, STRIPE_SIZE, &submit);
return tx;
}
@@ -1916,6 +1933,7 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
{
int disks = sh->disks;
struct page **xor_srcs;
+ unsigned int *offs;
struct async_submit_ctl submit;
int count, pd_idx = sh->pd_idx, i;
struct page *xor_dest;
@@ -1924,6 +1942,7 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
int j = 0;
struct stripe_head *head_sh = sh;
int last_stripe;
+ unsigned int des_offset;
pr_debug("%s: stripe %llu\n", __func__,
(unsigned long long)sh->sector);
@@ -1940,27 +1959,37 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
ops_complete_reconstruct(sh);
return;
}
+
again:
count = 0;
xor_srcs = to_addr_page(percpu, j);
+ offs = to_addr_offs(sh, percpu);
/* check if prexor is active which means only process blocks
* that are part of a read-modify-write (written)
*/
if (head_sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
prexor = 1;
+ des_offset = offs[count] = sh->dev[pd_idx].offset;
xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
+
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
if (head_sh->dev[i].written ||
- test_bit(R5_InJournal, &head_sh->dev[i].flags))
+ test_bit(R5_InJournal, &head_sh->dev[i].flags)) {
+ offs[count] = dev->offset;
xor_srcs[count++] = dev->page;
+ }
}
} else {
xor_dest = sh->dev[pd_idx].page;
+ des_offset = sh->dev[pd_idx].offset;
+
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
- if (i != pd_idx)
+ if (i != pd_idx) {
+ offs[count] = dev->offset;
xor_srcs[count++] = dev->page;
+ }
}
}
@@ -1986,9 +2015,12 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
}
if (unlikely(count == 1))
- tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
+ tx = async_memcpy(xor_dest, xor_srcs[0], des_offset,
+ offs[0], STRIPE_SIZE, &submit);
else
- tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
+ tx = async_xor_offsets(xor_dest, des_offset, xor_srcs,
+ offs, count, STRIPE_SIZE, &submit);
+
if (!last_stripe) {
j++;
sh = list_first_entry(&sh->batch_list, struct stripe_head,
@@ -2076,10 +2108,12 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
int qd_idx = sh->qd_idx;
struct page *xor_dest;
struct page **xor_srcs = to_addr_page(percpu, 0);
+ unsigned int *offs = to_addr_offs(sh, percpu);
struct dma_async_tx_descriptor *tx;
struct async_submit_ctl submit;
int count;
int i;
+ unsigned int dest_offset;
pr_debug("%s: stripe %llu\n", __func__,
(unsigned long long)sh->sector);
@@ -2087,17 +2121,21 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
BUG_ON(sh->batch_head);
count = 0;
xor_dest = sh->dev[pd_idx].page;
+ dest_offset = sh->dev[pd_idx].offset;
+ offs[count] = dest_offset;
xor_srcs[count++] = xor_dest;
+
for (i = disks; i--; ) {
if (i == pd_idx || i == qd_idx)
continue;
+ offs[count] = sh->dev[i].offset;
xor_srcs[count++] = sh->dev[i].page;
}
init_async_submit(&submit, 0, NULL, NULL, NULL,
to_addr_conv(sh, percpu, 0));
- tx = async_xor_val(xor_dest, xor_srcs, 0, count, STRIPE_SIZE,
- &sh->ops.zero_sum_result, &submit);
+ tx = async_xor_val_offsets(xor_dest, dest_offset, xor_srcs, offs,
+ count, STRIPE_SIZE, &sh->ops.zero_sum_result, &submit);
atomic_inc(&sh->count);
init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_check, sh, NULL);
--
2.21.3
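With shared pages, each r5dev buffer is a STRIPE_SIZE slice of a larger page, and dev->offset locates that slice — which is exactly the value this patch threads into the xor paths. A hedged sketch of the index arithmetic (the MODEL_* constants and both helper names are illustrative assumptions, not the series' actual r5pages code):

```c
#include <assert.h>

/*
 * Slicing a large page into STRIPE_SIZE buffers when
 * STRIPE_SIZE < PAGE_SIZE. stripe_buf_offset() plays the role
 * that dev->offset plays in the xor paths above.
 */
#define MODEL_PAGE_SIZE		65536u	/* 64KB page, as on arm64 */
#define MODEL_STRIPE_SIZE	4096u	/* configurable stripe unit */
#define BUFS_PER_PAGE		(MODEL_PAGE_SIZE / MODEL_STRIPE_SIZE)

/* Which shared page holds the i-th disk buffer of a stripe_head. */
static unsigned int stripe_buf_page(unsigned int i)
{
	return i / BUFS_PER_PAGE;
}

/* Byte offset of the i-th disk buffer inside its shared page. */
static unsigned int stripe_buf_offset(unsigned int i)
{
	return (i % BUFS_PER_PAGE) * MODEL_STRIPE_SIZE;
}
```

On a 4KB-page machine BUFS_PER_PAGE is 1, every offset is 0, and the code degenerates to the old one-page-per-buffer scheme — consistent with the note above that raid5-cache/ppl rely on dev->offset being 0 when PAGE_SIZE is 4096.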
* [PATCH v3 09/11] md/raid6: let syndrome computer support different page offset
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
` (7 preceding siblings ...)
2020-05-27 13:19 ` [PATCH v3 08/11] md/raid5: compute xor with correct page offset Yufen Yu
@ 2020-05-27 13:19 ` Yufen Yu
2020-05-27 13:19 ` [PATCH v3 10/11] md/raid6: compute syndrome with correct " Yufen Yu
` (3 subsequent siblings)
12 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
For now, the syndrome compute functions require a common offset into
every page of the array. To support shared pages, these functions must
accept a different offset for each page.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
crypto/async_tx/async_pq.c | 71 +++++++-----
crypto/async_tx/async_raid6_recov.c | 161 ++++++++++++++++++++--------
include/linux/async_tx.h | 12 ++-
3 files changed, 172 insertions(+), 72 deletions(-)
diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
index 341ece61cf9b..1a4084e0984c 100644
--- a/crypto/async_tx/async_pq.c
+++ b/crypto/async_tx/async_pq.c
@@ -104,7 +104,7 @@ do_async_gen_syndrome(struct dma_chan *chan,
* do_sync_gen_syndrome - synchronously calculate a raid6 syndrome
*/
static void
-do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
+do_sync_gen_syndrome(struct page **blocks, unsigned int *offsets, int disks,
size_t len, struct async_submit_ctl *submit)
{
void **srcs;
@@ -121,7 +121,8 @@ do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
BUG_ON(i > disks - 3); /* P or Q can't be zero */
srcs[i] = (void*)raid6_empty_zero_page;
} else {
- srcs[i] = page_address(blocks[i]) + offset;
+ srcs[i] = page_address(blocks[i]) + offsets[i];
+
if (i < disks - 2) {
stop = i;
if (start == -1)
@@ -138,10 +139,23 @@ do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
async_tx_sync_epilog(submit);
}
+static inline bool
+is_dma_pq_aligned_offs(struct dma_device *dev, unsigned int *offs,
+ int src_cnt, size_t len)
+{
+ int i;
+
+ for (i = 0; i < src_cnt; i++) {
+ if (!is_dma_pq_aligned(dev, offs[i], 0, len))
+ return false;
+ }
+ return true;
+}
+
/**
* async_gen_syndrome - asynchronously calculate a raid6 syndrome
* @blocks: source blocks from idx 0..disks-3, P @ disks-2 and Q @ disks-1
- * @offset: common offset into each block (src and dest) to start transaction
+ * @offsets: offset array into each block (src and dest) to start transaction
* @disks: number of blocks (including missing P or Q, see below)
* @len: length of operation in bytes
* @submit: submission/completion modifiers
@@ -160,7 +174,7 @@ do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
* path.
*/
struct dma_async_tx_descriptor *
-async_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
+async_gen_syndrome(struct page **blocks, unsigned int *offsets, int disks,
size_t len, struct async_submit_ctl *submit)
{
int src_cnt = disks - 2;
@@ -179,7 +193,7 @@ async_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
if (unmap && !(submit->flags & ASYNC_TX_PQ_XOR_DST) &&
(src_cnt <= dma_maxpq(device, 0) ||
dma_maxpq(device, DMA_PREP_CONTINUE) > 0) &&
- is_dma_pq_aligned(device, offset, 0, len)) {
+ is_dma_pq_aligned_offs(device, offsets, disks, len)) {
struct dma_async_tx_descriptor *tx;
enum dma_ctrl_flags dma_flags = 0;
unsigned char coefs[MAX_DISKS];
@@ -196,8 +210,8 @@ async_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
for (i = 0, j = 0; i < src_cnt; i++) {
if (blocks[i] == NULL)
continue;
- unmap->addr[j] = dma_map_page(device->dev, blocks[i], offset,
- len, DMA_TO_DEVICE);
+ unmap->addr[j] = dma_map_page(device->dev, blocks[i],
+ offsets[i], len, DMA_TO_DEVICE);
coefs[j] = raid6_gfexp[i];
unmap->to_cnt++;
j++;
@@ -210,7 +224,8 @@ async_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
unmap->bidi_cnt++;
if (P(blocks, disks))
unmap->addr[j++] = dma_map_page(device->dev, P(blocks, disks),
- offset, len, DMA_BIDIRECTIONAL);
+ P(offsets, disks),
+ len, DMA_BIDIRECTIONAL);
else {
unmap->addr[j++] = 0;
dma_flags |= DMA_PREP_PQ_DISABLE_P;
@@ -219,7 +234,8 @@ async_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
unmap->bidi_cnt++;
if (Q(blocks, disks))
unmap->addr[j++] = dma_map_page(device->dev, Q(blocks, disks),
- offset, len, DMA_BIDIRECTIONAL);
+ Q(offsets, disks),
+ len, DMA_BIDIRECTIONAL);
else {
unmap->addr[j++] = 0;
dma_flags |= DMA_PREP_PQ_DISABLE_Q;
@@ -240,13 +256,13 @@ async_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
if (!P(blocks, disks)) {
P(blocks, disks) = pq_scribble_page;
- BUG_ON(len + offset > PAGE_SIZE);
+ P(offsets, disks) = 0;
}
if (!Q(blocks, disks)) {
Q(blocks, disks) = pq_scribble_page;
- BUG_ON(len + offset > PAGE_SIZE);
+ Q(offsets, disks) = 0;
}
- do_sync_gen_syndrome(blocks, offset, disks, len, submit);
+ do_sync_gen_syndrome(blocks, offsets, disks, len, submit);
return NULL;
}
@@ -278,9 +294,9 @@ pq_val_chan(struct async_submit_ctl *submit, struct page **blocks, int disks, si
* specified.
*/
struct dma_async_tx_descriptor *
-async_syndrome_val(struct page **blocks, unsigned int offset, int disks,
+async_syndrome_val(struct page **blocks, unsigned int *offsets, int disks,
size_t len, enum sum_check_flags *pqres, struct page *spare,
- struct async_submit_ctl *submit)
+ unsigned int s_off, struct async_submit_ctl *submit)
{
struct dma_chan *chan = pq_val_chan(submit, blocks, disks, len);
struct dma_device *device = chan ? chan->device : NULL;
@@ -295,7 +311,7 @@ async_syndrome_val(struct page **blocks, unsigned int offset, int disks,
unmap = dmaengine_get_unmap_data(device->dev, disks, GFP_NOWAIT);
if (unmap && disks <= dma_maxpq(device, 0) &&
- is_dma_pq_aligned(device, offset, 0, len)) {
+ is_dma_pq_aligned_offs(device, offsets, disks, len)) {
struct device *dev = device->dev;
dma_addr_t pq[2];
int i, j = 0, src_cnt = 0;
@@ -307,7 +323,7 @@ async_syndrome_val(struct page **blocks, unsigned int offset, int disks,
for (i = 0; i < disks-2; i++)
if (likely(blocks[i])) {
unmap->addr[j] = dma_map_page(dev, blocks[i],
- offset, len,
+ offsets[i], len,
DMA_TO_DEVICE);
coefs[j] = raid6_gfexp[i];
unmap->to_cnt++;
@@ -320,7 +336,7 @@ async_syndrome_val(struct page **blocks, unsigned int offset, int disks,
dma_flags |= DMA_PREP_PQ_DISABLE_P;
} else {
pq[0] = dma_map_page(dev, P(blocks, disks),
- offset, len,
+ P(offsets, disks), len,
DMA_TO_DEVICE);
unmap->addr[j++] = pq[0];
unmap->to_cnt++;
@@ -330,7 +346,7 @@ async_syndrome_val(struct page **blocks, unsigned int offset, int disks,
dma_flags |= DMA_PREP_PQ_DISABLE_Q;
} else {
pq[1] = dma_map_page(dev, Q(blocks, disks),
- offset, len,
+ Q(offsets, disks), len,
DMA_TO_DEVICE);
unmap->addr[j++] = pq[1];
unmap->to_cnt++;
@@ -355,7 +371,9 @@ async_syndrome_val(struct page **blocks, unsigned int offset, int disks,
async_tx_submit(chan, tx, submit);
} else {
struct page *p_src = P(blocks, disks);
+ unsigned int p_off = P(offsets, disks);
struct page *q_src = Q(blocks, disks);
+ unsigned int q_off = Q(offsets, disks);
enum async_tx_flags flags_orig = submit->flags;
dma_async_tx_callback cb_fn_orig = submit->cb_fn;
void *scribble = submit->scribble;
@@ -381,27 +399,32 @@ async_syndrome_val(struct page **blocks, unsigned int offset, int disks,
if (p_src) {
init_async_submit(submit, ASYNC_TX_XOR_ZERO_DST, NULL,
NULL, NULL, scribble);
- tx = async_xor(spare, blocks, offset, disks-2, len, submit);
+ tx = async_xor_offsets(spare, s_off,
+ blocks, offsets, disks-2, len, submit);
async_tx_quiesce(&tx);
- p = page_address(p_src) + offset;
- s = page_address(spare) + offset;
+ p = page_address(p_src) + p_off;
+ s = page_address(spare) + s_off;
*pqres |= !!memcmp(p, s, len) << SUM_CHECK_P;
}
if (q_src) {
P(blocks, disks) = NULL;
Q(blocks, disks) = spare;
+ Q(offsets, disks) = s_off;
init_async_submit(submit, 0, NULL, NULL, NULL, scribble);
- tx = async_gen_syndrome(blocks, offset, disks, len, submit);
+ tx = async_gen_syndrome(blocks, offsets, disks,
+ len, submit);
async_tx_quiesce(&tx);
- q = page_address(q_src) + offset;
- s = page_address(spare) + offset;
+ q = page_address(q_src) + q_off;
+ s = page_address(spare) + s_off;
*pqres |= !!memcmp(q, s, len) << SUM_CHECK_Q;
}
/* restore P, Q and submit */
P(blocks, disks) = p_src;
+ P(offsets, disks) = p_off;
Q(blocks, disks) = q_src;
+ Q(offsets, disks) = q_off;
submit->cb_fn = cb_fn_orig;
submit->cb_param = cb_param_orig;
diff --git a/crypto/async_tx/async_raid6_recov.c b/crypto/async_tx/async_raid6_recov.c
index f249142ceac4..219f7bf1c488 100644
--- a/crypto/async_tx/async_raid6_recov.c
+++ b/crypto/async_tx/async_raid6_recov.c
@@ -15,8 +15,9 @@
#include <linux/dmaengine.h>
static struct dma_async_tx_descriptor *
-async_sum_product(struct page *dest, struct page **srcs, unsigned char *coef,
- size_t len, struct async_submit_ctl *submit)
+async_sum_product(struct page *dest, unsigned int d_off,
+ struct page **srcs, unsigned int *src_offs, unsigned char *coef,
+ size_t len, struct async_submit_ctl *submit)
{
struct dma_chan *chan = async_tx_find_channel(submit, DMA_PQ,
&dest, 1, srcs, 2, len);
@@ -37,11 +38,14 @@ async_sum_product(struct page *dest, struct page **srcs, unsigned char *coef,
if (submit->flags & ASYNC_TX_FENCE)
dma_flags |= DMA_PREP_FENCE;
- unmap->addr[0] = dma_map_page(dev, srcs[0], 0, len, DMA_TO_DEVICE);
- unmap->addr[1] = dma_map_page(dev, srcs[1], 0, len, DMA_TO_DEVICE);
+ unmap->addr[0] = dma_map_page(dev, srcs[0], src_offs[0],
+ len, DMA_TO_DEVICE);
+ unmap->addr[1] = dma_map_page(dev, srcs[1], src_offs[1],
+ len, DMA_TO_DEVICE);
unmap->to_cnt = 2;
- unmap->addr[2] = dma_map_page(dev, dest, 0, len, DMA_BIDIRECTIONAL);
+ unmap->addr[2] = dma_map_page(dev, dest, d_off,
+ len, DMA_BIDIRECTIONAL);
unmap->bidi_cnt = 1;
/* engine only looks at Q, but expects it to follow P */
pq[1] = unmap->addr[2];
@@ -66,9 +70,9 @@ async_sum_product(struct page *dest, struct page **srcs, unsigned char *coef,
async_tx_quiesce(&submit->depend_tx);
amul = raid6_gfmul[coef[0]];
bmul = raid6_gfmul[coef[1]];
- a = page_address(srcs[0]);
- b = page_address(srcs[1]);
- c = page_address(dest);
+ a = page_address(srcs[0]) + src_offs[0];
+ b = page_address(srcs[1]) + src_offs[1];
+ c = page_address(dest) + d_off;
while (len--) {
ax = amul[*a++];
@@ -80,8 +84,9 @@ async_sum_product(struct page *dest, struct page **srcs, unsigned char *coef,
}
static struct dma_async_tx_descriptor *
-async_mult(struct page *dest, struct page *src, u8 coef, size_t len,
- struct async_submit_ctl *submit)
+async_mult(struct page *dest, unsigned int d_off, struct page *src,
+ unsigned int s_off, u8 coef, size_t len,
+ struct async_submit_ctl *submit)
{
struct dma_chan *chan = async_tx_find_channel(submit, DMA_PQ,
&dest, 1, &src, 1, len);
@@ -101,9 +106,11 @@ async_mult(struct page *dest, struct page *src, u8 coef, size_t len,
if (submit->flags & ASYNC_TX_FENCE)
dma_flags |= DMA_PREP_FENCE;
- unmap->addr[0] = dma_map_page(dev, src, 0, len, DMA_TO_DEVICE);
+ unmap->addr[0] = dma_map_page(dev, src, s_off,
+ len, DMA_TO_DEVICE);
unmap->to_cnt++;
- unmap->addr[1] = dma_map_page(dev, dest, 0, len, DMA_BIDIRECTIONAL);
+ unmap->addr[1] = dma_map_page(dev, dest, d_off,
+ len, DMA_BIDIRECTIONAL);
dma_dest[1] = unmap->addr[1];
unmap->bidi_cnt++;
unmap->len = len;
@@ -133,8 +140,8 @@ async_mult(struct page *dest, struct page *src, u8 coef, size_t len,
*/
async_tx_quiesce(&submit->depend_tx);
qmul = raid6_gfmul[coef];
- d = page_address(dest);
- s = page_address(src);
+ d = page_address(dest) + d_off;
+ s = page_address(src) + s_off;
while (len--)
*d++ = qmul[*s++];
@@ -144,11 +151,14 @@ async_mult(struct page *dest, struct page *src, u8 coef, size_t len,
static struct dma_async_tx_descriptor *
__2data_recov_4(int disks, size_t bytes, int faila, int failb,
- struct page **blocks, struct async_submit_ctl *submit)
+ struct page **blocks, unsigned int *offs,
+ struct async_submit_ctl *submit)
{
struct dma_async_tx_descriptor *tx = NULL;
struct page *p, *q, *a, *b;
+ unsigned int p_off, q_off, a_off, b_off;
struct page *srcs[2];
+ unsigned int src_offs[2];
unsigned char coef[2];
enum async_tx_flags flags = submit->flags;
dma_async_tx_callback cb_fn = submit->cb_fn;
@@ -156,26 +166,34 @@ __2data_recov_4(int disks, size_t bytes, int faila, int failb,
void *scribble = submit->scribble;
p = blocks[disks-2];
+ p_off = offs[disks-2];
q = blocks[disks-1];
+ q_off = offs[disks-1];
a = blocks[faila];
+ a_off = offs[faila];
b = blocks[failb];
+ b_off = offs[failb];
/* in the 4 disk case P + Pxy == P and Q + Qxy == Q */
/* Dx = A*(P+Pxy) + B*(Q+Qxy) */
srcs[0] = p;
+ src_offs[0] = p_off;
srcs[1] = q;
+ src_offs[1] = q_off;
coef[0] = raid6_gfexi[failb-faila];
coef[1] = raid6_gfinv[raid6_gfexp[faila]^raid6_gfexp[failb]];
init_async_submit(submit, ASYNC_TX_FENCE, tx, NULL, NULL, scribble);
- tx = async_sum_product(b, srcs, coef, bytes, submit);
+ tx = async_sum_product(b, b_off, srcs, src_offs, coef, bytes, submit);
/* Dy = P+Pxy+Dx */
srcs[0] = p;
+ src_offs[0] = p_off;
srcs[1] = b;
+ src_offs[1] = b_off;
init_async_submit(submit, flags | ASYNC_TX_XOR_ZERO_DST, tx, cb_fn,
cb_param, scribble);
- tx = async_xor(a, srcs, 0, 2, bytes, submit);
+ tx = async_xor_offsets(a, a_off, srcs, src_offs, 2, bytes, submit);
return tx;
@@ -183,11 +201,14 @@ __2data_recov_4(int disks, size_t bytes, int faila, int failb,
static struct dma_async_tx_descriptor *
__2data_recov_5(int disks, size_t bytes, int faila, int failb,
- struct page **blocks, struct async_submit_ctl *submit)
+ struct page **blocks, unsigned int *offs,
+ struct async_submit_ctl *submit)
{
struct dma_async_tx_descriptor *tx = NULL;
struct page *p, *q, *g, *dp, *dq;
+ unsigned int p_off, q_off, g_off, dp_off, dq_off;
struct page *srcs[2];
+ unsigned int src_offs[2];
unsigned char coef[2];
enum async_tx_flags flags = submit->flags;
dma_async_tx_callback cb_fn = submit->cb_fn;
@@ -208,60 +229,77 @@ __2data_recov_5(int disks, size_t bytes, int faila, int failb,
BUG_ON(good_srcs > 1);
p = blocks[disks-2];
+ p_off = offs[disks-2];
q = blocks[disks-1];
+ q_off = offs[disks-1];
g = blocks[good];
+ g_off = offs[good];
/* Compute syndrome with zero for the missing data pages
* Use the dead data pages as temporary storage for delta p and
* delta q
*/
dp = blocks[faila];
+ dp_off = offs[faila];
dq = blocks[failb];
+ dq_off = offs[failb];
init_async_submit(submit, ASYNC_TX_FENCE, tx, NULL, NULL, scribble);
- tx = async_memcpy(dp, g, 0, 0, bytes, submit);
+ tx = async_memcpy(dp, g, dp_off, g_off, bytes, submit);
init_async_submit(submit, ASYNC_TX_FENCE, tx, NULL, NULL, scribble);
- tx = async_mult(dq, g, raid6_gfexp[good], bytes, submit);
+ tx = async_mult(dq, dq_off, g, g_off,
+ raid6_gfexp[good], bytes, submit);
/* compute P + Pxy */
srcs[0] = dp;
+ src_offs[0] = dp_off;
srcs[1] = p;
+ src_offs[1] = p_off;
init_async_submit(submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
NULL, NULL, scribble);
- tx = async_xor(dp, srcs, 0, 2, bytes, submit);
+ tx = async_xor_offsets(dp, dp_off, srcs, src_offs, 2, bytes, submit);
/* compute Q + Qxy */
srcs[0] = dq;
+ src_offs[0] = dq_off;
srcs[1] = q;
+ src_offs[1] = q_off;
init_async_submit(submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
NULL, NULL, scribble);
- tx = async_xor(dq, srcs, 0, 2, bytes, submit);
+ tx = async_xor_offsets(dq, dq_off, srcs, src_offs, 2, bytes, submit);
/* Dx = A*(P+Pxy) + B*(Q+Qxy) */
srcs[0] = dp;
+ src_offs[0] = dp_off;
srcs[1] = dq;
+ src_offs[1] = dq_off;
coef[0] = raid6_gfexi[failb-faila];
coef[1] = raid6_gfinv[raid6_gfexp[faila]^raid6_gfexp[failb]];
init_async_submit(submit, ASYNC_TX_FENCE, tx, NULL, NULL, scribble);
- tx = async_sum_product(dq, srcs, coef, bytes, submit);
+ tx = async_sum_product(dq, dq_off, srcs, src_offs, coef, bytes, submit);
/* Dy = P+Pxy+Dx */
srcs[0] = dp;
+ src_offs[0] = dp_off;
srcs[1] = dq;
+ src_offs[1] = dq_off;
init_async_submit(submit, flags | ASYNC_TX_XOR_DROP_DST, tx, cb_fn,
cb_param, scribble);
- tx = async_xor(dp, srcs, 0, 2, bytes, submit);
+ tx = async_xor_offsets(dp, dp_off, srcs, src_offs, 2, bytes, submit);
return tx;
}
static struct dma_async_tx_descriptor *
__2data_recov_n(int disks, size_t bytes, int faila, int failb,
- struct page **blocks, struct async_submit_ctl *submit)
+ struct page **blocks, unsigned int *offs,
+ struct async_submit_ctl *submit)
{
struct dma_async_tx_descriptor *tx = NULL;
struct page *p, *q, *dp, *dq;
+ unsigned int p_off, q_off, dp_off, dq_off;
struct page *srcs[2];
+ unsigned int src_offs[2];
unsigned char coef[2];
enum async_tx_flags flags = submit->flags;
dma_async_tx_callback cb_fn = submit->cb_fn;
@@ -269,56 +307,74 @@ __2data_recov_n(int disks, size_t bytes, int faila, int failb,
void *scribble = submit->scribble;
p = blocks[disks-2];
+ p_off = offs[disks-2];
q = blocks[disks-1];
+ q_off = offs[disks-1];
/* Compute syndrome with zero for the missing data pages
* Use the dead data pages as temporary storage for
* delta p and delta q
*/
dp = blocks[faila];
+ dp_off = offs[faila];
blocks[faila] = NULL;
blocks[disks-2] = dp;
+ offs[disks-2] = dp_off;
dq = blocks[failb];
+ dq_off = offs[failb];
blocks[failb] = NULL;
blocks[disks-1] = dq;
+ offs[disks-1] = dq_off;
init_async_submit(submit, ASYNC_TX_FENCE, tx, NULL, NULL, scribble);
- tx = async_gen_syndrome(blocks, 0, disks, bytes, submit);
+ tx = async_gen_syndrome(blocks, offs, disks, bytes, submit);
/* Restore pointer table */
blocks[faila] = dp;
+ offs[faila] = dp_off;
blocks[failb] = dq;
+ offs[failb] = dq_off;
blocks[disks-2] = p;
+ offs[disks-2] = p_off;
blocks[disks-1] = q;
+ offs[disks-1] = q_off;
/* compute P + Pxy */
srcs[0] = dp;
+ src_offs[0] = dp_off;
srcs[1] = p;
+ src_offs[1] = p_off;
init_async_submit(submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
NULL, NULL, scribble);
- tx = async_xor(dp, srcs, 0, 2, bytes, submit);
+ tx = async_xor_offsets(dp, dp_off, srcs, src_offs, 2, bytes, submit);
/* compute Q + Qxy */
srcs[0] = dq;
+ src_offs[0] = dq_off;
srcs[1] = q;
+ src_offs[1] = q_off;
init_async_submit(submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
NULL, NULL, scribble);
- tx = async_xor(dq, srcs, 0, 2, bytes, submit);
+ tx = async_xor_offsets(dq, dq_off, srcs, src_offs, 2, bytes, submit);
/* Dx = A*(P+Pxy) + B*(Q+Qxy) */
srcs[0] = dp;
+ src_offs[0] = dp_off;
srcs[1] = dq;
+ src_offs[1] = dq_off;
coef[0] = raid6_gfexi[failb-faila];
coef[1] = raid6_gfinv[raid6_gfexp[faila]^raid6_gfexp[failb]];
init_async_submit(submit, ASYNC_TX_FENCE, tx, NULL, NULL, scribble);
- tx = async_sum_product(dq, srcs, coef, bytes, submit);
+ tx = async_sum_product(dq, dq_off, srcs, src_offs, coef, bytes, submit);
/* Dy = P+Pxy+Dx */
srcs[0] = dp;
+ src_offs[0] = dp_off;
srcs[1] = dq;
+ src_offs[1] = dq_off;
init_async_submit(submit, flags | ASYNC_TX_XOR_DROP_DST, tx, cb_fn,
cb_param, scribble);
- tx = async_xor(dp, srcs, 0, 2, bytes, submit);
+ tx = async_xor_offsets(dp, dp_off, srcs, src_offs, 2, bytes, submit);
return tx;
}
@@ -334,7 +390,8 @@ __2data_recov_n(int disks, size_t bytes, int faila, int failb,
*/
struct dma_async_tx_descriptor *
async_raid6_2data_recov(int disks, size_t bytes, int faila, int failb,
- struct page **blocks, struct async_submit_ctl *submit)
+ struct page **blocks, unsigned int *offs,
+ struct async_submit_ctl *submit)
{
void *scribble = submit->scribble;
int non_zero_srcs, i;
@@ -358,7 +415,7 @@ async_raid6_2data_recov(int disks, size_t bytes, int faila, int failb,
if (blocks[i] == NULL)
ptrs[i] = (void *) raid6_empty_zero_page;
else
- ptrs[i] = page_address(blocks[i]);
+ ptrs[i] = page_address(blocks[i]) + offs[i];
raid6_2data_recov(disks, bytes, faila, failb, ptrs);
@@ -383,16 +440,19 @@ async_raid6_2data_recov(int disks, size_t bytes, int faila, int failb,
* explicitly handle the special case of a 4 disk array with
* both data disks missing.
*/
- return __2data_recov_4(disks, bytes, faila, failb, blocks, submit);
+ return __2data_recov_4(disks, bytes, faila, failb,
+ blocks, offs, submit);
case 3:
/* dma devices do not uniformly understand a single
* source pq operation (in contrast to the synchronous
* case), so explicitly handle the special case of a 5 disk
* array with 2 of 3 data disks missing.
*/
- return __2data_recov_5(disks, bytes, faila, failb, blocks, submit);
+ return __2data_recov_5(disks, bytes, faila, failb,
+ blocks, offs, submit);
default:
- return __2data_recov_n(disks, bytes, faila, failb, blocks, submit);
+ return __2data_recov_n(disks, bytes, faila, failb,
+ blocks, offs, submit);
}
}
EXPORT_SYMBOL_GPL(async_raid6_2data_recov);
@@ -407,10 +467,12 @@ EXPORT_SYMBOL_GPL(async_raid6_2data_recov);
*/
struct dma_async_tx_descriptor *
async_raid6_datap_recov(int disks, size_t bytes, int faila,
- struct page **blocks, struct async_submit_ctl *submit)
+ struct page **blocks, unsigned int *offs,
+ struct async_submit_ctl *submit)
{
struct dma_async_tx_descriptor *tx = NULL;
struct page *p, *q, *dq;
+ unsigned int p_off, q_off, dq_off;
u8 coef;
enum async_tx_flags flags = submit->flags;
dma_async_tx_callback cb_fn = submit->cb_fn;
@@ -418,6 +480,7 @@ async_raid6_datap_recov(int disks, size_t bytes, int faila,
void *scribble = submit->scribble;
int good_srcs, good, i;
struct page *srcs[2];
+ unsigned int src_offs[2];
pr_debug("%s: disks: %d len: %zu\n", __func__, disks, bytes);
@@ -434,7 +497,7 @@ async_raid6_datap_recov(int disks, size_t bytes, int faila,
if (blocks[i] == NULL)
ptrs[i] = (void*)raid6_empty_zero_page;
else
- ptrs[i] = page_address(blocks[i]);
+ ptrs[i] = page_address(blocks[i]) + offs[i];
raid6_datap_recov(disks, bytes, faila, ptrs);
@@ -458,55 +521,67 @@ async_raid6_datap_recov(int disks, size_t bytes, int faila,
BUG_ON(good_srcs == 0);
p = blocks[disks-2];
+ p_off = offs[disks-2];
q = blocks[disks-1];
+ q_off = offs[disks-1];
/* Compute syndrome with zero for the missing data page
* Use the dead data page as temporary storage for delta q
*/
dq = blocks[faila];
+ dq_off = offs[faila];
blocks[faila] = NULL;
blocks[disks-1] = dq;
+ offs[disks-1] = dq_off;
/* in the 4-disk case we only need to perform a single source
* multiplication with the one good data block.
*/
if (good_srcs == 1) {
struct page *g = blocks[good];
+ unsigned int g_off = offs[good];
init_async_submit(submit, ASYNC_TX_FENCE, tx, NULL, NULL,
scribble);
- tx = async_memcpy(p, g, 0, 0, bytes, submit);
+ tx = async_memcpy(p, g, p_off, g_off, bytes, submit);
init_async_submit(submit, ASYNC_TX_FENCE, tx, NULL, NULL,
scribble);
- tx = async_mult(dq, g, raid6_gfexp[good], bytes, submit);
+ tx = async_mult(dq, dq_off, g, g_off,
+ raid6_gfexp[good], bytes, submit);
} else {
init_async_submit(submit, ASYNC_TX_FENCE, tx, NULL, NULL,
scribble);
- tx = async_gen_syndrome(blocks, 0, disks, bytes, submit);
+ tx = async_gen_syndrome(blocks, offs, disks, bytes, submit);
}
/* Restore pointer table */
blocks[faila] = dq;
+ offs[faila] = dq_off;
blocks[disks-1] = q;
+ offs[disks-1] = q_off;
/* calculate g^{-faila} */
coef = raid6_gfinv[raid6_gfexp[faila]];
srcs[0] = dq;
+ src_offs[0] = dq_off;
srcs[1] = q;
+ src_offs[1] = q_off;
init_async_submit(submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
NULL, NULL, scribble);
- tx = async_xor(dq, srcs, 0, 2, bytes, submit);
+ tx = async_xor_offsets(dq, dq_off, srcs, src_offs, 2, bytes, submit);
init_async_submit(submit, ASYNC_TX_FENCE, tx, NULL, NULL, scribble);
- tx = async_mult(dq, dq, coef, bytes, submit);
+ tx = async_mult(dq, dq_off, dq, dq_off, coef, bytes, submit);
srcs[0] = p;
+ src_offs[0] = p_off;
srcs[1] = dq;
+ src_offs[1] = dq_off;
init_async_submit(submit, flags | ASYNC_TX_XOR_DROP_DST, tx, cb_fn,
cb_param, scribble);
- tx = async_xor(p, srcs, 0, 2, bytes, submit);
+ tx = async_xor_offsets(p, p_off, srcs, src_offs, 2, bytes, submit);
return tx;
}
diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h
index 8d79e2de06bd..84d5cc5ff060 100644
--- a/include/linux/async_tx.h
+++ b/include/linux/async_tx.h
@@ -186,21 +186,23 @@ async_memcpy(struct page *dest, struct page *src, unsigned int dest_offset,
struct dma_async_tx_descriptor *async_trigger_callback(struct async_submit_ctl *submit);
struct dma_async_tx_descriptor *
-async_gen_syndrome(struct page **blocks, unsigned int offset, int src_cnt,
+async_gen_syndrome(struct page **blocks, unsigned int *offsets, int src_cnt,
size_t len, struct async_submit_ctl *submit);
struct dma_async_tx_descriptor *
-async_syndrome_val(struct page **blocks, unsigned int offset, int src_cnt,
+async_syndrome_val(struct page **blocks, unsigned int *offsets, int src_cnt,
size_t len, enum sum_check_flags *pqres, struct page *spare,
- struct async_submit_ctl *submit);
+ unsigned int s_off, struct async_submit_ctl *submit);
struct dma_async_tx_descriptor *
async_raid6_2data_recov(int src_num, size_t bytes, int faila, int failb,
- struct page **ptrs, struct async_submit_ctl *submit);
+ struct page **ptrs, unsigned int *offs,
+ struct async_submit_ctl *submit);
struct dma_async_tx_descriptor *
async_raid6_datap_recov(int src_num, size_t bytes, int faila,
- struct page **ptrs, struct async_submit_ctl *submit);
+ struct page **ptrs, unsigned int *offs,
+ struct async_submit_ctl *submit);
void async_tx_quiesce(struct dma_async_tx_descriptor **tx);
#endif /* _ASYNC_TX_H_ */
--
2.21.3
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v3 10/11] md/raid6: compute syndrome with correct page offset
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
` (8 preceding siblings ...)
2020-05-27 13:19 ` [PATCH v3 09/11] md/raid6: let syndrome computor support different " Yufen Yu
@ 2020-05-27 13:19 ` Yufen Yu
2020-05-27 13:19 ` [PATCH v3 11/11] raid6test: adaptation with syndrome function Yufen Yu
` (2 subsequent siblings)
12 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
When raid6 computes the syndrome, the page addresses are passed to the compute
functions. After adding support for sharing one page between multiple sh->dev,
we also need to tell the compute functions the correct offset of each page.
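The idea can be sketched outside the kernel as a pair of parallel arrays, one of pages and one of offsets, kept in step. This is a hypothetical simplification for illustration only: fake_dev and fill_sources are made-up names, not kernel API, and char pointers stand in for struct page.

```c
#include <assert.h> /* for the usage check below */

/* Hypothetical, simplified sketch (not kernel code): after this patch a
 * syndrome source is described by a (page, offset) pair, so the compute
 * functions read at page_address(page) + offset instead of assuming
 * offset 0. */
struct fake_dev {
	char *page;          /* stands in for struct page * */
	unsigned int offset; /* where this dev's data lives in the page */
};

/* Populate parallel source/offset arrays, mirroring how
 * set_syndrome_sources() fills srcs[] and offs[] together. */
static int fill_sources(char **srcs, unsigned int *offs,
			const struct fake_dev *dev, int disks)
{
	int count = 0;

	for (int i = 0; i < disks; i++) {
		srcs[count] = dev[i].page;
		offs[count] = dev[i].offset; /* offset travels with its page */
		count++;
	}
	return count;
}
```

With two devs sharing one backing page at different 4K offsets, both arrays end up aligned slot by slot, which is the invariant the async_tx changes below rely on.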
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
drivers/md/raid5.c | 63 +++++++++++++++++++++++++++++++++-------------
1 file changed, 45 insertions(+), 18 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1a59c1db96ff..5a886951b8b4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1520,6 +1520,7 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
/* set_syndrome_sources - populate source buffers for gen_syndrome
* @srcs - (struct page *) array of size sh->disks
+ * @offs - (unsigned int) array of offset for each page
* @sh - stripe_head to parse
*
* Populates srcs in proper layout order for the stripe and returns the
@@ -1528,6 +1529,7 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
* is recorded in srcs[count+1]].
*/
static int set_syndrome_sources(struct page **srcs,
+ unsigned int *offs,
struct stripe_head *sh,
int srctype)
{
@@ -1558,6 +1560,12 @@ static int set_syndrome_sources(struct page **srcs,
srcs[slot] = sh->dev[i].orig_page;
else
srcs[slot] = sh->dev[i].page;
+ /*
+ * For orig_page, PAGE_SIZE must be 4KB and
+ * r5pages is not used. In that case, dev[i].offset
+ * is 0, so we can use the value directly.
+ */
+ offs[slot] = sh->dev[i].offset;
}
i = raid6_next_disk(i, disks);
} while (i != d0_idx);
@@ -1570,12 +1578,14 @@ ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
{
int disks = sh->disks;
struct page **blocks = to_addr_page(percpu, 0);
+ unsigned int *offs = to_addr_offs(sh, percpu);
int target;
int qd_idx = sh->qd_idx;
struct dma_async_tx_descriptor *tx;
struct async_submit_ctl submit;
struct r5dev *tgt;
struct page *dest;
+ unsigned int dest_off;
int i;
int count;
@@ -1594,30 +1604,35 @@ ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
tgt = &sh->dev[target];
BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
dest = tgt->page;
+ dest_off = tgt->offset;
atomic_inc(&sh->count);
if (target == qd_idx) {
- count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL);
+ count = set_syndrome_sources(blocks, offs,
+ sh, SYNDROME_SRC_ALL);
blocks[count] = NULL; /* regenerating p is not necessary */
BUG_ON(blocks[count+1] != dest); /* q should already be set */
init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
ops_complete_compute, sh,
to_addr_conv(sh, percpu, 0));
- tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
+ tx = async_gen_syndrome(blocks, offs,
+ count+2, STRIPE_SIZE, &submit);
} else {
/* Compute any data- or p-drive using XOR */
count = 0;
for (i = disks; i-- ; ) {
if (i == target || i == qd_idx)
continue;
+ offs[count] = sh->dev[i].offset;
blocks[count++] = sh->dev[i].page;
}
init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST,
NULL, ops_complete_compute, sh,
to_addr_conv(sh, percpu, 0));
- tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE, &submit);
+ tx = async_xor_offsets(dest, dest_off, blocks, offs,
+ count, STRIPE_SIZE, &submit);
}
return tx;
@@ -1636,6 +1651,8 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
struct r5dev *tgt2 = &sh->dev[target2];
struct dma_async_tx_descriptor *tx;
struct page **blocks = to_addr_page(percpu, 0);
+ unsigned int *offs = to_addr_offs(sh, percpu);
+
struct async_submit_ctl submit;
BUG_ON(sh->batch_head);
@@ -1655,6 +1672,7 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
do {
int slot = raid6_idx_to_slot(i, sh, &count, syndrome_disks);
+ offs[slot] = sh->dev[i].offset;
blocks[slot] = sh->dev[i].page;
if (i == target)
@@ -1679,10 +1697,11 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
ops_complete_compute, sh,
to_addr_conv(sh, percpu, 0));
- return async_gen_syndrome(blocks, 0, syndrome_disks+2,
- STRIPE_SIZE, &submit);
+ return async_gen_syndrome(blocks, offs,
+ syndrome_disks+2, STRIPE_SIZE, &submit);
} else {
struct page *dest;
+ unsigned int dest_off;
int data_target;
int qd_idx = sh->qd_idx;
@@ -1696,21 +1715,24 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
for (i = disks; i-- ; ) {
if (i == data_target || i == qd_idx)
continue;
+ offs[count] = sh->dev[i].offset;
blocks[count++] = sh->dev[i].page;
}
dest = sh->dev[data_target].page;
+ dest_off = sh->dev[data_target].offset;
init_async_submit(&submit,
ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST,
NULL, NULL, NULL,
to_addr_conv(sh, percpu, 0));
- tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE,
- &submit);
+ tx = async_xor_offsets(dest, dest_off, blocks, offs,
+ count, STRIPE_SIZE, &submit);
- count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL);
+ count = set_syndrome_sources(blocks, offs,
+ sh, SYNDROME_SRC_ALL);
init_async_submit(&submit, ASYNC_TX_FENCE, tx,
ops_complete_compute, sh,
to_addr_conv(sh, percpu, 0));
- return async_gen_syndrome(blocks, 0, count+2,
+ return async_gen_syndrome(blocks, offs, count+2,
STRIPE_SIZE, &submit);
}
} else {
@@ -1721,12 +1743,12 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
/* We're missing D+P. */
return async_raid6_datap_recov(syndrome_disks+2,
STRIPE_SIZE, faila,
- blocks, &submit);
+ blocks, offs, &submit);
} else {
/* We're missing D+D. */
return async_raid6_2data_recov(syndrome_disks+2,
STRIPE_SIZE, faila, failb,
- blocks, &submit);
+ blocks, offs, &submit);
}
}
}
@@ -1793,17 +1815,18 @@ ops_run_prexor6(struct stripe_head *sh, struct raid5_percpu *percpu,
struct dma_async_tx_descriptor *tx)
{
struct page **blocks = to_addr_page(percpu, 0);
+ unsigned int *offs = to_addr_offs(sh, percpu);
int count;
struct async_submit_ctl submit;
pr_debug("%s: stripe %llu\n", __func__,
(unsigned long long)sh->sector);
- count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_WANT_DRAIN);
+ count = set_syndrome_sources(blocks, offs, sh, SYNDROME_SRC_WANT_DRAIN);
init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_PQ_XOR_DST, tx,
ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0));
- tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
+ tx = async_gen_syndrome(blocks, offs, count+2, STRIPE_SIZE, &submit);
return tx;
}
@@ -2035,6 +2058,7 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
{
struct async_submit_ctl submit;
struct page **blocks;
+ unsigned int *offs;
int count, i, j = 0;
struct stripe_head *head_sh = sh;
int last_stripe;
@@ -2059,6 +2083,7 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
again:
blocks = to_addr_page(percpu, j);
+ offs = to_addr_offs(sh, percpu);
if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
synflags = SYNDROME_SRC_WRITTEN;
@@ -2068,7 +2093,7 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
txflags = ASYNC_TX_ACK;
}
- count = set_syndrome_sources(blocks, sh, synflags);
+ count = set_syndrome_sources(blocks, offs, sh, synflags);
last_stripe = !head_sh->batch_head ||
list_first_entry(&sh->batch_list,
struct stripe_head, batch_list) == head_sh;
@@ -2080,7 +2105,7 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
} else
init_async_submit(&submit, 0, tx, NULL, NULL,
to_addr_conv(sh, percpu, j));
- tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
+ tx = async_gen_syndrome(blocks, offs, count+2, STRIPE_SIZE, &submit);
if (!last_stripe) {
j++;
sh = list_first_entry(&sh->batch_list, struct stripe_head,
@@ -2145,6 +2170,7 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu, int checkp)
{
struct page **srcs = to_addr_page(percpu, 0);
+ unsigned int *offs = to_addr_offs(sh, percpu);
struct async_submit_ctl submit;
int count;
@@ -2152,15 +2178,16 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu
(unsigned long long)sh->sector, checkp);
BUG_ON(sh->batch_head);
- count = set_syndrome_sources(srcs, sh, SYNDROME_SRC_ALL);
+ count = set_syndrome_sources(srcs, offs, sh, SYNDROME_SRC_ALL);
if (!checkp)
srcs[count] = NULL;
atomic_inc(&sh->count);
init_async_submit(&submit, ASYNC_TX_ACK, NULL, ops_complete_check,
sh, to_addr_conv(sh, percpu, 0));
- async_syndrome_val(srcs, 0, count+2, STRIPE_SIZE,
- &sh->ops.zero_sum_result, percpu->spare_page, &submit);
+ async_syndrome_val(srcs, offs, count+2, STRIPE_SIZE,
+ &sh->ops.zero_sum_result,
+ percpu->spare_page, 0, &submit);
}
static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
--
2.21.3
* [PATCH v3 11/11] raid6test: adaptation with syndrome function
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
` (9 preceding siblings ...)
2020-05-27 13:19 ` [PATCH v3 10/11] md/raid6: compute syndrome with correct " Yufen Yu
@ 2020-05-27 13:19 ` Yufen Yu
2020-05-28 14:10 ` [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Song Liu
2020-05-28 22:07 ` Guoqing Jiang
12 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-27 13:19 UTC (permalink / raw)
To: song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, xni, houtao1, yuyufen
After changing some syndrome functions to support different page offsets,
we also need to adapt the raid6test module. In this module, pages are allocated
by the module itself and their offsets are all '0'.
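A minimal sketch of the adaptation (illustrative, outside the kernel): the scalar offset 0 that callers used to pass now becomes a per-page offset array, which raid6test simply fills with zeros since it allocates whole pages itself. The helper name init_offsets is made up for the sketch.

```c
#include <assert.h> /* for the usage check below */

#define NDISKS 64 /* including P and Q, as in raid6test */

/* Per-page offsets that now accompany the page pointer array; raid6test
 * owns its pages, so every offset is 0. (Sketch, not the module code.) */
static unsigned int dataoffs[NDISKS];

/* Mirrors the makedata() change: initialise the offset of each disk
 * alongside its page pointer. */
static void init_offsets(int disks)
{
	for (int i = 0; i < disks; i++)
		dataoffs[i] = 0;
}
```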
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
crypto/async_tx/raid6test.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/crypto/async_tx/raid6test.c b/crypto/async_tx/raid6test.c
index 14e73dcd7475..66db82e5a3b1 100644
--- a/crypto/async_tx/raid6test.c
+++ b/crypto/async_tx/raid6test.c
@@ -18,6 +18,7 @@
#define NDISKS 64 /* Including P and Q */
static struct page *dataptrs[NDISKS];
+unsigned int dataoffs[NDISKS];
static addr_conv_t addr_conv[NDISKS];
static struct page *data[NDISKS+3];
static struct page *spare;
@@ -38,6 +39,7 @@ static void makedata(int disks)
for (i = 0; i < disks; i++) {
prandom_bytes(page_address(data[i]), PAGE_SIZE);
dataptrs[i] = data[i];
+ dataoffs[i] = 0;
}
}
@@ -52,7 +54,8 @@ static char disk_type(int d, int disks)
}
/* Recover two failed blocks. */
-static void raid6_dual_recov(int disks, size_t bytes, int faila, int failb, struct page **ptrs)
+static void raid6_dual_recov(int disks, size_t bytes, int faila, int failb,
+ struct page **ptrs, unsigned int *offs)
{
struct async_submit_ctl submit;
struct completion cmp;
@@ -66,7 +69,8 @@ static void raid6_dual_recov(int disks, size_t bytes, int faila, int failb, stru
if (faila == disks-2) {
/* P+Q failure. Just rebuild the syndrome. */
init_async_submit(&submit, 0, NULL, NULL, NULL, addr_conv);
- tx = async_gen_syndrome(ptrs, 0, disks, bytes, &submit);
+ tx = async_gen_syndrome(ptrs, offs,
+ disks, bytes, &submit);
} else {
struct page *blocks[NDISKS];
struct page *dest;
@@ -89,22 +93,26 @@ static void raid6_dual_recov(int disks, size_t bytes, int faila, int failb, stru
tx = async_xor(dest, blocks, 0, count, bytes, &submit);
init_async_submit(&submit, 0, tx, NULL, NULL, addr_conv);
- tx = async_gen_syndrome(ptrs, 0, disks, bytes, &submit);
+ tx = async_gen_syndrome(ptrs, offs,
+ disks, bytes, &submit);
}
} else {
if (failb == disks-2) {
/* data+P failure. */
init_async_submit(&submit, 0, NULL, NULL, NULL, addr_conv);
- tx = async_raid6_datap_recov(disks, bytes, faila, ptrs, &submit);
+ tx = async_raid6_datap_recov(disks, bytes,
+ faila, ptrs, offs, &submit);
} else {
/* data+data failure. */
init_async_submit(&submit, 0, NULL, NULL, NULL, addr_conv);
- tx = async_raid6_2data_recov(disks, bytes, faila, failb, ptrs, &submit);
+ tx = async_raid6_2data_recov(disks, bytes,
+ faila, failb, ptrs, offs, &submit);
}
}
init_completion(&cmp);
init_async_submit(&submit, ASYNC_TX_ACK, tx, callback, &cmp, addr_conv);
- tx = async_syndrome_val(ptrs, 0, disks, bytes, &result, spare, &submit);
+ tx = async_syndrome_val(ptrs, offs,
+ disks, bytes, &result, spare, 0, &submit);
async_tx_issue_pending(tx);
if (wait_for_completion_timeout(&cmp, msecs_to_jiffies(3000)) == 0)
@@ -126,7 +134,7 @@ static int test_disks(int i, int j, int disks)
dataptrs[i] = recovi;
dataptrs[j] = recovj;
- raid6_dual_recov(disks, PAGE_SIZE, i, j, dataptrs);
+ raid6_dual_recov(disks, PAGE_SIZE, i, j, dataptrs, dataoffs);
erra = memcmp(page_address(data[i]), page_address(recovi), PAGE_SIZE);
errb = memcmp(page_address(data[j]), page_address(recovj), PAGE_SIZE);
@@ -162,7 +170,7 @@ static int test(int disks, int *tests)
/* Generate assumed good syndrome */
init_completion(&cmp);
init_async_submit(&submit, ASYNC_TX_ACK, NULL, callback, &cmp, addr_conv);
- tx = async_gen_syndrome(dataptrs, 0, disks, PAGE_SIZE, &submit);
+ tx = async_gen_syndrome(dataptrs, dataoffs, disks, PAGE_SIZE, &submit);
async_tx_issue_pending(tx);
if (wait_for_completion_timeout(&cmp, msecs_to_jiffies(3000)) == 0) {
--
2.21.3
* Re: [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE
2020-05-27 13:19 ` [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE Yufen Yu
@ 2020-05-27 13:54 ` Guoqing Jiang
2020-05-27 23:30 ` John Stoffel
2020-05-28 6:17 ` Yufen Yu
2020-05-27 15:16 ` Xiao Ni
` (2 subsequent siblings)
3 siblings, 2 replies; 31+ messages in thread
From: Guoqing Jiang @ 2020-05-27 13:54 UTC (permalink / raw)
To: Yufen Yu, song; +Cc: linux-raid, neilb, colyli, xni, houtao1
Hi,
On 5/27/20 3:19 PM, Yufen Yu wrote:
> +config MD_RAID456_STRIPE_SHIFT
> + int "RAID4/RAID5/RAID6 stripe size shift"
> + default "1"
> + depends on MD_RAID456
> + help
> + When set the value as 'N', stripe size will be set as 'N << 9',
> + which is a multiple of 4KB.
If it is 'N << 9', it seems you are converting it to sectors; do you actually
mean 'N << 12'?
> +
> + The default value is 1, that means the default stripe size is
> + 4096(1 << 9). Just setting as a bigger value when PAGE_SIZE is
> + bigger than 4096. In that case, you can set it as 2(8KB),
> + 4(16K), 16(64K).
So with the above description, the algorithm should be 2 << 12 = 8KB and
so on.
> +
> + When you try to set a big value, likely 16 on arm64 with 64KB
> + PAGE_SIZE, that means, you know size of each io that issued to
> + raid device is more than 4096. Otherwise just use default value.
> +
> + Normally, using default value can get better performance.
> + Only change this value if you know what you are doing.
> +
> +
> config MD_MULTIPATH
> tristate "Multipath I/O support"
> depends on BLK_DEV_MD
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index f90e0704bed9..b25f107dafc7 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -472,7 +472,9 @@ struct disk_info {
> */
>
> #define NR_STRIPES 256
> -#define STRIPE_SIZE PAGE_SIZE
> +#define CONFIG_STRIPE_SIZE (CONFIG_MD_RAID456_STRIPE_SHIFT << 9)
> +#define STRIPE_SIZE \
> + (CONFIG_STRIPE_SIZE > PAGE_SIZE ? PAGE_SIZE : CONFIG_STRIPE_SIZE)
If I am not misunderstanding, you need to s/9/12/ above.
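The arithmetic behind this review point can be spelled out. The macro names below are illustrative, not the real Kconfig symbols:

```c
#include <assert.h> /* for the checks below */

/* "N << 9" counts in 512-byte sectors; the quoted help text, however,
 * talks about multiples of 4 KB, which corresponds to "N << 12". */
#define STRIPE_BYTES_AS_POSTED(n)    ((n) << 9)  /* what the macro computes */
#define STRIPE_BYTES_AS_DESCRIBED(n) ((n) << 12) /* what the help text says */
```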
Thanks,
Guoqing
* Re: [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE
2020-05-27 13:19 ` [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE Yufen Yu
2020-05-27 13:54 ` Guoqing Jiang
@ 2020-05-27 15:16 ` Xiao Ni
2020-05-28 6:29 ` Yufen Yu
2020-05-27 20:21 ` kbuild test robot
2020-05-28 14:23 ` Song Liu
3 siblings, 1 reply; 31+ messages in thread
From: Xiao Ni @ 2020-05-27 15:16 UTC (permalink / raw)
To: Yufen Yu, song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, houtao1
On 05/27/2020 09:19 PM, Yufen Yu wrote:
> In RAID5, if the issued bio size is bigger than STRIPE_SIZE, it will be split
> into STRIPE_SIZE units and processed one by one. Even for a size less than
> STRIPE_SIZE, RAID5 still requests at least STRIPE_SIZE of data from disk.
>
> Nowadays, STRIPE_SIZE is equal to the value of PAGE_SIZE. Since filesystems
> usually issue bios in units of 4KB, there is no problem when PAGE_SIZE is
> 4KB. But with a 64KB PAGE_SIZE, a bio from the filesystem requests 4KB of
> data while RAID5 issues IO of at least STRIPE_SIZE (64KB) each time. That
> wastes disk bandwidth and xor computation.
>
> To avoid the waste, we add a new CONFIG option to adjust STRIPE_SIZE. The
> default value is 4096. Users can also set a value bigger than 4KB for
> special requirements, such as when the issued IO size is known to be more
> than 4KB.
>
> To evaluate the new feature, we create raid5 device '/dev/md5' with
> 4 SSD disk and test it on arm64 machine with 64KB PAGE_SIZE.
>
> 1) We format /dev/md5 with mkfs.ext4 and mount ext4 with default
> configure on /mnt directory. Then, trying to test it by dbench with
> command: dbench -D /mnt -t 1000 10. Result show as:
>
> 'STRIPE_SIZE = 64KB'
>
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 9805011 0.021 64.728
> Close 7202525 0.001 0.120
> Rename 415213 0.051 44.681
> Unlink 1980066 0.079 93.147
> Deltree 240 1.793 6.516
> Mkdir 120 0.004 0.007
> Qpathinfo 8887512 0.007 37.114
> Qfileinfo 1557262 0.001 0.030
> Qfsinfo 1629582 0.012 0.152
> Sfileinfo 798756 0.040 57.641
> Find 3436004 0.019 57.782
> WriteX 4887239 0.021 57.638
> ReadX 15370483 0.005 37.818
> LockX 31934 0.003 0.022
> UnlockX 31933 0.001 0.021
> Flush 687205 13.302 530.088
>
> Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms
> -------------------------------------------------------
>
> 'STRIPE_SIZE = 4KB'
>
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 11999166 0.021 36.380
> Close 8814128 0.001 0.122
> Rename 508113 0.051 29.169
> Unlink 2423242 0.070 38.141
> Deltree 300 1.885 7.155
> Mkdir 150 0.004 0.006
> Qpathinfo 10875921 0.007 35.485
> Qfileinfo 1905837 0.001 0.032
> Qfsinfo 1994304 0.012 0.125
> Sfileinfo 977450 0.029 26.489
> Find 4204952 0.019 9.361
> WriteX 5981890 0.019 27.804
> ReadX 18809742 0.004 33.491
> LockX 39074 0.003 0.025
> UnlockX 39074 0.001 0.014
> Flush 841022 10.712 458.848
>
> Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms
> -------------------------------------------------------
>
> It shows that setting STRIPE_SIZE to 4KB gives higher throughput
> (376.777 vs 307.799 MB/sec) and smaller max latency (458.852 vs
> 530.091 ms) than setting it to 64KB.
>
> 2) We try to evaluate IO throughput for /dev/md5 by fio with config:
>
> [4KB randwrite]
> direct=1
> numjobs=2
> iodepth=64
> ioengine=libaio
> filename=/dev/md5
> bs=4KB
> rw=randwrite
>
> [1MB write]
> direct=1
> numjobs=2
> iodepth=64
> ioengine=libaio
> filename=/dev/md5
> bs=1MB
> rw=write
>
> The result as follow:
>
> +---------------+-------------------+------------------+
> |               | STRIPE_SIZE(64KB) | STRIPE_SIZE(4KB) |
> +---------------+-------------------+------------------+
> | 4KB randwrite |      15MB/s       |     100MB/s      |
> +---------------+-------------------+------------------+
> | 1MB write     |     1000MB/s      |     700MB/s      |
> +---------------+-------------------+------------------+
>
> The results show that when the IO size is bigger than 4KB (here 1MB),
> 64KB STRIPE_SIZE gets much higher throughput. But for 4KB randwrite,
> where the IOs issued to the device are smaller, 4KB STRIPE_SIZE
> performs better.
>
> Thus, we provide a config option to set STRIPE_SIZE when PAGE_SIZE is bigger
> than 4096. Normally, the default value (4096) gives relatively good
> performance. But if each issued IO is bigger than 4096, setting a larger
> value may give better performance.
>
> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
> ---
> drivers/md/Kconfig | 21 +++++++++++++++++++++
> drivers/md/raid5.h | 4 +++-
> 2 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index d6d5ab23c088..629324f92c42 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -157,6 +157,27 @@ config MD_RAID456
>
> If unsure, say Y.
>
> +config MD_RAID456_STRIPE_SHIFT
> + int "RAID4/RAID5/RAID6 stripe size shift"
> + default "1"
> + depends on MD_RAID456
> + help
> + When set the value as 'N', stripe size will be set as 'N << 9',
> + which is a multiple of 4KB.
> +
> + The default value is 1, that means the default stripe size is
> + 4096(1 << 9). Just setting as a bigger value when PAGE_SIZE is
> + bigger than 4096. In that case, you can set it as 2(8KB),
> + 4(16K), 16(64K).
> +
> + When you try to set a big value, likely 16 on arm64 with 64KB
> + PAGE_SIZE, that means, you know size of each io that issued to
> + raid device is more than 4096. Otherwise just use default value.
> +
> + Normally, using default value can get better performance.
> + Only change this value if you know what you are doing.
> +
> +
> config MD_MULTIPATH
> tristate "Multipath I/O support"
> depends on BLK_DEV_MD
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index f90e0704bed9..b25f107dafc7 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -472,7 +472,9 @@ struct disk_info {
> */
>
> #define NR_STRIPES 256
> -#define STRIPE_SIZE PAGE_SIZE
> +#define CONFIG_STRIPE_SIZE (CONFIG_MD_RAID456_STRIPE_SHIFT << 9)
> +#define STRIPE_SIZE \
> + (CONFIG_STRIPE_SIZE > PAGE_SIZE ? PAGE_SIZE : CONFIG_STRIPE_SIZE)
Hi Yufen
Is this what you want? Or should it be:
+#define STRIPE_SIZE \
+ (CONFIG_STRIPE_SIZE > PAGE_SIZE ? CONFIG_STRIPE_SIZE : PAGE_SIZE)
> #define STRIPE_SHIFT (PAGE_SHIFT - 9)
> #define STRIPE_SECTORS (STRIPE_SIZE>>9)
> #define IO_THRESHOLD 1
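For reference, the ternary as posted picks the smaller of CONFIG_STRIPE_SIZE and PAGE_SIZE, i.e. it clamps the configured size to at most one page; whether that clamping is the intended direction is exactly the question above. A standalone sketch with illustrative constants (the MY_ names are made up, not the kernel macros):

```c
#include <assert.h> /* for the check below */

#define MY_PAGE_SIZE 4096u
#define MY_CONFIG_STRIPE_SIZE 65536u /* e.g. shift configured for 64 KB */
/* Same shape as the posted macro: evaluates to the smaller of the two. */
#define MY_STRIPE_SIZE \
	(MY_CONFIG_STRIPE_SIZE > MY_PAGE_SIZE ? MY_PAGE_SIZE : MY_CONFIG_STRIPE_SIZE)
```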
* Re: [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE
2020-05-27 13:19 ` [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE Yufen Yu
@ 2020-05-27 20:21 ` kbuild test robot
2020-05-27 15:16 ` Xiao Ni
` (2 subsequent siblings)
3 siblings, 0 replies; 31+ messages in thread
From: kbuild test robot @ 2020-05-27 20:21 UTC (permalink / raw)
To: song
Cc: kbuild-all, linux-raid, neilb, guoqing.jiang, colyli, xni,
houtao1, yuyufen
[-- Attachment #1: Type: text/plain, Size: 4352 bytes --]
Hi Yufen,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on cryptodev/master]
[also build test ERROR on crypto/master linus/master v5.7-rc7 next-20200526]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
url: https://github.com/0day-ci/linux/commits/Yufen-Yu/md-raid5-set-STRIPE_SIZE-as-a-configurable-value/20200527-212526
base: https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git master
config: ia64-defconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=ia64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>
All error/warnings (new ones prefixed by >>, old ones prefixed by <<):
In file included from drivers/md/raid0.c:20:
drivers/md/raid5.h: In function 'r5_next_bio':
>> drivers/md/raid5.h:475:29: error: 'CONFIG_MD_RAID456_STRIPE_SHIFT' undeclared (first use in this function)
475 | #define CONFIG_STRIPE_SIZE (CONFIG_MD_RAID456_STRIPE_SHIFT << 9)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/md/raid5.h:477:3: note: in expansion of macro 'CONFIG_STRIPE_SIZE'
477 | (CONFIG_STRIPE_SIZE > PAGE_SIZE ? PAGE_SIZE : CONFIG_STRIPE_SIZE)
| ^~~~~~~~~~~~~~~~~~
>> drivers/md/raid5.h:479:26: note: in expansion of macro 'STRIPE_SIZE'
479 | #define STRIPE_SECTORS (STRIPE_SIZE>>9)
| ^~~~~~~~~~~
>> drivers/md/raid5.h:497:37: note: in expansion of macro 'STRIPE_SECTORS'
497 | if (bio_end_sector(bio) < sector + STRIPE_SECTORS)
| ^~~~~~~~~~~~~~
drivers/md/raid5.h:475:29: note: each undeclared identifier is reported only once for each function it appears in
475 | #define CONFIG_STRIPE_SIZE (CONFIG_MD_RAID456_STRIPE_SHIFT << 9)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/md/raid5.h:477:3: note: in expansion of macro 'CONFIG_STRIPE_SIZE'
477 | (CONFIG_STRIPE_SIZE > PAGE_SIZE ? PAGE_SIZE : CONFIG_STRIPE_SIZE)
| ^~~~~~~~~~~~~~~~~~
>> drivers/md/raid5.h:479:26: note: in expansion of macro 'STRIPE_SIZE'
479 | #define STRIPE_SECTORS (STRIPE_SIZE>>9)
| ^~~~~~~~~~~
>> drivers/md/raid5.h:497:37: note: in expansion of macro 'STRIPE_SECTORS'
497 | if (bio_end_sector(bio) < sector + STRIPE_SECTORS)
| ^~~~~~~~~~~~~~
vim +/CONFIG_MD_RAID456_STRIPE_SHIFT +475 drivers/md/raid5.h
469
470 /*
471 * Stripe cache
472 */
473
474 #define NR_STRIPES 256
> 475 #define CONFIG_STRIPE_SIZE (CONFIG_MD_RAID456_STRIPE_SHIFT << 9)
476 #define STRIPE_SIZE \
> 477 (CONFIG_STRIPE_SIZE > PAGE_SIZE ? PAGE_SIZE : CONFIG_STRIPE_SIZE)
478 #define STRIPE_SHIFT (PAGE_SHIFT - 9)
> 479 #define STRIPE_SECTORS (STRIPE_SIZE>>9)
480 #define IO_THRESHOLD 1
481 #define BYPASS_THRESHOLD 1
482 #define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head))
483 #define HASH_MASK (NR_HASH - 1)
484 #define MAX_STRIPE_BATCH 8
485
486 /* bio's attached to a stripe+device for I/O are linked together in bi_sector
487 * order without overlap. There may be several bio's per stripe+device, and
488 * a bio could span several devices.
489 * When walking this list for a particular stripe+device, we must never proceed
490 * beyond a bio that extends past this device, as the next bio might no longer
491 * be valid.
492 * This function is used to determine the 'next' bio in the list, given the
493 * sector of the current stripe+device
494 */
495 static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
496 {
> 497 if (bio_end_sector(bio) < sector + STRIPE_SECTORS)
498 return bio->bi_next;
499 else
500 return NULL;
501 }
502
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 20062 bytes --]
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE
2020-05-27 13:54 ` Guoqing Jiang
@ 2020-05-27 23:30 ` John Stoffel
2020-05-28 6:17 ` Yufen Yu
1 sibling, 0 replies; 31+ messages in thread
From: John Stoffel @ 2020-05-27 23:30 UTC (permalink / raw)
To: Guoqing Jiang; +Cc: Yufen Yu, song, linux-raid, neilb, colyli, xni, houtao1
>>>>> "Guoqing" == Guoqing Jiang <guoqing.jiang@cloud.ionos.com> writes:
Guoqing> Hi,
Guoqing> On 5/27/20 3:19 PM, Yufen Yu wrote:
>> +config MD_RAID456_STRIPE_SHIFT
>> + int "RAID4/RAID5/RAID6 stripe size shift"
>> + default "1"
>> + depends on MD_RAID456
>> + help
>> + When set the value as 'N', stripe size will be set as 'N << 9',
>> + which is a multiple of 4KB.
Guoqing> If 'N << 9', then seems you are convert it to sector, do you actually
Guoqing> mean 'N << 12'?
Aren't there helpers that can be used here instead of semi-magic
numbers? At the least, the 9 and 12 should be #defines with good
names, or using the standard PAGE_SIZE and other defines.
>> +
>> + The default value is 1, that means the default stripe size is
>> + 4096(1 << 9). Just setting as a bigger value when PAGE_SIZE is
>> + bigger than 4096. In that case, you can set it as 2(8KB),
>> + 4(16K), 16(64K).
Guoqing> So with the above description, the algorithm should be 2 << 12 = 8KB and
Guoqing> so on.
>> +
>> + When you try to set a big value, likely 16 on arm64 with 64KB
>> + PAGE_SIZE, that means, you know size of each io that issued to
>> + raid device is more than 4096. Otherwise just use default value.
>> +
Cheers,
John
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE
2020-05-27 13:54 ` Guoqing Jiang
2020-05-27 23:30 ` John Stoffel
@ 2020-05-28 6:17 ` Yufen Yu
1 sibling, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-28 6:17 UTC (permalink / raw)
To: Guoqing Jiang, song; +Cc: linux-raid, neilb, colyli, xni, houtao1
On 2020/5/27 21:54, Guoqing Jiang wrote:
> Hi,
>
> On 5/27/20 3:19 PM, Yufen Yu wrote:
>> +config MD_RAID456_STRIPE_SHIFT
>> + int "RAID4/RAID5/RAID6 stripe size shift"
>> + default "1"
>> + depends on MD_RAID456
>> + help
>> + When set the value as 'N', stripe size will be set as 'N << 9',
>> + which is a multiple of 4KB.
>
> If 'N << 9', then seems you are convert it to sector, do you actually mean 'N << 12'?
>
>> +
>> + The default value is 1, that means the default stripe size is
>> + 4096(1 << 9). Just setting as a bigger value when PAGE_SIZE is
>> + bigger than 4096. In that case, you can set it as 2(8KB),
>> + 4(16K), 16(64K).
>
> So with the above description, the algorithm should be 2 << 12 = 8KB and so on.
>
>> +
>> + When you try to set a big value, likely 16 on arm64 with 64KB
>> + PAGE_SIZE, that means, you know size of each io that issued to
>> + raid device is more than 4096. Otherwise just use default value.
>> +
>> + Normally, using default value can get better performance.
>> + Only change this value if you know what you are doing.
>> +
>> +
>> config MD_MULTIPATH
>> tristate "Multipath I/O support"
>> depends on BLK_DEV_MD
>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>> index f90e0704bed9..b25f107dafc7 100644
>> --- a/drivers/md/raid5.h
>> +++ b/drivers/md/raid5.h
>> @@ -472,7 +472,9 @@ struct disk_info {
>> */
>> #define NR_STRIPES 256
>> -#define STRIPE_SIZE PAGE_SIZE
>> +#define CONFIG_STRIPE_SIZE (CONFIG_MD_RAID456_STRIPE_SHIFT << 9)
>> +#define STRIPE_SIZE \
>> + (CONFIG_STRIPE_SIZE > PAGE_SIZE ? PAGE_SIZE : CONFIG_STRIPE_SIZE)
>
> If I am not misunderstand, you need to s/9/12/ above.
Oh yeah, thanks a lot for catching this. Sorry for this obvious error.
It should be 12; then STRIPE_SIZE can be a multiple of 4KB, as Song Liu suggested.
Thanks,
Yufen
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE
2020-05-27 15:16 ` Xiao Ni
@ 2020-05-28 6:29 ` Yufen Yu
0 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-28 6:29 UTC (permalink / raw)
To: Xiao Ni, song; +Cc: linux-raid, neilb, guoqing.jiang, colyli, houtao1
On 2020/5/27 23:16, Xiao Ni wrote:
>
>
> On 05/27/2020 09:19 PM, Yufen Yu wrote:
>> In RAID5, if issued bio size is bigger than STRIPE_SIZE, it will be split
>> in the unit of STRIPE_SIZE and process them one by one. Even for size
>> less then STRIPE_SIZE, RAID5 also request data from disk at least of
>> STRIPE_SIZE.
>>
>> Nowdays, STRIPE_SIZE is equal to the value of PAGE_SIZE. Since filesystem
>> usually issue bio in the unit of 4KB, there is no problem for PAGE_SIZE as
>> 4KB. But, for 64KB PAGE_SIZE, bio from filesystem requests 4KB data while
>> RAID5 issue IO at least STRIPE_SIZE (64KB) each time. That will waste
>> resource of disk bandwidth and compute xor.
>>
>> To avoding the waste, we want to add a new CONFIG option to adjust
>> STREIPE_SIZE. Default value is 4096. User can also set the value bigger
>> than 4KB for some special requirements, such as we know the issued io
>> size is more than 4KB.
>>
>> To evaluate the new feature, we create raid5 device '/dev/md5' with
>> 4 SSD disk and test it on arm64 machine with 64KB PAGE_SIZE.
>>
>> 1) We format /dev/md5 with mkfs.ext4 and mount ext4 with default
>> configure on /mnt directory. Then, trying to test it by dbench with
>> command: dbench -D /mnt -t 1000 10. Result show as:
>>
>> 'STRIPE_SIZE = 64KB'
>>
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> NTCreateX 9805011 0.021 64.728
>> Close 7202525 0.001 0.120
>> Rename 415213 0.051 44.681
>> Unlink 1980066 0.079 93.147
>> Deltree 240 1.793 6.516
>> Mkdir 120 0.004 0.007
>> Qpathinfo 8887512 0.007 37.114
>> Qfileinfo 1557262 0.001 0.030
>> Qfsinfo 1629582 0.012 0.152
>> Sfileinfo 798756 0.040 57.641
>> Find 3436004 0.019 57.782
>> WriteX 4887239 0.021 57.638
>> ReadX 15370483 0.005 37.818
>> LockX 31934 0.003 0.022
>> UnlockX 31933 0.001 0.021
>> Flush 687205 13.302 530.088
>>
>> Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms
>> -------------------------------------------------------
>>
>> 'STRIPE_SIZE = 4KB'
>>
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> NTCreateX 11999166 0.021 36.380
>> Close 8814128 0.001 0.122
>> Rename 508113 0.051 29.169
>> Unlink 2423242 0.070 38.141
>> Deltree 300 1.885 7.155
>> Mkdir 150 0.004 0.006
>> Qpathinfo 10875921 0.007 35.485
>> Qfileinfo 1905837 0.001 0.032
>> Qfsinfo 1994304 0.012 0.125
>> Sfileinfo 977450 0.029 26.489
>> Find 4204952 0.019 9.361
>> WriteX 5981890 0.019 27.804
>> ReadX 18809742 0.004 33.491
>> LockX 39074 0.003 0.025
>> UnlockX 39074 0.001 0.014
>> Flush 841022 10.712 458.848
>>
>> Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms
>> -------------------------------------------------------
>>
>> It show that setting STREIP_SIZE as 4KB has higher thoughput, i.e.
>> (376.777 vs 307.799) and has smaller latency (530.091 vs 458.852)
>> than that setting as 64KB.
>>
>> 2) We try to evaluate IO throughput for /dev/md5 by fio with config:
>>
>> [4KB randwrite]
>> direct=1
>> numjob=2
>> iodepth=64
>> ioengine=libaio
>> filename=/dev/md5
>> bs=4KB
>> rw=randwrite
>>
>> [64KB write]
>> direct=1
>> numjob=2
>> iodepth=64
>> ioengine=libaio
>> filename=/dev/md5
>> bs=1MB
>> rw=write
>>
>> The result as follow:
>>
>> + +
>> | STRIPE_SIZE(64KB) | STRIPE_SIZE(4KB)
>> +----------------------------------------------------+
>> 4KB randwrite | 15MB/s | 100MB/s
>> +----------------------------------------------------+
>> 1MB write | 1000MB/s | 700MB/s
>>
>> The result show that when size of io is bigger than 4KB (64KB),
>> 64KB STRIPE_SIZE has much higher IOPS. But for 4KB randwrite, that
>> means, size of io issued to device are smaller, 4KB STRIPE_SIZE
>> have better performance.
>>
>> Thus, we provide a configure to set STRIPE_SIZE when PAGE_SIZE is bigger
>> than 4096. Normally, default value (4096) can get relatively good
>> performance. But if each issued io is bigger than 4096, setting value more
>> than 4096 may get better performance.
>>
>> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
>> ---
>> drivers/md/Kconfig | 21 +++++++++++++++++++++
>> drivers/md/raid5.h | 4 +++-
>> 2 files changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
>> index d6d5ab23c088..629324f92c42 100644
>> --- a/drivers/md/Kconfig
>> +++ b/drivers/md/Kconfig
>> @@ -157,6 +157,27 @@ config MD_RAID456
>> If unsure, say Y.
>> +config MD_RAID456_STRIPE_SHIFT
>> + int "RAID4/RAID5/RAID6 stripe size shift"
>> + default "1"
>> + depends on MD_RAID456
>> + help
>> + When set the value as 'N', stripe size will be set as 'N << 9',
>> + which is a multiple of 4KB.
>> +
>> + The default value is 1, that means the default stripe size is
>> + 4096(1 << 9). Just setting as a bigger value when PAGE_SIZE is
>> + bigger than 4096. In that case, you can set it as 2(8KB),
>> + 4(16K), 16(64K).
>> +
>> + When you try to set a big value, likely 16 on arm64 with 64KB
>> + PAGE_SIZE, that means, you know size of each io that issued to
>> + raid device is more than 4096. Otherwise just use default value.
>> +
>> + Normally, using default value can get better performance.
>> + Only change this value if you know what you are doing.
>> +
>> +
>> config MD_MULTIPATH
>> tristate "Multipath I/O support"
>> depends on BLK_DEV_MD
>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>> index f90e0704bed9..b25f107dafc7 100644
>> --- a/drivers/md/raid5.h
>> +++ b/drivers/md/raid5.h
>> @@ -472,7 +472,9 @@ struct disk_info {
>> */
>> #define NR_STRIPES 256
>> -#define STRIPE_SIZE PAGE_SIZE
>> +#define CONFIG_STRIPE_SIZE (CONFIG_MD_RAID456_STRIPE_SHIFT << 9)
>> +#define STRIPE_SIZE \
>> + (CONFIG_STRIPE_SIZE > PAGE_SIZE ? PAGE_SIZE : CONFIG_STRIPE_SIZE)
> Hi Yufen
>
> Is it what you want? Or it should be:
>
> +#define STRIPE_SIZE \
> + (CONFIG_STRIPE_SIZE > PAGE_SIZE ? CONFIG_STRIPE_SIZE : PAGE_SIZE)
>> #define STRIPE_SHIFT (PAGE_SHIFT - 9)
>> #define STRIPE_SECTORS (STRIPE_SIZE>>9)
>> #define IO_THRESHOLD 1
>
Yes, this is what I want.
STRIPE_SIZE should not be bigger than PAGE_SIZE. So, if CONFIG_STRIPE_SIZE
is bigger, we just clamp it to PAGE_SIZE.
Thanks,
Yufen
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
` (10 preceding siblings ...)
2020-05-27 13:19 ` [PATCH v3 11/11] raid6test: adaptation with syndrome function Yufen Yu
@ 2020-05-28 14:10 ` Song Liu
2020-05-28 14:28 ` Song Liu
2020-05-28 22:07 ` Guoqing Jiang
12 siblings, 1 reply; 31+ messages in thread
From: Song Liu @ 2020-05-28 14:10 UTC (permalink / raw)
To: Yufen Yu; +Cc: linux-raid, NeilBrown, Guoqing Jiang, Coly Li, Xiao Ni, Hou Tao
On Wed, May 27, 2020 at 6:20 AM Yufen Yu <yuyufen@huawei.com> wrote:
>
> Hi, all
>
> For now, STRIPE_SIZE is equal to the value of PAGE_SIZE. That means, RAID5 will
> issus echo bio to disk at least 64KB when PAGE_SIZE is 64KB in arm64. However,
> filesystem usually issue bio in the unit of 4KB. Then, RAID5 will waste resource
> of disk bandwidth.
>
Thanks for the patch set.
Since this is a big change, I am planning to process this set after
upcoming merge window.
Please let me know if you need it urgently.
Song
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE
2020-05-27 13:19 ` [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE Yufen Yu
` (2 preceding siblings ...)
2020-05-27 20:21 ` kbuild test robot
@ 2020-05-28 14:23 ` Song Liu
2020-05-29 8:42 ` Yufen Yu
3 siblings, 1 reply; 31+ messages in thread
From: Song Liu @ 2020-05-28 14:23 UTC (permalink / raw)
To: Yufen Yu; +Cc: linux-raid, NeilBrown, Guoqing Jiang, Coly Li, Xiao Ni, Hou Tao
On Wed, May 27, 2020 at 6:20 AM Yufen Yu <yuyufen@huawei.com> wrote:
>
[...]
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index f90e0704bed9..b25f107dafc7 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -472,7 +472,9 @@ struct disk_info {
> */
>
> #define NR_STRIPES 256
> -#define STRIPE_SIZE PAGE_SIZE
> +#define CONFIG_STRIPE_SIZE (CONFIG_MD_RAID456_STRIPE_SHIFT << 9)
> +#define STRIPE_SIZE \
> + (CONFIG_STRIPE_SIZE > PAGE_SIZE ? PAGE_SIZE : CONFIG_STRIPE_SIZE)
> #define STRIPE_SHIFT (PAGE_SHIFT - 9)
I think we also need to update STRIPE_SHIFT.
Thanks,
Song
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value
2020-05-28 14:10 ` [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Song Liu
@ 2020-05-28 14:28 ` Song Liu
2020-05-29 9:32 ` Yufen Yu
0 siblings, 1 reply; 31+ messages in thread
From: Song Liu @ 2020-05-28 14:28 UTC (permalink / raw)
To: Yufen Yu; +Cc: linux-raid, NeilBrown, Guoqing Jiang, Coly Li, Xiao Ni, Hou Tao
On Thu, May 28, 2020 at 7:10 AM Song Liu <song@kernel.org> wrote:
>
> On Wed, May 27, 2020 at 6:20 AM Yufen Yu <yuyufen@huawei.com> wrote:
> >
> > Hi, all
> >
> > For now, STRIPE_SIZE is equal to the value of PAGE_SIZE. That means, RAID5 will
> > issus echo bio to disk at least 64KB when PAGE_SIZE is 64KB in arm64. However,
> > filesystem usually issue bio in the unit of 4KB. Then, RAID5 will waste resource
> > of disk bandwidth.
> >
>
> Thanks for the patch set.
>
> Since this is a big change, I am planning to process this set after
> upcoming merge window.
> Please let me know if you need it urgently.
I haven't thought about this in detail yet: how about compatibility?
Say we create an array with STRIPE_SIZE of 4kB, does it work well after
we upgrade the kernel to have STRIPE_SIZE of 8kB?
Thanks,
Song
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
` (11 preceding siblings ...)
2020-05-28 14:10 ` [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Song Liu
@ 2020-05-28 22:07 ` Guoqing Jiang
2020-05-29 11:49 ` Yufen Yu
12 siblings, 1 reply; 31+ messages in thread
From: Guoqing Jiang @ 2020-05-28 22:07 UTC (permalink / raw)
To: Yufen Yu, song; +Cc: linux-raid, neilb, colyli, xni, houtao1
On 5/27/20 3:19 PM, Yufen Yu wrote:
> Hi, all
>
> For now, STRIPE_SIZE is equal to the value of PAGE_SIZE. That means, RAID5 will
> issus echo bio to disk at least 64KB when PAGE_SIZE is 64KB in arm64. However,
> filesystem usually issue bio in the unit of 4KB. Then, RAID5 will waste resource
> of disk bandwidth.
Could you explain a little bit about "waste resource"? Does it mean the
chance for a full stripe write is limited because of the incompatibility
between the fs (4KB bio) and raid5 (64KB stripe unit)?
> To solve the problem, this patchset provide a new config CONFIG_MD_RAID456_STRIPE_SHIFT
> to let user config STRIPE_SIZE. The default value is 1, means 4096(1<<9).
>
> Normally, using default STRIPE_SIZE can get better performance. And NeilBrown have
> suggested just to fix the STRIPE_SIZE as 4096.But, out test result show that
> big value of STRIPE_SIZE may have better performance when size of issued IOs are
> mostly bigger than 4096. Thus, in this patchset, we still want to set STRIPE_SIZE
> as a configureable value.
I think it is better to define the stripe size as 4K if it fits the general
scenario, and it also aligns with the fs.
> In current implementation, grow_buffers() uses alloc_page() to allocate the buffers
> for each stripe_head. With the change, it means we allocate 64K buffers but just
> use 4K of them. To save memory, we try to 'compress' multiple buffers of stripe_head
> to only one real page. Detail shows in patch #2.
>
> To evaluate the new feature, we create raid5 device '/dev/md5' with 4 SSD disk
> and test it on arm64 machine with 64KB PAGE_SIZE.
>
> 1) We format /dev/md5 with mkfs.ext4 and mount ext4 with default configure on
> /mnt directory. Then, trying to test it by dbench with command:
> dbench -D /mnt -t 1000 10. Result show as:
>
> 'STRIPE_SHIFT = 64KB'
>
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 9805011 0.021 64.728
> Close 7202525 0.001 0.120
> Rename 415213 0.051 44.681
> Unlink 1980066 0.079 93.147
> Deltree 240 1.793 6.516
> Mkdir 120 0.004 0.007
> Qpathinfo 8887512 0.007 37.114
> Qfileinfo 1557262 0.001 0.030
> Qfsinfo 1629582 0.012 0.152
> Sfileinfo 798756 0.040 57.641
> Find 3436004 0.019 57.782
> WriteX 4887239 0.021 57.638
> ReadX 15370483 0.005 37.818
> LockX 31934 0.003 0.022
> UnlockX 31933 0.001 0.021
> Flush 687205 13.302 530.088
>
> Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms
> -------------------------------------------------------
>
> 'STRIPE_SIZE = 4KB'
>
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 11999166 0.021 36.380
> Close 8814128 0.001 0.122
> Rename 508113 0.051 29.169
> Unlink 2423242 0.070 38.141
> Deltree 300 1.885 7.155
> Mkdir 150 0.004 0.006
> Qpathinfo 10875921 0.007 35.485
> Qfileinfo 1905837 0.001 0.032
> Qfsinfo 1994304 0.012 0.125
> Sfileinfo 977450 0.029 26.489
> Find 4204952 0.019 9.361
> WriteX 5981890 0.019 27.804
> ReadX 18809742 0.004 33.491
> LockX 39074 0.003 0.025
> UnlockX 39074 0.001 0.014
> Flush 841022 10.712 458.848
>
> Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms
> -------------------------------------------------------
What is the default io unit size of dbench?
> 2) We try to evaluate IO throughput for /dev/md5 by fio with config:
>
> [4KB randwrite]
> direct=1
> numjob=2
> iodepth=64
> ioengine=libaio
> filename=/dev/md5
> bs=4KB
> rw=randwrite
>
> [64KB write]
> direct=1
> numjob=2
> iodepth=64
> ioengine=libaio
> filename=/dev/md5
> bs=1MB
> rw=write
>
> The fio test result as follow:
>
> + +
> | STRIPE_SIZE(64KB) | STRIPE_SIZE(4KB)
> +----------------------------------------------------+
> 4KB randwrite | 15MB/s | 100MB/s
> +----------------------------------------------------+
> 1MB write | 1000MB/s | 700MB/s
>
> The result show that when size of io is bigger than 4KB (64KB),
> 64KB STRIPE_SIZE has much higher IOPS. But for 4KB randwrite, that
> means, size of io issued to device are smaller, 4KB STRIPE_SIZE
> have better performance.
The 4k randwrite performance drops from 100MB/s to 15MB/s?! How about other
io sizes, say 16k, 64K, 256K etc? It would be more convincing if the 64KB
stripe had better performance than the 4KB stripe overall.
Thanks,
Guoqing
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE
2020-05-28 14:23 ` Song Liu
@ 2020-05-29 8:42 ` Yufen Yu
0 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-29 8:42 UTC (permalink / raw)
To: Song Liu; +Cc: linux-raid, NeilBrown, Guoqing Jiang, Coly Li, Xiao Ni, Hou Tao
On 2020/5/28 22:23, Song Liu wrote:
> On Wed, May 27, 2020 at 6:20 AM Yufen Yu <yuyufen@huawei.com> wrote:
>>
> [...]
>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>> index f90e0704bed9..b25f107dafc7 100644
>> --- a/drivers/md/raid5.h
>> +++ b/drivers/md/raid5.h
>> @@ -472,7 +472,9 @@ struct disk_info {
>> */
>>
>> #define NR_STRIPES 256
>> -#define STRIPE_SIZE PAGE_SIZE
>> +#define CONFIG_STRIPE_SIZE (CONFIG_MD_RAID456_STRIPE_SHIFT << 9)
>> +#define STRIPE_SIZE \
>> + (CONFIG_STRIPE_SIZE > PAGE_SIZE ? PAGE_SIZE : CONFIG_STRIPE_SIZE)
>> #define STRIPE_SHIFT (PAGE_SHIFT - 9)
>
> I think we also need to update STRIPE_SHIFT.
>
Yeah, I forgot to update it. Thanks for catching this.
Thanks,
Yufen
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value
2020-05-28 14:28 ` Song Liu
@ 2020-05-29 9:32 ` Yufen Yu
0 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-05-29 9:32 UTC (permalink / raw)
To: Song Liu; +Cc: linux-raid, NeilBrown, Guoqing Jiang, Coly Li, Xiao Ni, Hou Tao
On 2020/5/28 22:28, Song Liu wrote:
> On Thu, May 28, 2020 at 7:10 AM Song Liu <song@kernel.org> wrote:
>>
>> On Wed, May 27, 2020 at 6:20 AM Yufen Yu <yuyufen@huawei.com> wrote:
>>>
>>> Hi, all
>>>
>>> For now, STRIPE_SIZE is equal to the value of PAGE_SIZE. That means, RAID5 will
>>> issus echo bio to disk at least 64KB when PAGE_SIZE is 64KB in arm64. However,
>>> filesystem usually issue bio in the unit of 4KB. Then, RAID5 will waste resource
>>> of disk bandwidth.
>>>
>>
>> Thanks for the patch set.
>>
>> Since this is a big change, I am planning to process this set after
>> upcoming merge window.
>> Please let me know if you need it urgently.
I agree with your plan.
>
> I haven't thought about this in detail yet: how about compatibility?
> Say we create an
> array with STRIPE_SIZE of 4kB, does it work well after we upgrade kernel to have
> STRIPE_SIZE of 8kB?
Each time we upgrade the kernel, we need to reboot the system to make the
change effective. The system will then allocate new stripe_heads and buffers
based on the new STRIPE_SIZE, and everything will go on running.
STRIPE_SIZE decides the size of each io issued to the array's disks and the
buffer size in each stripe_head. Since new buffers and stripe_heads are
allocated based on the new STRIPE_SIZE on each restart, it does not matter
what the value was before.
So, I think changing STRIPE_SIZE is not a problem when upgrading the kernel.
Thanks,
Yufen
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value
2020-05-28 22:07 ` Guoqing Jiang
@ 2020-05-29 11:49 ` Yufen Yu
2020-05-29 12:22 ` Guoqing Jiang
0 siblings, 1 reply; 31+ messages in thread
From: Yufen Yu @ 2020-05-29 11:49 UTC (permalink / raw)
To: Guoqing Jiang, song; +Cc: linux-raid, neilb, colyli, xni, houtao1
On 2020/5/29 6:07, Guoqing Jiang wrote:
> On 5/27/20 3:19 PM, Yufen Yu wrote:
>> Hi, all
>>
>> For now, STRIPE_SIZE is equal to the value of PAGE_SIZE. That means, RAID5 will
>> issus echo bio to disk at least 64KB when PAGE_SIZE is 64KB in arm64. However,
>> filesystem usually issue bio in the unit of 4KB. Then, RAID5 will waste resource
>> of disk bandwidth.
>
> Could you explain a little bit about "waste resource"? Does it mean the chance for
> full stripe write is limited because of the incompatible between fs (4KB bio) and
> raid5 (64KB stripe unit)?
Applications may request 4KB of data, but RAID5 will issue a 64KB bio to disk,
which wastes disk bandwidth and costs more cpu time to compute xor. Detailed
performance data can be seen in a previous email:
https://www.spinics.net/lists/raid/msg64261.html
>
>> To solve the problem, this patchset provide a new config CONFIG_MD_RAID456_STRIPE_SHIFT
>> to let user config STRIPE_SIZE. The default value is 1, means 4096(1<<9).
>>
>> Normally, using default STRIPE_SIZE can get better performance. And NeilBrown have
>> suggested just to fix the STRIPE_SIZE as 4096.But, out test result show that
>> big value of STRIPE_SIZE may have better performance when size of issued IOs are
>> mostly bigger than 4096. Thus, in this patchset, we still want to set STRIPE_SIZE
>> as a configureable value.
>
> I think it is better to define stripe size as 4K if it fits the general
> scenario and also aligns with the fs.
>
>> In the current implementation, grow_buffers() uses alloc_page() to allocate
>> the buffers for each stripe_head. With the change, that means we allocate
>> 64KB of buffers but use only 4KB of them. To save memory, we try to
>> 'compress' multiple buffers of a stripe_head into only one real page.
>> Details are in patch #2.
>>
>> To evaluate the new feature, we create a raid5 device '/dev/md5' with 4 SSD
>> disks and test it on an arm64 machine with a 64KB PAGE_SIZE.
>> 1) We format /dev/md5 with mkfs.ext4 and mount it with the default ext4
>> configuration on the /mnt directory. Then we test it with dbench using the
>> command: dbench -D /mnt -t 1000 10. Results:
>> 'STRIPE_SIZE = 64KB'
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> NTCreateX 9805011 0.021 64.728
>> Close 7202525 0.001 0.120
>> Rename 415213 0.051 44.681
>> Unlink 1980066 0.079 93.147
>> Deltree 240 1.793 6.516
>> Mkdir 120 0.004 0.007
>> Qpathinfo 8887512 0.007 37.114
>> Qfileinfo 1557262 0.001 0.030
>> Qfsinfo 1629582 0.012 0.152
>> Sfileinfo 798756 0.040 57.641
>> Find 3436004 0.019 57.782
>> WriteX 4887239 0.021 57.638
>> ReadX 15370483 0.005 37.818
>> LockX 31934 0.003 0.022
>> UnlockX 31933 0.001 0.021
>> Flush 687205 13.302 530.088
>> Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms
>> -------------------------------------------------------
>> 'STRIPE_SIZE = 4KB'
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> NTCreateX 11999166 0.021 36.380
>> Close 8814128 0.001 0.122
>> Rename 508113 0.051 29.169
>> Unlink 2423242 0.070 38.141
>> Deltree 300 1.885 7.155
>> Mkdir 150 0.004 0.006
>> Qpathinfo 10875921 0.007 35.485
>> Qfileinfo 1905837 0.001 0.032
>> Qfsinfo 1994304 0.012 0.125
>> Sfileinfo 977450 0.029 26.489
>> Find 4204952 0.019 9.361
>> WriteX 5981890 0.019 27.804
>> ReadX 18809742 0.004 33.491
>> LockX 39074 0.003 0.025
>> UnlockX 39074 0.001 0.014
>> Flush 841022 10.712 458.848
>> Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms
>> -------------------------------------------------------
>
> What is the default io unit size of dbench?
Since dbench runs on an ext4 filesystem, I think most IOs are about 4KB in size.
>
>> 2) We evaluate the IO throughput of /dev/md5 with fio, using these configs:
>> [4KB randwrite]
>> direct=1
>> numjobs=2
>> iodepth=64
>> ioengine=libaio
>> filename=/dev/md5
>> bs=4KB
>> rw=randwrite
>> [1MB write]
>> direct=1
>> numjobs=2
>> iodepth=64
>> ioengine=libaio
>> filename=/dev/md5
>> bs=1MB
>> rw=write
>> The fio test results are as follows:
>>               | STRIPE_SIZE(64KB) | STRIPE_SIZE(4KB)
>> --------------+-------------------+-----------------
>> 4KB randwrite |      15MB/s       |     100MB/s
>> 1MB write     |     1000MB/s      |     700MB/s
>> The results show that when the issued IOs are big (the 1MB writes here),
>> a 64KB STRIPE_SIZE gives much higher throughput. But for 4KB randwrite,
>> where the IOs issued to the device are smaller, a 4KB STRIPE_SIZE performs
>> better.
>
> The 4k rand write performance drops from 100MB/S to 15MB/S?! How about other
> io sizes? Say 16k, 64K and 256K etc, it would be more convincing if 64KB stripe
> has better performance than 4KB stripe overall.
>
Maybe I have not explained this clearly. The fio test results show that 4KB
STRIPE_SIZE does not always give better performance. If the IO sizes
applications request are mostly bigger than 4KB, like the 1MB writes in the
test, setting STRIPE_SIZE to a bigger value can give better performance.
So, we try to provide a configurable STRIPE_SIZE rather than fixing
STRIPE_SIZE at 4096.
Thanks,
Yufen
* Re: [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value
2020-05-29 11:49 ` Yufen Yu
@ 2020-05-29 12:22 ` Guoqing Jiang
2020-05-30 2:15 ` Yufen Yu
0 siblings, 1 reply; 31+ messages in thread
From: Guoqing Jiang @ 2020-05-29 12:22 UTC (permalink / raw)
To: Yufen Yu, song; +Cc: linux-raid, neilb, colyli, xni, houtao1
On 5/29/20 1:49 PM, Yufen Yu wrote:
>> The 4k rand write performance drops from 100MB/S to 15MB/S?! How about
>> other io sizes? Say 16k, 64K and 256K etc, it would be more convincing
>> if 64KB stripe has better performance than 4KB stripe overall.
>>
>
> Maybe I have not explained this clearly. The fio test results show that 4KB
> STRIPE_SIZE does not always give better performance. If the IO sizes
> applications request are mostly bigger than 4KB, like the 1MB writes in the
> test, setting STRIPE_SIZE to a bigger value can give better performance.
>
> So, we try to provide a configurable STRIPE_SIZE rather than fixing
> STRIPE_SIZE at 4096.
Which means if you set the stripe size to 64KB then you should guarantee the
IO size is always bigger than 1MB, right? Given that, I don't think it makes
a lot of sense.
Anyway, just my $0.02.
Thanks,
Guoqing
* Re: [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value
2020-05-29 12:22 ` Guoqing Jiang
@ 2020-05-30 2:15 ` Yufen Yu
2020-06-01 14:02 ` Guoqing Jiang
0 siblings, 1 reply; 31+ messages in thread
From: Yufen Yu @ 2020-05-30 2:15 UTC (permalink / raw)
To: Guoqing Jiang, song; +Cc: linux-raid, neilb, colyli, xni, houtao1
On 2020/5/29 20:22, Guoqing Jiang wrote:
> On 5/29/20 1:49 PM, Yufen Yu wrote:
>>> The 4k rand write performance drops from 100MB/S to 15MB/S?! How about other
>>> io sizes? Say 16k, 64K and 256K etc, it would be more convincing if 64KB stripe
>>> has better performance than 4KB stripe overall.
>>>
>>
>> Maybe I have not explained this clearly. The fio test results show that 4KB
>> STRIPE_SIZE does not always give better performance. If the IO sizes
>> applications request are mostly bigger than 4KB, like the 1MB writes in the
>> test, setting STRIPE_SIZE to a bigger value can give better performance.
>>
>> So, we try to provide a configurable STRIPE_SIZE rather than fixing
>> STRIPE_SIZE at 4096.
>
> Which means if you set the stripe size to 64KB then you should guarantee the
> IO size is always bigger than 1MB, right? Given that, I don't think it makes
> a lot of sense.
>
No, I think you misunderstood. This patchset just wants to optimize RAID5
performance for systems whose PAGE_SIZE is bigger than 4KB, such as 64KB on
ARM64. Without this patchset, STRIPE_SIZE is equal to 64KB, meaning each IO
issued to an array disk is at least 64KB, right? But filesystems usually
issue bios in units of 4KB, so sometimes only 4KB is required but 64KB is
actually read or written on disk. That wastes resources.
After this patchset, STRIPE_SIZE defaults to 4KB. For systems like x86, which
only support a 4KB PAGE_SIZE, it will not have any effect. But a 64KB-page
arm64 system can **normally** get better performance for filesystems on top
of raid5, as in the dbench test.
The fio test just shows that we can also configure STRIPE_SIZE to a value
bigger than the default 4KB on a 64KB-page ARM64 system when applications
mostly issue big IOs. That can give better performance by reducing IO
splitting in RAID5.
Thanks,
Yufen
> Anyway, just my $0.02.
>
> Thanks,
> Guoqing
* Re: [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value
2020-05-30 2:15 ` Yufen Yu
@ 2020-06-01 14:02 ` Guoqing Jiang
2020-06-02 6:59 ` Song Liu
0 siblings, 1 reply; 31+ messages in thread
From: Guoqing Jiang @ 2020-06-01 14:02 UTC (permalink / raw)
To: Yufen Yu, song; +Cc: linux-raid, neilb, colyli, xni, houtao1
On 5/30/20 4:15 AM, Yufen Yu wrote:
>
>
> On 2020/5/29 20:22, Guoqing Jiang wrote:
>> On 5/29/20 1:49 PM, Yufen Yu wrote:
>>>> The 4k rand write performance drops from 100MB/S to 15MB/S?! How about
>>>> other io sizes? Say 16k, 64K and 256K etc, it would be more convincing
>>>> if 64KB stripe has better performance than 4KB stripe overall.
>>>>
>>>
>>> Maybe I have not explained this clearly. The fio test results show that
>>> 4KB STRIPE_SIZE does not always give better performance. If the IO sizes
>>> applications request are mostly bigger than 4KB, like the 1MB writes in
>>> the test, setting STRIPE_SIZE to a bigger value can give better
>>> performance.
>>>
>>> So, we try to provide a configurable STRIPE_SIZE rather than fixing
>>> STRIPE_SIZE at 4096.
>>
>> Which means if you set the stripe size to 64KB then you should guarantee
>> the IO size is always bigger than 1MB, right? Given that, I don't think
>> it makes a lot of sense.
>>
>
> No, I think you misunderstood. This patchset just wants to optimize RAID5
> performance for systems whose PAGE_SIZE is bigger than 4KB, such as 64KB
> on ARM64. Without this patchset, STRIPE_SIZE is equal to 64KB, meaning
> each IO issued to an array disk is at least 64KB, right? But filesystems
> usually issue bios in units of 4KB, so sometimes only 4KB is required but
> 64KB is actually read or written on disk. That wastes resources.
Yes, it is hard for me to see that your way is better than just making the
stripe size equal to 4KB.
>
> After this patchset, STRIPE_SIZE defaults to 4KB. For systems like x86,
> which only support a 4KB PAGE_SIZE, it will not have any effect. But a
> 64KB-page arm64 system can **normally** get better performance for
> filesystems on top of raid5, as in the dbench test.
>
> The fio test just shows that we can also configure STRIPE_SIZE to a value
> bigger than the default 4KB on a 64KB-page ARM64 system when applications
> mostly issue big IOs. That can give better performance by reducing IO
> splitting in RAID5.
I do think the flexibility is not enough. If someone sets the stripe size to
64KB by any chance, people could complain that raid5 performance really sucks
when the IO is not big. And it is not realistic to make people rebuild the
module whenever the IO size changes, so it would be more helpful if the
stripe size could be changed dynamically without recompiling code.
Thanks,
Guoqing
* Re: [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value
2020-06-01 14:02 ` Guoqing Jiang
@ 2020-06-02 6:59 ` Song Liu
2020-06-04 13:17 ` Yufen Yu
0 siblings, 1 reply; 31+ messages in thread
From: Song Liu @ 2020-06-02 6:59 UTC (permalink / raw)
To: Guoqing Jiang; +Cc: Yufen Yu, linux-raid, NeilBrown, Coly Li, Xiao Ni, Hou Tao
On Mon, Jun 1, 2020 at 7:02 AM Guoqing Jiang
<guoqing.jiang@cloud.ionos.com> wrote:
>
> On 5/30/20 4:15 AM, Yufen Yu wrote:
> >
> >
> > On 2020/5/29 20:22, Guoqing Jiang wrote:
> >> On 5/29/20 1:49 PM, Yufen Yu wrote:
> >>>> The 4k rand write performance drops from 100MB/S to 15MB/S?! How about
> >>>> other io sizes? Say 16k, 64K and 256K etc, it would be more convincing
> >>>> if 64KB stripe has better performance than 4KB stripe overall.
> >>>>
> >>>
> >>> Maybe I have not explained this clearly. The fio test results show that
> >>> 4KB STRIPE_SIZE does not always give better performance. If the IO
> >>> sizes applications request are mostly bigger than 4KB, like the 1MB
> >>> writes in the test, setting STRIPE_SIZE to a bigger value can give
> >>> better performance.
> >>>
> >>> So, we try to provide a configurable STRIPE_SIZE rather than fixing
> >>> STRIPE_SIZE at 4096.
> >>
> >> Which means if you set the stripe size to 64KB then you should guarantee
> >> the IO size is always bigger than 1MB, right? Given that, I don't think
> >> it makes a lot of sense.
> >>
> >
> > No, I think you misunderstood. This patchset just wants to optimize RAID5
> > performance for systems whose PAGE_SIZE is bigger than 4KB, such as 64KB
> > on ARM64. Without this patchset, STRIPE_SIZE is equal to 64KB, meaning
> > each IO issued to an array disk is at least 64KB, right? But filesystems
> > usually issue bios in units of 4KB, so sometimes only 4KB is required but
> > 64KB is actually read or written on disk. That wastes resources.
>
> Yes, it is hard for me to see that your way is better than just making the
> stripe size equal to 4KB.
>
> >
> > After this patchset, STRIPE_SIZE defaults to 4KB. For systems like x86,
> > which only support a 4KB PAGE_SIZE, it will not have any effect. But a
> > 64KB-page arm64 system can **normally** get better performance for
> > filesystems on top of raid5, as in the dbench test.
> >
> > The fio test just shows that we can also configure STRIPE_SIZE to a value
> > bigger than the default 4KB on a 64KB-page ARM64 system when applications
> > mostly issue big IOs. That can give better performance by reducing IO
> > splitting in RAID5.
>
> I do think the flexibility is not enough. If someone sets the stripe size
> to 64KB by any chance, people could complain that raid5 performance really
> sucks when the IO is not big. And it is not realistic to make people
> rebuild the module whenever the IO size changes, so it would be more
> helpful if the stripe size could be changed dynamically without
> recompiling code.
Agreed that it is not ideal to recompile the kernel/module to change stripe
size. It is also possible that multiple arrays in one system have different
optimal stripe sizes. I guess it shouldn't be too complicated to make this
configurable per array?
Thanks,
Song
* Re: [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value
2020-06-02 6:59 ` Song Liu
@ 2020-06-04 13:17 ` Yufen Yu
0 siblings, 0 replies; 31+ messages in thread
From: Yufen Yu @ 2020-06-04 13:17 UTC (permalink / raw)
To: Song Liu, Guoqing Jiang; +Cc: linux-raid, NeilBrown, Coly Li, Xiao Ni, Hou Tao
On 2020/6/2 14:59, Song Liu wrote:
>> I do think the flexibility is not enough. If someone sets the stripe size
>> to 64KB by any chance, people could complain that raid5 performance really
>> sucks when the IO is not big. And it is not realistic to make people
>> rebuild the module whenever the IO size changes, so it would be more
>> helpful if the stripe size could be changed dynamically without
>> recompiling code.
>
> Agreed that it is not ideal to recompile the kernel/module to change stripe
> size. It is also possible that multiple arrays in one system have different
> optimal stripe sizes. I guess it shouldn't be too complicated to make this
> configurable per array?
>
Yeah. Thanks a lot for the suggestion. I admit that the current implementation
is really not good. To make it more flexible and give each raid array its own
stripe_size, I plan to move stripe_size into r5conf and make it dynamically
changeable through a sysfs interface.
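For illustration, a per-array sysfs knob along the lines of this plan might be used like the following. This is only a hypothetical sketch: the attribute name and path are my assumptions based on the description above, not part of this patchset.

```shell
# Hypothetical usage sketch -- the stripe_size attribute and its path are
# assumptions, not an interface provided by this patchset.

# Read the current stripe size of the md5 array:
cat /sys/block/md5/md/stripe_size

# Set a 16KB stripe size for md5 only, leaving other arrays unchanged:
echo 16384 > /sys/block/md5/md/stripe_size
```

A per-array attribute like this would let an administrator match the stripe size to each array's typical IO size without rebuilding the module.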
Thanks,
Yufen
end of thread, other threads:[~2020-06-04 13:17 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-27 13:19 [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Yufen Yu
2020-05-27 13:19 ` [PATCH v3 01/11] md/raid5: add CONFIG_MD_RAID456_STRIPE_SHIFT to set STRIPE_SIZE Yufen Yu
2020-05-27 13:54 ` Guoqing Jiang
2020-05-27 23:30 ` John Stoffel
2020-05-28 6:17 ` Yufen Yu
2020-05-27 15:16 ` Xiao Ni
2020-05-28 6:29 ` Yufen Yu
2020-05-27 20:21 ` kbuild test robot
2020-05-27 20:21 ` kbuild test robot
2020-05-28 14:23 ` Song Liu
2020-05-29 8:42 ` Yufen Yu
2020-05-27 13:19 ` [PATCH v3 02/11] md/raid5: add a member of r5pages for struct stripe_head Yufen Yu
2020-05-27 13:19 ` [PATCH v3 03/11] md/raid5: allocate and free pages of r5pages Yufen Yu
2020-05-27 13:19 ` [PATCH v3 04/11] md/raid5: set correct page offset for bi_io_vec in ops_run_io() Yufen Yu
2020-05-27 13:19 ` [PATCH v3 05/11] md/raid5: set correct page offset for async_copy_data() Yufen Yu
2020-05-27 13:19 ` [PATCH v3 06/11] md/raid5: add new xor function to support different page offset Yufen Yu
2020-05-27 13:19 ` [PATCH v3 07/11] md/raid5: add offset array in scribble buffer Yufen Yu
2020-05-27 13:19 ` [PATCH v3 08/11] md/raid5: compute xor with correct page offset Yufen Yu
2020-05-27 13:19 ` [PATCH v3 09/11] md/raid6: let syndrome computor support different " Yufen Yu
2020-05-27 13:19 ` [PATCH v3 10/11] md/raid6: compute syndrome with correct " Yufen Yu
2020-05-27 13:19 ` [PATCH v3 11/11] raid6test: adaptation with syndrome function Yufen Yu
2020-05-28 14:10 ` [PATCH v3 00/11] md/raid5: set STRIPE_SIZE as a configurable value Song Liu
2020-05-28 14:28 ` Song Liu
2020-05-29 9:32 ` Yufen Yu
2020-05-28 22:07 ` Guoqing Jiang
2020-05-29 11:49 ` Yufen Yu
2020-05-29 12:22 ` Guoqing Jiang
2020-05-30 2:15 ` Yufen Yu
2020-06-01 14:02 ` Guoqing Jiang
2020-06-02 6:59 ` Song Liu
2020-06-04 13:17 ` Yufen Yu