* dm-thinp feature request: skip allocation on writes of all zeroes
@ 2014-09-30 18:38 Eric Wheeler
2014-09-30 19:30 ` Zdenek Kabelac
` (2 more replies)
0 siblings, 3 replies; 46+ messages in thread
From: Eric Wheeler @ 2014-09-30 18:38 UTC (permalink / raw)
To: lvm-devel
Hello all,
We have been testing dm-thinp for the last couple of years, and it has
been stable for the most part.
I have always wondered, does LVM ignore writes with blocks of all zeroes
when the destination address is unallocated? From experience, it does not
appear that this is so (or I cannot find the config option). This would
reduce IOs, particularly when importing VM disk images from other
sources. Currently, we must import, allocate lots of zeroes, then fstrim
after the fact.
I have paged through the dm-thinp code, but I am not quite certain what
the best place to add this logic would be such that the IO is completed
and the allocation is not performed.
I would be happy to work on a patch, but I need a little bit of direction:
What would it take to add this as a feature option to the dmsetup table
string?
Where in the dm-thinp code would this best be addressed?
--
Eric Wheeler, President eWheeler, Inc. dba Global Linux Security
888-LINUX26 (888-546-8926) Fax: 503-716-3878 PO Box 25107
www.GlobalLinuxSecurity.pro Linux since 1996! Portland, OR 97298
^ permalink raw reply [flat|nested] 46+ messages in thread
* dm-thinp feature request: skip allocation on writes of all zeroes
2014-09-30 18:38 dm-thinp feature request: skip allocation on writes of all zeroes Eric Wheeler
@ 2014-09-30 19:30 ` Zdenek Kabelac
2014-09-30 20:13 ` Mike Snitzer
2014-12-04 7:05 ` [PATCH] dm-thinp: skip allocation on writes of all zeros to unallocated blocks Eric Wheeler
2 siblings, 0 replies; 46+ messages in thread
From: Zdenek Kabelac @ 2014-09-30 19:30 UTC (permalink / raw)
To: lvm-devel
On 30.9.2014 at 20:38, Eric Wheeler wrote:
> Hello all,
>
> We have been testing dm-thinp for the last couple years and it has been stable
> for the most part.
>
> I have always wondered, does LVM ignore writes with blocks of all zeroes when
> the destination address is unallocated? From experience, it does not appear
> that this is so (or I cannot find the config option). This would reduce IOs,
> particularly when importing VM disk images from other sources. Currently, we
> must import, allocate lots of zeroes, then fstrim after the fact.
If you use a thin pool with zeroing of provisioned chunks enabled,
then the thin-volume device always reads back as 'zeroed' in all parts,
so you don't need to zero anything in such a block device.
Now if you skip writing zeroes yourself (as e.g. 'cp -a' does when copying
files), you get your skipped zero writes.
Note - if your original source device were already an LV, it would then
probably be more efficient to either use that LV as an external origin, or
to implement support for merging an LV into a thin pool.
Regards
Zdenek
^ permalink raw reply [flat|nested] 46+ messages in thread
* dm-thinp feature request: skip allocation on writes of all zeroes
2014-09-30 18:38 dm-thinp feature request: skip allocation on writes of all zeroes Eric Wheeler
2014-09-30 19:30 ` Zdenek Kabelac
@ 2014-09-30 20:13 ` Mike Snitzer
2014-09-30 22:38 ` Eric Wheeler
2014-12-04 7:05 ` [PATCH] dm-thinp: skip allocation on writes of all zeros to unallocated blocks Eric Wheeler
2 siblings, 1 reply; 46+ messages in thread
From: Mike Snitzer @ 2014-09-30 20:13 UTC (permalink / raw)
To: lvm-devel
On Tue, Sep 30 2014 at 2:38pm -0400,
Eric Wheeler <lvm-dev@lists.ewheeler.net> wrote:
> Hello all,
>
> We have been testing dm-thinp for the last couple years and it has
> been stable for the most part.
>
> I have always wondered, does LVM ignore writes with blocks of all
> zeroes when the destination address is unallocated? From experience,
> it does not appear that this is so (or I cannot find the config
> option). This would reduce IOs, particularly when importing VM disk
> images from other sources. Currently, we must import, allocate lots
> of zeroes, then fstrim after the fact.
We don't inspect the data being written to infer WRITE SAME-like action.
And in your example fstrim wouldn't do anything for you. The zeroed
data in the image is still valid, isn't it? fstrim is only useful to
reclaim blocks that are no longer used.
> I have paged through the dm-thinp code, but I am not quite certain
> what the best place to add this logic would be such that the IO is
> completed and the allocation is not performed.
>
> I would be happy to work on a patch, but I need a little bit of direction:
>
> What would it take to add this as a feature option to the dmsetup
> table string?
>
> Where in the dm-thinp code would this best be addressed?
Honestly, you'd be better served to use deduplication. A dm-dedup
target has been in development (and I'm reviewing it), see:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=dm-dedup
It is very much in development, but it gives you an idea. You could
pair a dedup target with dm-thinp in a stacked configuration to achieve
your goals.
^ permalink raw reply [flat|nested] 46+ messages in thread
* dm-thinp feature request: skip allocation on writes of all zeroes
2014-09-30 20:13 ` Mike Snitzer
@ 2014-09-30 22:38 ` Eric Wheeler
0 siblings, 0 replies; 46+ messages in thread
From: Eric Wheeler @ 2014-09-30 22:38 UTC (permalink / raw)
To: lvm-devel
>> I have always wondered, does LVM ignore writes with blocks of all
>> zeroes when the destination address is unallocated? From experience,
>> it does not appear that this is so (or I cannot find the config
>> option). This would reduce IOs, particularly when importing VM disk
>> images from other sources. Currently, we must import, allocate lots
>> of zeroes, then fstrim after the fact.
>
> And in your example fstrim wouldn't do anything for you. [...]
> fstrim is only useful to reclaim blocks that are no longer used.
For this particular example, unused data is mostly zeroes and thus fstrim
frees those unused chunks when dm-thinp receives the discard. I realize
that fstrim, strictly speaking, is not the solution; it was only meant
to highlight my example.
> The zeroed data in the image is still valid, isn't it?
Yes, and it would continue to be valid because dm-thinp returns zeroes in
read requests to unallocated chunks. A quick check for all zeroes in the
bio before writing to an unallocated chunk would save a good amount of
overhead in our application, and probably others' as well. (Of course if
the chunk is allocated, the write must proceed as normal.)
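As a plain-C illustration of the kind of quick check described here (a userspace sketch over a flat buffer; the helper name is mine, and an in-kernel version would iterate over bio segments instead):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace analogue of the proposed check:
 * return true iff every byte of the buffer is zero. */
static bool buf_all_zeros(const char *buf, size_t len)
{
	/* If buf[0] == 0 and buf[0..len-2] equals buf[1..len-1],
	 * then by induction every byte is zero. */
	return len == 0 ||
	       (buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0);
}
```

A write path could call such a helper only on the unallocated-chunk branch, completing the bio without provisioning when it returns true.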
> We don't inspect the data being written to infer WRITE SAME-like action.
I would like to implement data inspection in this way and make it
available as an option to the dmsetup table string.
Can you recommend where I should look in the dm-thinp code to check the
bio for all zeroes and branch accordingly?
(You are correct, deduplication would be an even better solution, but for
the moment I believe that WRITE SAME-like action could be implemented with
a relatively small patch if I can learn where to insert the code.)
-Eric
^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH] dm-thinp: skip allocation on writes of all zeros to unallocated blocks
2014-09-30 18:38 dm-thinp feature request: skip allocation on writes of all zeroes Eric Wheeler
2014-09-30 19:30 ` Zdenek Kabelac
2014-09-30 20:13 ` Mike Snitzer
@ 2014-12-04 7:05 ` Eric Wheeler
2014-12-04 7:25 ` Eric Wheeler
2 siblings, 1 reply; 46+ messages in thread
From: Eric Wheeler @ 2014-12-04 7:05 UTC (permalink / raw)
To: lvm-devel
This patch skips all-zero writes to unallocated blocks of dm-thinp
volumes.
Unallocated zero-writes are 70x faster and never allocate space in this test:
# dd if=/dev/zero of=/dev/test/test1 bs=1M count=1024
1073741824 bytes (1.1 GB) copied, 0.794343 s, 1.4 GB/s
Without the patch, zero-writes allocate space and hit the disk:
# dd if=/dev/zero of=/dev/test/test1 bs=1M count=1024
1073741824 bytes (1.1 GB) copied, 53.8064 s, 20.0 MB/s
For the test below, notice the allocation difference for thin volumes
test1 and test2 (after dd if=test1 of=test2), even though they have the
same md5sum:
LV VG Attr LSize Pool Origin Data%
test1 test Vwi-a-tz-- 4.00g thinp 22.04
test2 test Vwi-a-tz-- 4.00g thinp 18.33
An additional 3.71% of space was saved by the patch, and so were
the ~150MB of (possibly random) IOs that would have hit disk, not to
mention reads that now bypass the disk since they are unallocated.
We also save the metadata overhead of ~2400 allocations when calling
provision_block().
# lvcreate -T test/thinp -L 5G
# lvcreate -T test/thinp -V 4G -n test1
# lvcreate -T test/thinp -V 4G -n test2
Simple ext4+kernel tree extract test:
First prepare two dm-thinp volumes test1 and test2 of equal size. Next,
mkfs.ext4 /dev/test/test1 without the patch, then mount it, extract the
3.17.4 kernel source tree onto the test1 filesystem, and unmount.
Next, install patched dm_thin_pool.ko, then dd test1 over test2 and
verify checksums:
# dd if=/dev/test/test1 of=/dev/test/test2 bs=1M
# md5sum /dev/test/test?
b210f032a6465178103317f3c40ab59f /dev/test/test1
b210f032a6465178103317f3c40ab59f /dev/test/test2
Yes, they match!
Signed-off-by: Eric Wheeler <lvm-dev@lists.ewheeler.net>
---
drivers/md/dm-thin.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 45 insertions(+), 0 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index fc9c848..71dd545 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1230,6 +1230,42 @@ static void process_shared_bio(struct thin_c *tc, struct bio *bio,
}
}
+/* return true if bio data contains all 0x00's */
+bool bio_all_zeros(struct bio *bio)
+{
+ unsigned long flags;
+ struct bio_vec bv;
+ struct bvec_iter iter;
+
+ char *data;
+ uint64_t *p;
+ int i, count;
+
+ bool allzeros = true;
+
+ bio_for_each_segment(bv, bio, iter) {
+ data = bvec_kmap_irq(&bv, &flags);
+
+ p = (uint64_t*)data;
+ count = bv.bv_len / sizeof(uint64_t);
+
+ for (i = 0; i < count; i++) {
+ if (*p) {
+ allzeros = false;
+ break;
+ }
+ p++;
+ }
+
+ bvec_kunmap_irq(data, &flags);
+
+ if (likely(!allzeros))
+ break;
+ }
+
+ return allzeros;
+}
+
static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block,
struct dm_bio_prison_cell *cell)
{
@@ -1258,6 +1294,15 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
return;
}
+ /*
+ * Skip writes of all zeroes
+ */
+ if (bio_data_dir(bio) == WRITE && unlikely( bio_all_zeros(bio) )) {
+ cell_defer_no_holder(tc, cell);
+ bio_endio(bio, 0);
+ return;
+ }
+
r = alloc_data_block(tc, &data_block);
switch (r) {
case 0:
--
1.7.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH] dm-thinp: skip allocation on writes of all zeros to unallocated blocks
2014-12-04 7:05 ` [PATCH] dm-thinp: skip allocation on writes of all zeros to unallocated blocks Eric Wheeler
@ 2014-12-04 7:25 ` Eric Wheeler
2014-12-04 15:33 ` Mike Snitzer
0 siblings, 1 reply; 46+ messages in thread
From: Eric Wheeler @ 2014-12-04 7:25 UTC (permalink / raw)
To: lvm-devel
This patch skips all-zero writes to unallocated blocks of dm-thinp volumes.
Unallocated zero-writes are 70x faster and never allocate space in this test:
# dd if=/dev/zero of=/dev/test/test1 bs=1M count=1024
1073741824 bytes (1.1 GB) copied, 0.794343 s, 1.4 GB/s
Without the patch, zero-writes allocate space and hit the disk:
# dd if=/dev/zero of=/dev/test/test1 bs=1M count=1024
1073741824 bytes (1.1 GB) copied, 53.8064 s, 20.0 MB/s
For the test below, notice the allocation difference for thin volumes
test1 and test2 (after dd if=test1 of=test2), even though they have the
same md5sum:
LV VG Attr LSize Pool Origin Data%
test1 test Vwi-a-tz-- 4.00g thinp 22.04
test2 test Vwi-a-tz-- 4.00g thinp 18.33
An additional 3.71% of space was saved by the patch, and so were
the ~150MB of (possibly random) IOs that would have hit disk, not to
mention reads that now bypass the disk since they are unallocated.
We also save the metadata overhead of ~2400 allocations when calling
provision_block().
# lvcreate -T test/thinp -L 5G
# lvcreate -T test/thinp -V 4G -n test1
# lvcreate -T test/thinp -V 4G -n test2
Simple ext4+kernel tree extract test:
First prepare two dm-thinp volumes test1 and test2 of equal size. Next,
mkfs.ext4 /dev/test/test1 without the patch, then mount it, extract the
3.17.4 kernel source tree onto the test1 filesystem, and unmount.
Next, install patched dm_thin_pool.ko, then dd test1 over test2 and
verify checksums:
# dd if=/dev/test/test1 of=/dev/test/test2 bs=1M
# md5sum /dev/test/test?
b210f032a6465178103317f3c40ab59f /dev/test/test1
b210f032a6465178103317f3c40ab59f /dev/test/test2
Yes, they match!
Signed-off-by: Eric Wheeler <lvm-dev@lists.ewheeler.net>
---
Resending the patch as it was malformed on the first try.
Eric Wheeler, President eWheeler, Inc. dba Global Linux Security
888-LINUX26 (888-546-8926) Fax: 503-716-3878 PO Box 25107
www.GlobalLinuxSecurity.pro Linux since 1996! Portland, OR 97298
drivers/md/dm-thin.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 45 insertions(+), 0 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index fc9c848..71dd545 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1230,6 +1230,42 @@ static void process_shared_bio(struct thin_c *tc, struct bio *bio,
}
}
+/* return true if bio data contains all 0x00's */
+bool bio_all_zeros(struct bio *bio)
+{
+ unsigned long flags;
+ struct bio_vec bv;
+ struct bvec_iter iter;
+
+ char *data;
+ uint64_t *p;
+ int i, count;
+
+ bool allzeros = true;
+
+ bio_for_each_segment(bv, bio, iter) {
+ data = bvec_kmap_irq(&bv, &flags);
+
+ p = (uint64_t*)data;
+ count = bv.bv_len / sizeof(uint64_t);
+
+ for (i = 0; i < count; i++) {
+ if (*p) {
+ allzeros = false;
+ break;
+ }
+ p++;
+ }
+
+ bvec_kunmap_irq(data, &flags);
+
+ if (likely(!allzeros))
+ break;
+ }
+
+ return allzeros;
+}
+
static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block,
struct dm_bio_prison_cell *cell)
{
@@ -1258,6 +1294,15 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
return;
}
+ /*
+ * Skip writes of all zeroes
+ */
+ if (bio_data_dir(bio) == WRITE && unlikely( bio_all_zeros(bio) )) {
+ cell_defer_no_holder(tc, cell);
+ bio_endio(bio, 0);
+ return;
+ }
+
r = alloc_data_block(tc, &data_block);
switch (r) {
case 0:
--
1.7.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH] dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-04 7:25 ` Eric Wheeler
@ 2014-12-04 15:33 ` Mike Snitzer
0 siblings, 0 replies; 46+ messages in thread
From: Mike Snitzer @ 2014-12-04 15:33 UTC (permalink / raw)
To: Eric Wheeler; +Cc: axboe, dm-devel, ejt, LVM2 development
Hi,
In the future please send DM changes to dm-devel@redhat.com
Comments inlined below, and I've provided a revised patch at the end.
On Thu, Dec 04 2014 at 2:25am -0500,
Eric Wheeler <ewheeler@ewheeler.net> wrote:
> This patch skips all-zero writes to unallocated blocks of dm-thinp volumes.
>
> Unallocated zero-writes are 70x faster and never allocate space in this test:
> # dd if=/dev/zero of=/dev/test/test1 bs=1M count=1024
> 1073741824 bytes (1.1 GB) copied, 0.794343 s, 1.4 GB/s
>
> Without the patch, zero-writes allocate space and hit the disk:
> # dd if=/dev/zero of=/dev/test/test1 bs=1M count=1024
> 1073741824 bytes (1.1 GB) copied, 53.8064 s, 20.0 MB/s
>
> For the test below, notice the allocation difference for thin volumes
> test1 and test2 (after dd if=test1 of=test2), even though they have the
> same md5sum:
> LV VG Attr LSize Pool Origin Data%
> test1 test Vwi-a-tz-- 4.00g thinp 22.04
> test2 test Vwi-a-tz-- 4.00g thinp 18.33
>
> An additional 3.71% of space was saved by the patch, and so were
> the ~150MB of (possibly random) IOs that would have hit disk, not to
> mention reads that now bypass the disk since they are unallocated.
> We also save the metadata overhead of ~2400 allocations when calling
> provision_block().
>
> # lvcreate -T test/thinp -L 5G
> # lvcreate -T test/thinp -V 4G -n test1
> # lvcreate -T test/thinp -V 4G -n test2
>
> Simple ext4+kernel tree extract test:
>
> First prepare two dm-thinp volumes test1 and test2 of equal size. First
> mkfs.ext4 /dev/test/test1 without the patch and then mount and extract
> 3.17.4's source tree onto the test1 filesystem, and unmount
>
> Next, install patched dm_thin_pool.ko, then dd test1 over test2 and
> verify checksums:
> # dd if=/dev/test/test1 of=/dev/test/test2 bs=1M
> # md5sum /dev/test/test?
> b210f032a6465178103317f3c40ab59f /dev/test/test1
> b210f032a6465178103317f3c40ab59f /dev/test/test2
> Yes, they match!
>
>
> Signed-off-by: Eric Wheeler <lvm-dev@lists.ewheeler.net>
> ---
> Resending the patch as it was malformed on the first try.
The resend was also malformed, but don't worry about resending for this
patch.
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index fc9c848..71dd545 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -1230,6 +1230,42 @@ static void process_shared_bio(struct thin_c *tc, struct bio *bio,
> }
> }
A helper like this really belongs in block/bio.c:
> +/* return true if bio data contains all 0x00's */
> +bool bio_all_zeros(struct bio *bio)
> +{
> + unsigned long flags;
> + struct bio_vec bv;
> + struct bvec_iter iter;
> +
> + char *data;
> + uint64_t *p;
> + int i, count;
> +
> + bool allzeros = true;
> +
> + bio_for_each_segment(bv, bio, iter) {
> + data = bvec_kmap_irq(&bv, &flags);
> +
> + p = (uint64_t*)data;
> + count = bv.bv_len / sizeof(uint64_t);
Addressing a bio's contents in terms of uint64_t has the potential to
access beyond bv.bv_len (byte addressing vs 64bit addressing). I can
see you were just looking to be more efficient about checking the bio's
contents, but I'm not convinced it would always be safe.
I'm open to something more efficient than what I implemented below, but
it is the most straightforward code I thought of.
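One hedged sketch of a faster variant that stays safe (shown as plain userspace C over a flat buffer, not kernel code; the function name is illustrative): scan aligned 64-bit words for the bulk of the segment, but fall back to byte comparisons for the unaligned head and the partial tail, so no bytes are ever skipped:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Zero-check using 64-bit loads for the aligned middle of the buffer,
 * with byte-wise checks for the unaligned head and leftover tail,
 * avoiding the truncation problem of a words-only loop. */
static bool zeros_hybrid(const unsigned char *p, size_t len)
{
	/* head: advance byte-wise until p is 8-byte aligned */
	while (len && ((uintptr_t)p & (sizeof(uint64_t) - 1))) {
		if (*p)
			return false;
		p++;
		len--;
	}
	/* middle: full 64-bit words */
	while (len >= sizeof(uint64_t)) {
		if (*(const uint64_t *)p)
			return false;
		p += sizeof(uint64_t);
		len -= sizeof(uint64_t);
	}
	/* tail: leftover bytes a words-only loop would have missed */
	while (len--) {
		if (*p++)
			return false;
	}
	return true;
}
```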
> +
> + for (i = 0; i < count; i++) {
> + if (*p) {
> + allzeros = false;
> + break;
> + }
> + p++;
> + }
> +
> + bvec_kunmap_irq(data, &flags);
> +
> + if (likely(!allzeros))
> + break;
> + }
> +
> + return allzeros;
> +}
> +
> static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block,
> struct dm_bio_prison_cell *cell)
> {
> @@ -1258,6 +1294,15 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
> return;
> }
>
> + /*
> + * Skip writes of all zeroes
> + */
> + if (bio_data_dir(bio) == WRITE && unlikely( bio_all_zeros(bio) )) {
> + cell_defer_no_holder(tc, cell);
> + bio_endio(bio, 0);
> + return;
> + }
> +
No need to check for bio_data_dir(bio) == WRITE (at this point in
provision_block() we already know it is a WRITE).
Here is a revised patch that is more like I'd expect to land upstream.
Jens, are you OK with us adding bio_is_zero_filled to block/bio.c? If so,
should I split it out as a separate patch for you to pick up or just
carry it as part of the patch that lands in linux-dm.git?
From: Mike Snitzer <snitzer@redhat.com>
Date: Thu, 4 Dec 2014 10:18:32 -0500
Subject: [PATCH] dm thin: optimize away writing all zeroes to unprovisioned blocks
Introduce bio_is_zero_filled() and use it to optimize away writing all
zeroes to unprovisioned blocks. Subsequent reads to the associated
unprovisioned blocks will be zero filled.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Eric Wheeler <ewheeler@ewheeler.net>
Cc: Jens Axboe <axboe@kernel.dk>
---
block/bio.c | 25 +++++++++++++++++++++++++
drivers/md/dm-thin.c | 10 ++++++++++
include/linux/bio.h | 1 +
3 files changed, 36 insertions(+), 0 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 3e6e198..7d07593 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -515,6 +515,31 @@ void zero_fill_bio(struct bio *bio)
}
EXPORT_SYMBOL(zero_fill_bio);
+bool bio_is_zero_filled(struct bio *bio)
+{
+ unsigned i;
+ unsigned long flags;
+ struct bio_vec bv;
+ struct bvec_iter iter;
+
+ bio_for_each_segment(bv, bio, iter) {
+ char *data = bvec_kmap_irq(&bv, &flags);
+ char *p = data;
+
+ for (i = 0; i < bv.bv_len; i++) {
+ if (*p) {
+ bvec_kunmap_irq(data, &flags);
+ return false;
+ }
+ p++;
+ }
+ bvec_kunmap_irq(data, &flags);
+ }
+
+ return true;
+}
+EXPORT_SYMBOL(bio_is_zero_filled);
+
/**
* bio_put - release a reference to a bio
* @bio: bio to release reference to
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 8735543..13aff8c 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1501,6 +1501,16 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
return;
}
+ /*
+ * Optimize away writes of all zeroes, subsequent reads to
+ * associated unprovisioned blocks will be zero filled.
+ */
+ if (unlikely(bio_is_zero_filled(bio))) {
+ cell_defer_no_holder(tc, cell);
+ bio_endio(bio, 0);
+ return;
+ }
+
r = alloc_data_block(tc, &data_block);
switch (r) {
case 0:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7347f48..602094b 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -465,6 +465,7 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,
int, int, gfp_t);
extern int bio_uncopy_user(struct bio *);
void zero_fill_bio(struct bio *bio);
+bool bio_is_zero_filled(struct bio *bio);
extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
extern unsigned int bvec_nr_vecs(unsigned short idx);
--
1.7.4.4
--
lvm-devel mailing list
lvm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/lvm-devel
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-04 15:33 ` Mike Snitzer
@ 2014-12-04 15:43 ` Mike Snitzer
0 siblings, 0 replies; 46+ messages in thread
From: Mike Snitzer @ 2014-12-04 15:43 UTC (permalink / raw)
To: Eric Wheeler; +Cc: axboe, dm-devel, ejt, LVM2 development
On Thu, Dec 04 2014 at 10:33am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:
> > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > index fc9c848..71dd545 100644
> > --- a/drivers/md/dm-thin.c
> > +++ b/drivers/md/dm-thin.c
> > @@ -1230,6 +1230,42 @@ static void process_shared_bio(struct thin_c *tc, struct bio *bio,
> > }
> > }
>
> A helper like this really belongs in block/bio.c:
>
> > +/* return true if bio data contains all 0x00's */
> > +bool bio_all_zeros(struct bio *bio)
> > +{
> > + unsigned long flags;
> > + struct bio_vec bv;
> > + struct bvec_iter iter;
> > +
> > + char *data;
> > + uint64_t *p;
> > + int i, count;
> > +
> > + bool allzeros = true;
> > +
> > + bio_for_each_segment(bv, bio, iter) {
> > + data = bvec_kmap_irq(&bv, &flags);
> > +
> > + p = (uint64_t*)data;
> > + count = bv.bv_len / sizeof(uint64_t);
>
> Addressing a bio's contents in terms of uint64_t has the potential to
> access beyond bv.bv_len (byte addressing vs 64bit addressing). I can
> see you were just looking to be more efficient about checking the bios'
> contents but I'm not convinced it would always be safe.
Actually, given your: count = bv.bv_len / sizeof(uint64_t);
My immediate concern was that it would truncate any partial contents of
the bio_vec beyond the last uint64_t boundary, leading to a possible
false positive return from the function (because some contents weren't
checked).
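A small demonstration of the false positive described here (a hypothetical userspace reproduction, not kernel code): with a 12-byte segment, count = 12 / 8 = 1, so a words-only loop inspects only the first 8 bytes and misses a nonzero byte at offset 9:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Words-only scan, mirroring the original patch's loop:
 * the integer division silently ignores the partial tail. */
static bool zeros_words_only(const char *data, size_t len)
{
	const uint64_t *p = (const uint64_t *)data;
	size_t count = len / sizeof(uint64_t);	/* 12 / 8 == 1 */
	size_t i;

	for (i = 0; i < count; i++)
		if (p[i])
			return false;
	return true;
}

/* Byte-wise scan, as in the revised patch: checks every byte. */
static bool zeros_bytes(const char *data, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (data[i])
			return false;
	return true;
}
```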
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-04 15:33 ` Mike Snitzer
@ 2014-12-05 14:47 ` Mike Snitzer
-1 siblings, 0 replies; 46+ messages in thread
From: Mike Snitzer @ 2014-12-05 14:47 UTC (permalink / raw)
To: Eric Wheeler; +Cc: axboe, dm-devel, ejt, LVM2 development
On Thu, Dec 04 2014 at 10:33am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:
> From: Mike Snitzer <snitzer@redhat.com>
> Date: Thu, 4 Dec 2014 10:18:32 -0500
> Subject: [PATCH] dm thin: optimize away writing all zeroes to unprovisioned blocks
>
> Introduce bio_is_zero_filled() and use it to optimize away writing all
> zeroes to unprovisioned blocks. Subsequent reads to the associated
> unprovisioned blocks will be zero filled.
>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> Cc: Eric Wheeler <ewheeler@ewheeler.net>
> Cc: Jens Axboe <axboe@kernel.dk>
In testing this patch it is now quite clear that this change seriously
impacts test coverage in the device-mapper-test-suite because dmts'
wipe_device() uses /dev/zero as the ifile for dd to write to the disk.
As such, with this patch all tests expecting to see provisioned blocks
as a side-effect of wipe_device now fail (e.g. DiscardQuickTests).
So this change won't go upstream until full test coverage can be
restored in dmts. I'll see what I can come up with but it is low
priority.
BTW, this also makes me wonder if this change will be extremely
unintuitive to existing or future users of DM thinp. So much so that it
might be best to require a new thin-pool feature flag to enable this
optimization.
Mike
* Re: [PATCH] dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-04 15:33 ` Mike Snitzer
@ 2014-12-05 17:27 ` Jens Axboe
-1 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2014-12-05 17:27 UTC (permalink / raw)
To: Mike Snitzer, Eric Wheeler; +Cc: dm-devel, ejt, LVM2 development
On 12/04/2014 08:33 AM, Mike Snitzer wrote:
>
> Here is a revised patch that is more like I'd expect to land upstream.
> Jens are you OK with us adding bio_is_zero_filled to block/bio.c? If so
> should I split it out as a separate patch for you to pick up or just
> carry it as part of the patch that lands in linux-dm.git?
Yeah I'm fine with adding that helper. I do wonder what the performance
impact is on this for dm. Have you tried a (worst case) test of writing
blocks that are zero filled, but with the last byte not being a zero?
--
Jens Axboe
* Re: dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-05 17:27 ` Jens Axboe
@ 2014-12-05 18:33 ` Mike Snitzer
-1 siblings, 0 replies; 46+ messages in thread
From: Mike Snitzer @ 2014-12-05 18:33 UTC (permalink / raw)
To: Jens Axboe; +Cc: Eric Wheeler, dm-devel, ejt, LVM2 development
On Fri, Dec 05 2014 at 12:27pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:
> On 12/04/2014 08:33 AM, Mike Snitzer wrote:
> >
> >Here is a revised patch that is more like I'd expect to land upstream.
> >Jens are you OK with us adding bio_is_zero_filled to block/bio.c? If so
> >should I split it out as a separate patch for you to pick up or just
> >carry it as part of the patch that lands in linux-dm.git?
>
> Yeah I'm fine with adding that helper.
Good to know, thanks.
> I do wonder what the performance impact is on this for dm. Have you
> tried a (worst case) test of writing blocks that are zero filled, but
> with the last byte not being a zero?
I haven't tested that yet, but I share your concern. Does fio offer the
ability to slap down a specific pattern like this? If not is there a
logical place to extend fio to enable this?
* Re: dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-04 15:43 ` Mike Snitzer
@ 2014-12-06 22:33 ` Eric Wheeler
-1 siblings, 0 replies; 46+ messages in thread
From: Eric Wheeler @ 2014-12-06 22:33 UTC (permalink / raw)
To: LVM2 development; +Cc: axboe, dm-devel, ejt
> Actually, given your: count = bv.bv_len / sizeof(uint64_t);
>
> My immediate concern was it would've truncated any partial contents of
> the bio_vec beyond the last uint64_t boundary. Leading to possible
> false positive return from the function (because some contents weren't
> checked).
Understood. I had always thought bio data was block-size aligned.
Out of curiosity, do you have examples of when this is not so?
* Re: dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-05 14:47 ` Mike Snitzer
@ 2014-12-06 22:36 ` Eric Wheeler
-1 siblings, 0 replies; 46+ messages in thread
From: Eric Wheeler @ 2014-12-06 22:36 UTC (permalink / raw)
To: LVM2 development; +Cc: axboe, dm-devel, ejt
On Fri, 5 Dec 2014, Mike Snitzer wrote:
> On Thu, Dec 04 2014 at 10:33am -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> In testing this patch it is now quite clear that this change seriously
> impacts test coverage in the device-mapper-test-suite because dmts'
> wipe_device() uses /dev/zero as the ifile for dd to write to the disk.
> As such, with this patch all tests expecting to see provisioned blocks
> as a side-effect of wipe_device now fail (e.g. DiscardQuickTests).
>
> So this change won't go upstream until full test coverage can be
> restored in dmts. I'll see what I can come up with but it is low
> priority.
Can dmts use /dev/urandom?
> BTW, this also makes me wonder if this change will be extremely
> unintuitive to existing or future users of DM thinp. So much so that it
> might be best to require a new thin-pool feature flag to enable this
> optimization.
Good idea; some may wish to turn it off. IMO, leaving it on by default
would likely work for most use cases.
-Eric
> Mike
* Re: dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-05 18:33 ` Mike Snitzer
@ 2014-12-06 22:40 ` Eric Wheeler
-1 siblings, 0 replies; 46+ messages in thread
From: Eric Wheeler @ 2014-12-06 22:40 UTC (permalink / raw)
To: LVM2 development; +Cc: Jens Axboe, dm-devel, ejt
On Fri, 5 Dec 2014, Mike Snitzer wrote:
>> I do wonder what the performance impact is on this for dm. Have you
>> tried a (worst case) test of writing blocks that are zero filled, but
>> with the last byte not being a zero?
The additional overhead of worst-case should be (nearly) equal to the
simplest test case of dd if=/dev/zero of=/dev/thinp/vol. In my testing
that was 1.4GB/s within KVM on an Intel Xeon(R) CPU E3-1230 V2 @ 3.30GHz.
I could see where really fast storage that can sustain 1.4GB/s might notice
a performance regression, but for most use cases that is quite fast.
-Eric
> I haven't tested that yet, but I share your concern. Does fio offer the
> ability to slap down a specific pattern like this? If not is there a
> logical place to extend fio to enable this?
* Re: dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-05 18:33 ` Mike Snitzer
@ 2014-12-07 1:36 ` Jens Axboe
-1 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2014-12-07 1:36 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Eric Wheeler, dm-devel, ejt, LVM2 development
On 12/05/2014 11:33 AM, Mike Snitzer wrote:
> On Fri, Dec 05 2014 at 12:27pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>
>> On 12/04/2014 08:33 AM, Mike Snitzer wrote:
>>>
>>> Here is a revised patch that is more like I'd expect to land upstream.
>>> Jens are you OK with us adding bio_is_zero_filled to block/bio.c? If so
>>> should I split it out as a separate patch for you to pick up or just
>>> carry it as part of the patch that lands in linux-dm.git?
>>
>> Yeah I'm fine with adding that helper.
>
> Good to know, thanks.
>
>> I do wonder what the performance impact is on this for dm. Have you
>> tried a (worst case) test of writing blocks that are zero filled, but
>> with the last byte not being a zero?
>
> I haven't tested that yet, but I share your concern. Does fio offer the
> ability to slap down a specific pattern like this? If not is there a
> logical place to extend fio to enable this?
You can do an explicit pattern for IO buffers, but it's not really
geared towards this, it's more for repeat patterns. But maybe something
could be added, let me think about the best way to do that...
--
Jens Axboe
* Re: [lvm-devel] dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-06 22:40 ` Eric Wheeler
@ 2014-12-07 1:41 ` Jens Axboe
-1 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2014-12-07 1:41 UTC (permalink / raw)
To: Eric Wheeler, LVM2 development; +Cc: dm-devel, ejt
On 12/06/2014 03:40 PM, Eric Wheeler wrote:
> On Fri, 5 Dec 2014, Mike Snitzer wrote:
>>> I do wonder what the performance impact is on this for dm. Have you
>>> tried a (worst case) test of writing blocks that are zero filled, but
>>> with the last byte not being a zero?
>
> The additional overhead of worst-case should be (nearly) equal to the
> simplest test case of dd if=/dev/zero of=/dev/thinp/vol. In my testing
> that was 1.4GB/s within KVM on an Intel Xeon(R) CPU E3-1230 V2 @ 3.30GHz.
That seems way too slow for checking if it's zero or not... Memory
bandwidth should be way higher than that. The line above, was that what
you ran? How does it look with bs=4k or higher?
> I could see where really fast storage that can obtain 1.4gb/s might
> notice a performance regression---but for most use cases that is quite
> fast.
Depends on your view point, I think it's pretty slow. Plenty of devices
out there that are faster than that. Quick example, the laptop I am
typing this email from:
read : io=12233MB, bw=1432.7MB/s, iops=22922, runt= 8539msec
Writes are around 1GB/sec, enough that I bet the slowdown in checking
for zeroes will be noticeable. In the pci-e flash space, more than
double your current /dev/zero benchmark is common for current devices.
--
Jens Axboe
* Re: dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-07 1:41 ` Jens Axboe
@ 2014-12-07 6:30 ` Eric Wheeler
-1 siblings, 0 replies; 46+ messages in thread
From: Eric Wheeler @ 2014-12-07 6:30 UTC (permalink / raw)
To: Jens Axboe; +Cc: dm-devel, ejt, LVM2 development
On Sat, 6 Dec 2014, Jens Axboe wrote:
> On 12/06/2014 03:40 PM, Eric Wheeler wrote:
>> On Fri, 5 Dec 2014, Mike Snitzer wrote:
>>>> I do wonder what the performance impact is on this for dm. Have you
>>>> tried a (worst case) test of writing blocks that are zero filled, but
>>>> with the last byte not being a zero?
>>
>> The additional overhead of worst-case should be (nearly) equal to the
>> simplest test case of dd if=/dev/zero of=/dev/thinp/vol. In my testing
>> that was 1.4GB/s within KVM on an Intel Xeon(R) CPU E3-1230 V2 @ 3.30GHz.
>
> That seems way too slow for checking if it's zero or not... Memory bandwidth
> should be way higher than that. The line above, was that what you ran? How
> does it look with bs=4k or higher?
In userspace I can get ~12GB/s, so I think the algorithm is sound.
dd might not be the right tool for this.
> read : io=12233MB, bw=1432.7MB/s, iops=22922, runt= 8539msec
Can you suggest the right fio command line to test sequential writes of all
zeros? I tried --zero_buffers but couldn't get it to write zeros; writes
kept going to disk.
Also, attached is the patch that supports uintptr_t word-sized zero checks.
It steps byte-aligned at the beginning and end in case either end is not
word aligned.
I tried a few different algorithms:
  - Mike's trivial byte-by-byte zero check
  - memcmp(ZERO_PAGE, data, bv.bv_len) == 0
  - the fastest one, below:
-Eric
---
block/bio.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/md/dm-thin.c | 10 +++++++
include/linux/bio.h | 1 +
3 files changed, 78 insertions(+), 0 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 8c2e55e..9100d35 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -511,6 +511,73 @@ void zero_fill_bio(struct bio *bio)
}
EXPORT_SYMBOL(zero_fill_bio);
+bool bio_is_zero_filled(struct bio *bio)
+{
+ unsigned i, count;
+ unsigned long flags;
+ struct bio_vec bv;
+ struct bvec_iter iter;
+ bio_for_each_segment(bv, bio, iter) {
+ char *data = bvec_kmap_irq(&bv, &flags);
+ char *p = data;
+ uintptr_t *parch;
+ int left = bv.bv_len;
+
+ if (unlikely( data == NULL ))
+ continue;
+
+
+ /* check unaligned bytes at the beginning of p */
+ if (unlikely( ( (uintptr_t)p & (sizeof(uintptr_t)-1) ) != 0 )) {
+ count = sizeof(uintptr_t) - ( (uintptr_t)p & (sizeof(uintptr_t)-1) );
+ for (i = 0; i < count; i++) {
+ if (*p) {
+ bvec_kunmap_irq(data, &flags);
+ return false;
+ }
+ p++;
+ }
+ left -= count;
+ }
+
+ /* we should be word aligned now */
+ BUG_ON(unlikely( ((uintptr_t)p & (sizeof(uintptr_t)-1) ) != 0 ));
+
+ /* now check in word-sized chunks */
+ parch = (uintptr_t*)p;
+ count = left >> ilog2(sizeof(uintptr_t)); /* count = left / sizeof(uintptr_t) */;
+ for (i = 0; i < count; i++) {
+ if (*parch) {
+ bvec_kunmap_irq(data, &flags);
+ return false;
+ }
+ parch++;
+ }
+ left -= count << ilog2(sizeof(uintptr_t)); /* left -= count*sizeof(uintptr_t) */
+
+ /* check remaining unaligned values at the end */
+ p = (char*)parch;
+ if (unlikely(left > 0))
+ {
+ for (i = 0; i < left; i++) {
+ if (*p) {
+ bvec_kunmap_irq(data, &flags);
+ return false;
+ }
+ p++;
+ }
+ left = 0;
+ }
+
+ bvec_kunmap_irq(data, &flags);
+ BUG_ON(unlikely( left > 0 ));
+ BUG_ON(unlikely( data+bv.bv_len != p ));
+ }
+
+ return true;
+}
+EXPORT_SYMBOL(bio_is_zero_filled);
+
/**
* bio_put - release a reference to a bio
* @bio: bio to release reference to
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index fc9c848..6a0c2c0 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1258,6 +1258,16 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
return;
}
+ /*
+ * Optimize away writes of all zeroes, subsequent reads to
+ * associated unprovisioned blocks will be zero filled.
+ */
+ if (unlikely(bio_is_zero_filled(bio))) {
+ cell_defer_no_holder(tc, cell);
+ bio_endio(bio, 0);
+ return;
+ }
+
r = alloc_data_block(tc, &data_block);
switch (r) {
case 0:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5a64576..abb46f7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -419,6 +419,7 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,
int, int, gfp_t);
extern int bio_uncopy_user(struct bio *);
void zero_fill_bio(struct bio *bio);
+bool bio_is_zero_filled(struct bio *bio);
extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
extern unsigned int bvec_nr_vecs(unsigned short idx);
--
1.7.1
* Re: dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-07 6:30 ` Eric Wheeler
@ 2014-12-07 6:45 ` Eric Wheeler
-1 siblings, 0 replies; 46+ messages in thread
From: Eric Wheeler @ 2014-12-07 6:45 UTC (permalink / raw)
To: LVM2 development; +Cc: Jens Axboe, dm-devel, ejt
Introduce bio_is_zero_filled() and use it to optimize away writing all
zeroes to unprovisioned blocks. Subsequent reads to the associated
unprovisioned blocks will be zero filled. bio_is_zero_filled now
works with unaligned bvec data.
Signed-off-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
---
> Also, attached is the patch that supports uintptr_t word sized 0-checks.
Re-sending. I think I've fixed my MUA's tendency to break patches.
block/bio.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/md/dm-thin.c | 10 +++++++
include/linux/bio.h | 1 +
3 files changed, 78 insertions(+), 0 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 8c2e55e..9100d35 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -511,6 +511,73 @@ void zero_fill_bio(struct bio *bio)
}
EXPORT_SYMBOL(zero_fill_bio);
+bool bio_is_zero_filled(struct bio *bio)
+{
+ unsigned i, count;
+ unsigned long flags;
+ struct bio_vec bv;
+ struct bvec_iter iter;
+ bio_for_each_segment(bv, bio, iter) {
+ char *data = bvec_kmap_irq(&bv, &flags);
+ char *p = data;
+ uintptr_t *parch;
+ int left = bv.bv_len;
+
+ if (unlikely(data == NULL))
+ continue;
+
+
+ /* check unaligned bytes at the beginning of p */
+ if (unlikely(((uintptr_t)p & (sizeof(uintptr_t)-1)) != 0)) {
+ count = sizeof(uintptr_t) - ((uintptr_t)p & (sizeof(uintptr_t)-1));
+ for (i = 0; i < count; i++) {
+ if (*p) {
+ bvec_kunmap_irq(data, &flags);
+ return false;
+ }
+ p++;
+ }
+ left -= count;
+ }
+
+ /* we should be word aligned now */
+ BUG_ON(((uintptr_t)p & (sizeof(uintptr_t)-1)) != 0);
+
+ /* now check in word-sized chunks */
+ parch = (uintptr_t *)p;
+ count = left >> ilog2(sizeof(uintptr_t)); /* count = left / sizeof(uintptr_t) */
+ for (i = 0; i < count; i++) {
+ if (*parch) {
+ bvec_kunmap_irq(data, &flags);
+ return false;
+ }
+ parch++;
+ }
+ left -= count << ilog2(sizeof(uintptr_t)); /* left -= count*sizeof(uintptr_t) */
+
+ /* check remaining unaligned values at the end */
+ p = (char*)parch;
+ if (unlikely(left > 0))
+ {
+ for (i = 0; i < left; i++) {
+ if (*p) {
+ bvec_kunmap_irq(data, &flags);
+ return false;
+ }
+ p++;
+ }
+ left = 0;
+ }
+
+ bvec_kunmap_irq(data, &flags);
+ BUG_ON(left > 0);
+ BUG_ON(data + bv.bv_len != p);
+ }
+
+ return true;
+}
+EXPORT_SYMBOL(bio_is_zero_filled);
+
/**
* bio_put - release a reference to a bio
* @bio: bio to release reference to
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index fc9c848..6a0c2c0 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1258,6 +1258,16 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
return;
}
+ /*
+ * Optimize away writes of all zeroes, subsequent reads to
+ * associated unprovisioned blocks will be zero filled.
+ */
+ if (unlikely(bio_is_zero_filled(bio))) {
+ cell_defer_no_holder(tc, cell);
+ bio_endio(bio, 0);
+ return;
+ }
+
r = alloc_data_block(tc, &data_block);
switch (r) {
case 0:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5a64576..abb46f7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -419,6 +419,7 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,
int, int, gfp_t);
extern int bio_uncopy_user(struct bio *);
void zero_fill_bio(struct bio *bio);
+bool bio_is_zero_filled(struct bio *bio);
extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
extern unsigned int bvec_nr_vecs(unsigned short idx);
--
1.7.1
--
lvm-devel mailing list
lvm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/lvm-devel
^ permalink raw reply related [flat|nested] 46+ messages in thread
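[Editor's note: the head/word/tail scan in bio_is_zero_filled() above can be modelled outside the kernel as a quick sanity check. The following is a hypothetical userspace harness (the function name and buffer handling are illustrative, not part of the patch):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace model of the scan in bio_is_zero_filled(): check unaligned
 * head bytes, then word-sized chunks, then the unaligned tail. */
static bool buf_is_zero_filled(const unsigned char *data, size_t len)
{
	const unsigned char *p = data;

	/* head: leading bytes up to the first word boundary */
	while (len && ((uintptr_t)p & (sizeof(uintptr_t) - 1))) {
		if (*p++)
			return false;
		len--;
	}

	/* body: one word at a time */
	while (len >= sizeof(uintptr_t)) {
		uintptr_t w;

		memcpy(&w, p, sizeof(w));	/* avoids strict-aliasing issues */
		if (w)
			return false;
		p += sizeof(uintptr_t);
		len -= sizeof(uintptr_t);
	}

	/* tail: trailing bytes shorter than a word */
	while (len--) {
		if (*p++)
			return false;
	}
	return true;
}
```

Feeding it buffers that start at odd offsets exercises the unaligned-head path that the patch description calls out.]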
* Re: [lvm-devel] dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-07 6:30 ` Eric Wheeler
@ 2014-12-08 16:57 ` Jens Axboe
-1 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2014-12-08 16:57 UTC (permalink / raw)
To: Eric Wheeler; +Cc: dm-devel, ejt, LVM2 development
On 12/06/2014 11:30 PM, Eric Wheeler wrote:
> On Sat, 6 Dec 2014, Jens Axboe wrote:
>> On 12/06/2014 03:40 PM, Eric Wheeler wrote:
>>> On Fri, 5 Dec 2014, Mike Snitzer wrote:
>>>>> I do wonder what the performance impact is on this for dm. Have you
>>>>> tried a (worst case) test of writing blocks that are zero filled, but
>>>>> with the last byte not being a zero?
>>>
>>> The additional overhead of worst-case should be (nearly) equal to the
>>> simplest test case of dd if=/dev/zero of=/dev/thinp/vol. In my testing
>>> that was 1.4GB/s within KVM on an Intel Xeon(R) CPU E3-1230 V2 @
>>> 3.30GHz.
>>
>> That seems way too slow for checking if it's zero or not... Memory
>> bandwidth should be way higher than that. The line above, was that
>> what you ran? How does it look with bs=4k or higher?
>
> In userspace I can get ~12GB/s, so I think the algorithm is sound.
> dd might not be the right tool for this.
It's straightforward looping through the vectors; it can't really work
any other way. But we need to figure out why it's so slow...
>> read : io=12233MB, bw=1432.7MB/s, iops=22922, runt= 8539msec
>
> Can you suggest the right fio commandline to test sequential writes of
> all zeros? I tried --zero_buffers but couldn't get it to write zeros,
> writes kept going to disk.
zero_buffers=1
scramble_buffers=0
should get you all zeroes.
--
Jens Axboe
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-08 16:57 ` Jens Axboe
@ 2014-12-09 8:02 ` Eric Wheeler
-1 siblings, 0 replies; 46+ messages in thread
From: Eric Wheeler @ 2014-12-09 8:02 UTC (permalink / raw)
To: Jens Axboe; +Cc: dm-devel, ejt, LVM2 development
On Fri, 5 Dec 2014, Mike Snitzer wrote:
> I do wonder what the performance impact is on this for dm. Have you
> tried a (worst case) test of writing blocks that are zero filled,
Jens, thank you for your help w/ fio for generating zeroed writes!
Clearly fio is superior to dd as a sequential benchmarking tool; I was
actually able to push on the system's memory bandwidth.
Results:
I hacked drivers/block/loop.c and drivers/md/dm-thin.c to always call bio_is_zero_filled()
and then complete without writing to disk, regardless of the return value
from bio_is_zero_filled(). In loop.c this was done in
do_bio_filebacked(), and for dm-thin.c this was done within
provision_block().
This allows us to compare the performance difference between the simple
loopback block device driver vs the more complex dm-thinp implementation
just prior to block allocation. These benchmarks give us a sense of how
performance differences relate between bio_is_zero_filled() and block
device implementation complexity, in addition to the raw performance of
bio_is_zero_filled in best- and worst-case scenarios.
Since we always complete without writing after the call to
bio_is_zero_filled, regardless of the bio's content (all zeros or not), we
can benchmark the difference in the common use case of random data, as
well as the edge case of skipping writes for bios that contain all zeros
when writing to unallocated space of thin-provisioned volumes.
These benchmarks were performed under KVM, so expect them to be lower
bounds due to overhead. The hardware is an Intel(R) Xeon(R) CPU E3-1230 V2
@ 3.30GHz. The VM was allocated 4GB of memory with 4 cpu cores.
Benchmarks were performed using fio-2.1.14-33-gf8b8f
--name=writebw
--rw=write
--time_based
--runtime=7 --ramp_time=3
--norandommap
--ioengine=libaio
--group_reporting
--direct=1
--bs=1m
--filename=/dev/X
--numjobs=Y
Random data was tested using:
--zero_buffers=0 --scramble_buffers=1
Zeroed data was tested using:
--zero_buffers=1 --scramble_buffers=0
Values below are from aggrb.
dm-thinp (MB/s) loopback (MB/s) loop faster by factor of
==============+======================================================
random jobs=4 | 18496.0 33522.0 1.68x
zeros jobs=4 | 8119.2 9767.2 1.20x
==============+======================================================
random jobs=1 | 7330.5 12330.0 1.81x
zeros jobs=1 | 4965.2 6799.9 1.11x
We can see that fio reports a best-case performance of 33.5GB/s with
random data using 4 jobs in this test environment within loop.c.
For the real-world best-case within dm-thinp, fio reports 18.4GB/s, which
is relevant for use cases where bio vectors tend to contain non-zero
data, particularly toward the beginning of the vector set.
I expect that the performance difference between loop.c and dm-thinp is
due to implementation complexity of the block device driver, such as
checking the metadata to see if a block must be allocated before calling
provision_block().
(Note that it may be possible for these test values to exceed the memory
bandwidth of the system since we exit early if finding non-zero data in a
biovec, thus the remaining data is not actually inspected but is counted
by fio. Worst-case values should all be below the memory bandwidth
maximum since all data is inspected. I believe memtest86+ says my memory
bandwidth is ~17GB/s.)
--
Eric Wheeler, President eWheeler, Inc. dba Global Linux Security
888-LINUX26 (888-546-8926) Fax: 503-716-3878 PO Box 25107
www.GlobalLinuxSecurity.pro Linux since 1996! Portland, OR 97298
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-08 16:57 ` Jens Axboe
@ 2014-12-09 14:38 ` Marian Csontos
-1 siblings, 0 replies; 46+ messages in thread
From: Marian Csontos @ 2014-12-09 14:38 UTC (permalink / raw)
To: LVM2 development, Eric Wheeler; +Cc: dm-devel, ejt
On 12/08/2014 05:57 PM, Jens Axboe wrote:
> On 12/06/2014 11:30 PM, Eric Wheeler wrote:
>> On Sat, 6 Dec 2014, Jens Axboe wrote:
>>> On 12/06/2014 03:40 PM, Eric Wheeler wrote:
>>>> On Fri, 5 Dec 2014, Mike Snitzer wrote:
>>>>>> I do wonder what the performance impact is on this for dm. Have you
>>>>>> tried a (worst case) test of writing blocks that are zero filled, but
>>>>>> with the last byte not being a zero?
>>>>
>>>> The additional overhead of worst-case should be (nearly) equal to the
>>>> simplest test case of dd if=/dev/zero of=/dev/thinp/vol. In my testing
>>>> that was 1.4GB/s within KVM on an Intel Xeon(R) CPU E3-1230 V2 @
>>>> 3.30GHz.
>>>
>>> That seems way too slow for checking if it's zero or not... Memory
>>> bandwidth should be way higher than that. The line above, was that
>>> what you ran? How does it look with bs=4k or higher?
>>
>> In userspace I can get ~12GB/s, so I think the algorithm is sound.
>> dd might not be the right tool for this.
>
> It's straight forward looping through the vectors, it can't really work
> any other way. But we need to figure out why it's so slow...
This may be a premature optimization without any supporting data, but as
this will be a frequently running loop, it is worth optimizing.
Two tips:
1. Is the compiler unrolling loops?
Using inline bvec_kunmap_irq in the loop may prevent the compiler from
doing so.
Try breaking out of the loop when *parch is non-zero and calling it
outside the loop only when i >= count.
2. The function is doing quite a lot of jumping around making CPU
pipeline mostly useless.
Try using kernel's built-in memcmp, which I expect to be optimized, and
compare with a zero-page.
Perhaps doing a few useless bit-OR ops for every write would be more
effective than this constant jumping.
-- Martian
>
>>> read : io=12233MB, bw=1432.7MB/s, iops=22922, runt= 8539msec
>>
>> Can you suggest the right fio commandline to test sequential writes if
>> all zeros? I tried --zero_buffers but couldn't get it to write zeros,
>> writes kept going to disk.
>
> zero_buffers=1
> scramble_buffers=0
>
> should get you all zeroes.
>
^ permalink raw reply [flat|nested] 46+ messages in thread
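[Editor's note: Marian's second tip — compare against a zero page with the built-in memcmp — can be sketched in userspace along these lines. This is a hypothetical illustration: in the kernel the reference buffer would be empty_zero_page and the chunk size PAGE_SIZE; both are stand-ins here:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ZERO_CHUNK 4096				/* stands in for PAGE_SIZE */
static const unsigned char zero_chunk[ZERO_CHUNK];	/* stands in for empty_zero_page */

/* Compare the buffer against a page of zeroes with memcmp() instead of
 * a hand-rolled word loop, letting the optimized libc/arch memcmp do
 * the scanning. */
static bool buf_is_zero_memcmp(const unsigned char *buf, size_t len)
{
	while (len >= ZERO_CHUNK) {
		if (memcmp(buf, zero_chunk, ZERO_CHUNK))
			return false;
		buf += ZERO_CHUNK;
		len -= ZERO_CHUNK;
	}
	/* trailing partial chunk, if any */
	return len == 0 || memcmp(buf, zero_chunk, len) == 0;
}
```

The branchy per-byte and per-word loops collapse into a handful of memcmp calls, which addresses the pipeline concern above.]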
* Re: [lvm-devel] dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-09 8:02 ` Eric Wheeler
@ 2014-12-09 15:31 ` Jens Axboe
-1 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2014-12-09 15:31 UTC (permalink / raw)
To: Eric Wheeler; +Cc: dm-devel, ejt, LVM2 development
On 12/09/2014 01:02 AM, Eric Wheeler wrote:
> On Fri, 5 Dec 2014, Mike Snitzer wrote:
>> I do wonder what the performance impact is on this for dm. Have you
>> tried a (worst case) test of writing blocks that are zero filled,
>
> Jens, thank you for your help w/ fio for generating zeroed writes!
> Clearly fio is superior to dd as a sequential benchmarking tool; I was
> actually able to push on the system's memory bandwidth.
>
> Results:
>
> I hacked block/loop.c and md/dm-thin.c to always call bio_is_zero_filled()
> and then complete without writing to disk, regardless of the return value
> from bio_is_zero_filled(). In loop.c this was done in
> do_bio_filebacked(), and for dm-thin.c this was done within
> provision_block().
>
> This allows us to compare the performance difference between the simple
> loopback block device driver vs the more complex dm-thinp implementation
> just prior to block allocation. These benchmarks give us a sense of how
> performance differences relate between bio_is_zero_filled() and block
> device implementation complexity, in addition to the raw performance of
> bio_is_zero_filled in best- and worst-case scenarios.
>
> Since we always complete without writing after the call to
> bio_is_zero_filled, regardless of the bio's content (all zeros or not), we
> can benchmark the difference in the common use case of random data, as
> well as the edge case of skipping writes for bio's that contain all zeros
> when writing to unallocated space of thin-provisioned volumes.
>
> These benchmarks were performed under KVM, so expect them to be lower
> bounds due to overhead. The hardware is a Intel(R) Xeon(R) CPU E3-1230 V2
> @ 3.30GHz. The VM was allocated 4GB of memory with 4 cpu cores.
>
> Benchmarks were performed using fio-2.1.14-33-gf8b8f
> --name=writebw
> --rw=write
> --time_based
> --runtime=7 --ramp_time=3
> --norandommap
> --ioengine=libaio
> --group_reporting
> --direct=1
> --bs=1m
> --filename=/dev/X
> --numjobs=Y
>
> Random data was tested using:
> --zero_buffers=0 --scramble_buffers=1
>
> Zeroed data was tested using:
> --zero_buffers=1 --scramble_buffers=0
>
> Values below are from aggrb.
>
> dm-thinp (MB/s) loopback (MB/s) loop faster by factor of
> ==============+======================================================
> random jobs=4 | 18496.0 33522.0 1.68x
> zeros jobs=4 | 8119.2 9767.2 1.20x
> ==============+======================================================
> random jobs=1 | 7330.5 12330.0 1.81x
> zeros jobs=1 | 4965.2 6799.9 1.11x
This looks more reasonable in terms of throughput.
One major worry here is that checking every write is blowing your cache,
so you could have a major impact on performance in general. Even for
O_DIRECT writes, you are now accessing the memory. Have you looked into
doing non-temporal memory compares instead? I think that would be the
way to go.
--
Jens Axboe
^ permalink raw reply [flat|nested] 46+ messages in thread
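[Editor's note: one way to read "non-temporal memory compares" on x86 is SSE4.1's streaming load, MOVNTDQA. The sketch below is a hypothetical userspace illustration only: the intrinsic usage and the 16-byte alignment requirement are assumptions, kernel code would also need kernel_fpu_begin()/kernel_fpu_end() around any SSE use, and on ordinary write-back memory many CPUs treat the streaming load as a plain load, so the cache benefit is not guaranteed:

```c
#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>

/* OR all 16-byte lanes together using streaming loads, then test the
 * accumulator for zero.  Branchless over the chunk, unlike the early-
 * exit byte/word loops in the posted patch. */
__attribute__((target("sse4.1")))
static bool chunk_is_zero_nt(const void *buf, size_t len)
{
	const __m128i *p = (const __m128i *)buf;	/* must be 16-byte aligned */
	size_t i, n = len / sizeof(__m128i);		/* assumes len % 16 == 0 */
	__m128i acc = _mm_setzero_si128();

	for (i = 0; i < n; i++)
		acc = _mm_or_si128(acc, _mm_stream_load_si128((__m128i *)&p[i]));

	return _mm_testz_si128(acc, acc);	/* true iff every lane was zero */
}
```

Whether this actually avoids polluting the cache for the write payload would need measuring, per Jens' concern above.]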
* Re: [lvm-devel] dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-09 15:31 ` Jens Axboe
@ 2014-12-09 15:41 ` Jens Axboe
-1 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2014-12-09 15:41 UTC (permalink / raw)
To: Eric Wheeler; +Cc: dm-devel, ejt, LVM2 development
On 12/09/2014 08:31 AM, Jens Axboe wrote:
> On 12/09/2014 01:02 AM, Eric Wheeler wrote:
>> On Fri, 5 Dec 2014, Mike Snitzer wrote:
>>> I do wonder what the performance impact is on this for dm. Have you
>>> tried a (worst case) test of writing blocks that are zero filled,
>>
>> Jens, thank you for your help w/ fio for generating zeroed writes!
>> Clearly fio is superior to dd as a sequential benchmarking tool; I was
>> actually able to push on the system's memory bandwidth.
>>
>> Results:
>>
>> I hacked block/loop.c and md/dm-thin.c to always call
>> bio_is_zero_filled()
>> and then complete without writing to disk, regardless of the return value
>> from bio_is_zero_filled(). In loop.c this was done in
>> do_bio_filebacked(), and for dm-thin.c this was done within
>> provision_block().
>>
>> This allows us to compare the performance difference between the simple
>> loopback block device driver vs the more complex dm-thinp implementation
>> just prior to block allocation. These benchmarks give us a sense of how
>> performance differences relate between bio_is_zero_filled() and block
>> device implementation complexity, in addition to the raw performance of
>> bio_is_zero_filled in best- and worst-case scenarios.
>>
>> Since we always complete without writing after the call to
>> bio_is_zero_filled, regardless of the bio's content (all zeros or
>> not), we
>> can benchmark the difference in the common use case of random data, as
>> well as the edge case of skipping writes for bio's that contain all zeros
>> when writing to unallocated space of thin-provisioned volumes.
>>
>> These benchmarks were performed under KVM, so expect them to be lower
>> bounds due to overhead. The hardware is a Intel(R) Xeon(R) CPU
>> E3-1230 V2
>> @ 3.30GHz. The VM was allocated 4GB of memory with 4 cpu cores.
>>
>> Benchmarks were performed using fio-2.1.14-33-gf8b8f
>> --name=writebw
>> --rw=write
>> --time_based
>> --runtime=7 --ramp_time=3
>> --norandommap
>> --ioengine=libaio
>> --group_reporting
>> --direct=1
>> --bs=1m
>> --filename=/dev/X
>> --numjobs=Y
>>
>> Random data was tested using:
>> --zero_buffers=0 --scramble_buffers=1
>>
>> Zeroed data was tested using:
>> --zero_buffers=1 --scramble_buffers=0
>>
>> Values below are from aggrb.
>>
>> dm-thinp (MB/s) loopback (MB/s) loop faster by
>> factor of
>> ==============+======================================================
>> random jobs=4 | 18496.0 33522.0 1.68x
>> zeros jobs=4 | 8119.2 9767.2 1.20x
>> ==============+======================================================
>> random jobs=1 | 7330.5 12330.0 1.81x
>> zeros jobs=1 | 4965.2 6799.9 1.11x
>
> This looks more reasonable in terms of throughput.
>
> One major worry here is that checking every write is blowing your cache,
> so you could have a major impact on performance in general. Even for
> O_DIRECT writes, you are now accessing the memory. Have you looked into
> doing non-temporal memory compares instead? I think that would be the
> way to go.
So I found your patch in the thread. For each vector, use memcmp()
instead and hope it does the right thing. You can compare with
empty_zero_page. That should drastically cut down on the amount of hand
rolled code you have in bio_is_zero_filled() at the moment.
--
Jens Axboe
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH] dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-09 15:41 ` Jens Axboe
@ 2014-12-10 2:52 ` Eric Wheeler
-1 siblings, 0 replies; 46+ messages in thread
From: Eric Wheeler @ 2014-12-10 2:52 UTC (permalink / raw)
To: Jens Axboe; +Cc: dm-devel, Marian Csontos, ejt, Eric Wheeler, LVM2 development
> > > dm-thinp (MB/s) loopback (MB/s) loop faster by
> > >
> > > ==============+======================================================
> > > random jobs=4 | 18496.0 33522.0 1.68x
> > > zeros jobs=4 | 8119.2 9767.2 1.20x
> > > ==============+======================================================
> > > random jobs=1 | 7330.5 12330.0 1.81x
> > > zeros jobs=1 | 4965.2 6799.9 1.11x
> >
> > This looks more reasonable in terms of throughput.
> >
> > One major worry here is that checking every write is blowing your cache,
> > so you could have a major impact on performance in general. Even for
> > O_DIRECT writes, you are now accessing the memory. Have you looked into
> > doing non-temporal memory compares instead? I think that would be the
> > way to go.
>
> So I found your patch in the thread. For each vector, use memcmp() instead and
> hope it does the right thing. You can compare with empty_zero_page. That
> should drastically cut down on the amount of hand rolled code you have in
> bio_is_zero_filled() at the moment.
I ran these against dm-thinp within provision_block() when checking
memcmp(page_address(ZERO_PAGE(0)), data, bv.bv_len) != 0
and got these numbers that show memcmp() based testing is rather slow:
zeroed jobs=1 1268.1 MB/s
zeroed jobs=4 1445.6 MB/s
random jobs=1 7554.1 MB/s
random jobs=4 17677.0 MB/s
I updated the checking code to use __uint128_t, and performance got even
better than with 64-bit integers (note that gcc is emulating 128bit).
These are the numbers w/ __uint128_t:
zeroed jobs=1 6424 MB/s
zeroed jobs=4 10738 MB/s
random jobs=1 7928 MB/s
random jobs=4 19435 MB/s
Jens says to try non-temporal memory compares to deal with cache
invalidation, but that sounds like arch-specific assembler unless I am
mistaken. Does GCC have something I can use to flag the buffer as not to
be cached? If not, I'm happy to poke some asm code into the check
routine, but I'm a bit rusty there.
Can someone suggest the inline asm that I would use to rep until a
non-zero value appears? Assume that we are 8- or 16-byte aligned; presumably
the wider the word-sized checks, the better, performance-wise.
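The head/word/tail structure described here can be sketched in portable userspace C. Assumptions: a GCC/Clang compiler that provides `__uint128_t`, and `memcpy` for the wide loads so the sketch stays legal without the kernel's `-fno-strict-aliasing` (the compiler lowers it to a single aligned load). This models a single buffer, mirroring the shape of the attached bio_is_zero_filled() without the bvec iteration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static bool buf_is_zero_filled(const unsigned char *p, size_t len)
{
	const size_t word = sizeof(__uint128_t);

	/* head: consume bytes until p is 16-byte aligned */
	while (len > 0 && ((uintptr_t)p & (word - 1)) != 0) {
		if (*p++)
			return false;
		len--;
	}

	/* body: one 16-byte word per iteration */
	while (len >= word) {
		__uint128_t v;

		memcpy(&v, p, word);
		if (v)
			return false;
		p += word;
		len -= word;
	}

	/* tail: leftover unaligned bytes */
	while (len > 0) {
		if (*p++)
			return false;
		len--;
	}
	return true;
}
```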
My latest version is attached, and includes moving bvec_kunmap_irq out of
the for loops in hopes of getting gcc to unroll loops.
-Eric
---
block/bio.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 74 insertions(+), 0 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 8c2e55e..262f190 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -511,6 +511,80 @@ void zero_fill_bio(struct bio *bio)
}
EXPORT_SYMBOL(zero_fill_bio);
+bool bio_is_zero_filled(struct bio *bio)
+{
+ unsigned i, count, left;
+ unsigned long flags;
+ struct bio_vec bv;
+ struct bvec_iter iter;
+ char *data = NULL, *p;
+ __uint128_t *parch;
+
+ bio_for_each_segment(bv, bio, iter) {
+ data = bvec_kmap_irq(&bv, &flags);
+ p = data;
+ left = bv.bv_len;
+
+ if (unlikely( data == NULL ))
+ continue;
+
+
+ /* check unaligned bytes at the beginning of p */
+ if (unlikely( ( (uintptr_t)p & (sizeof(__uint128_t)-1) ) != 0 )) {
+ count = sizeof(__uint128_t) - ( (uintptr_t)p & (sizeof(__uint128_t)-1) );
+ for (i = 0; i < count; i++) {
+ if (*p)
+ break;
+ p++;
+ }
+ if (i < count)
+ goto out_false;
+ left -= count;
+ }
+
+ /* we should be word aligned now */
+ BUG_ON(unlikely( ((uintptr_t)p & (sizeof(__uint128_t)-1) ) != 0 ));
+
+ /* now check in word-sized chunks */
+ parch = (__uint128_t*)p;
+ count = left >> ilog2(sizeof(__uint128_t)); /* count = left / sizeof(__uint128_t) */;
+ for (i = 0; i < count; i++) {
+ if (*parch)
+ break;
+ parch++;
+ }
+ if (i < count)
+ goto out_false;
+ left -= count << ilog2(sizeof(__uint128_t)); /* left -= count*sizeof(__uint128_t) */
+
+ /* check remaining unaligned values at the end */
+ p = (char*)parch;
+ if (unlikely(left > 0))
+ {
+ for (i = 0; i < left; i++) {
+ if (*p)
+ break;
+ p++;
+ }
+ if (i < left)
+ goto out_false;
+ left = 0;
+ }
+
+ bvec_kunmap_irq(data, &flags);
+ BUG_ON(unlikely( left > 0 ));
+ BUG_ON(unlikely( data+bv.bv_len != p ));
+ }
+
+ return true;
+
+out_false:
+ if (data != NULL)
+ bvec_kunmap_irq(data, &flags);
+ return false;
+}
+EXPORT_SYMBOL(bio_is_zero_filled);
+
/**
* bio_put - release a reference to a bio
* @bio: bio to release reference to
--
1.7.1
--
lvm-devel mailing list
lvm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/lvm-devel
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH] dm thin: optimize away writing all zeroes to unprovisioned blocks
2014-12-09 15:31 ` Jens Axboe
@ 2015-01-26 2:53 ` Eric Wheeler
-1 siblings, 0 replies; 46+ messages in thread
From: Eric Wheeler @ 2015-01-26 2:53 UTC (permalink / raw)
To: Jens Axboe; +Cc: dm-devel, ejt, LVM2 development
Introduce bio_is_zero_filled() and use it to optimize away writing all
zeroes to unprovisioned blocks. Subsequent reads to the associated
unprovisioned blocks will be zero filled.
Single-thread performance evaluates zero bio's at ~8137MB/s under KVM on a
Xeon E3-1230. Zero-checks attempt to minimize cache effects on non-zero
data. bio_is_zero_filled works with unaligned bvec data. (Using memcmp
and comparing against the zero page is ~5x slower, so this implementation
was optimized to increase pipelined execution.)
The pool_feature pf.skip_zero_writes is implemented and configurable at
creation time or via table reload.
To test we prepare two dm-thinp volumes test1 and test2 of equal size.
We format test1 without the patch, mount, and extract the Linux source
tree onto the test1 filesystem, and unmount. Finally, we activate
skip_zero_writes, dd test1 over test2, and verify checksums:
b210f032a6465178103317f3c40ab59f /dev/test/test1
b210f032a6465178103317f3c40ab59f /dev/test/test2
Notice the allocation difference for thin volumes test1 and test2 (after
dd if=test1 of=test2), even though they have the same md5sum:
LV VG Attr LSize Pool Origin Data%
test1 test Vwi-a-tz-- 4.00g thinp 22.04
test2 test Vwi-a-tz-- 4.00g thinp 18.33
An additional 3.71% of space was saved by the patch, and so were the
~150MB of (possibly random) IOs that would have hit disk, not to mention
reads that now bypass the disk since they are unallocated. We also save
the metadata overhead of ~2400 provision_block() allocations.
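The figures above can be sanity-checked: 22.04 - 18.33 = 3.71 percentage points of a 4.00 GiB volume is about 152 MB, which at a 64 KiB data block size (an assumption; the commit message does not state the pool's chunk size, but 64 KiB is the smallest the target supports and it matches the numbers) comes to roughly 2430 blocks, consistent with the ~150MB and ~2400 allocations quoted. A throwaway check:

```c
#include <assert.h>

/* MiB saved when a volume's Data% drops by pct_points. */
static double saved_mib(double volume_gib, double pct_points)
{
	return volume_gib * 1024.0 * pct_points / 100.0;
}

/* Number of thin-pool data blocks that amount of space represents,
 * for a given block size in KiB (64 KiB here is an assumption). */
static double saved_blocks(double mib, double block_kib)
{
	return mib * 1024.0 / block_kib;
}
```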
Signed-off-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: ejt@redhat.com
---
Documentation/device-mapper/thin-provisioning.txt | 2 +
block/bio.c | 103 +++++++++++++++++++++
drivers/md/dm-thin.c | 27 ++++++
include/linux/bio.h | 1 +
4 files changed, 133 insertions(+), 0 deletions(-)
http://www.globallinuxsecurity.pro/
diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt
index 2f51735..7304ccf 100644
--- a/Documentation/device-mapper/thin-provisioning.txt
+++ b/Documentation/device-mapper/thin-provisioning.txt
@@ -266,6 +266,8 @@ i) Constructor
error_if_no_space: Error IOs, instead of queueing, if no space.
+ enable_zero_writes: Enable all-zero writes to unallocated blocks.
+
Data block size must be between 64KB (128 sectors) and 1GB
(2097152 sectors) inclusive.
diff --git a/block/bio.c b/block/bio.c
index 8c2e55e..7e372b7 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -511,6 +511,109 @@ void zero_fill_bio(struct bio *bio)
}
EXPORT_SYMBOL(zero_fill_bio);
+__attribute__((optimize("unroll-loops")))
+bool bio_is_zero_filled(struct bio *bio)
+{
+ struct bio_vec bv;
+ struct bvec_iter iter;
+ char *data, *p;
+ __uint128_t *p128;
+
+ unsigned i, count, left;
+ unsigned long flags;
+
+ /* Align to the data chunk size */
+ const int align_to = sizeof(__uint128_t);
+
+ p128 = NULL;
+ bio_for_each_segment(bv, bio, iter) {
+ data = bvec_kmap_irq(&bv, &flags);
+ p = data;
+ left = bv.bv_len;
+
+ if (unlikely( data == NULL ))
+ continue;
+
+ /* check unaligned bytes at the beginning of p, if any */
+ if (unlikely( ( (uintptr_t)p & (align_to-1) ) != 0 )) {
+ count = align_to - ( (uintptr_t)p & (align_to-1) );
+ for (i = 0; i < count; i++) {
+ if (*p)
+ break;
+ p++;
+ }
+ if (i < count)
+ goto out_false;
+ left -= count;
+ }
+
+ /* we should be align_to-byte aligned now */
+ BUG_ON(unlikely( ((uintptr_t)p & (align_to-1) ) != 0 ));
+
+ p128 = (__uint128_t*)p;
+ count = left >> 9; /* count = left / (32*16) */
+
+ /* Quickly sample first, middle, and last values
+ * and exit early without hitting the loop if nonzero.
+ * This reduces cache effects for non-zero data and has almost
+ * no effective penalty since these values will probably be in
+ * cache when we test them below.
+ */
+ if (likely(count) &&
+ (p128[0] || p128[left>>5] || p128[(left>>4)-1]))
+ goto out_false;
+
+ /* We are using __uint128_t here, so 32*16=512 bytes per loop
+ * Writing out the 32 operations helps with pipelining. */
+ for (i = 0; i < count; i++) {
+ if (p128[ 0] | p128[ 1] | p128[ 2] | p128[ 3] |
+ p128[ 4] | p128[ 5] | p128[ 6] | p128[ 7] |
+ p128[ 8] | p128[ 9] | p128[10] | p128[11] |
+ p128[12] | p128[13] | p128[14] | p128[15] |
+ p128[16] | p128[17] | p128[18] | p128[19] |
+ p128[20] | p128[21] | p128[22] | p128[23] |
+ p128[24] | p128[25] | p128[26] | p128[27] |
+ p128[28] | p128[29] | p128[30] | p128[31])
+ break;
+
+ p128 += 32;
+ }
+ if (i < count)
+ goto out_false;
+
+ left -= count << 9; /* left -= count * 32*16 */
+
+ /* check remaining unaligned values at the end, if any */
+ if (unlikely(left > 0)) {
+ p = (char*)p128;
+ for (i = 0; i < left; i++) {
+ if (*p)
+ break;
+ p++;
+ }
+ if (i < left)
+ goto out_false;
+
+ /* Fixup values for the BUG_ON checks below */
+ p128 = (__uint128_t*)p;
+ left = 0;
+ }
+
+ bvec_kunmap_irq(data, &flags);
+ BUG_ON(unlikely( left > 0 ));
+ BUG_ON(unlikely( data+bv.bv_len != (char*)p128 ));
+ }
+
+ return true;
+
+out_false:
+ if (likely(data != NULL))
+ bvec_kunmap_irq(data, &flags);
+
+ return false;
+}
+EXPORT_SYMBOL(bio_is_zero_filled);
+
/**
* bio_put - release a reference to a bio
* @bio: bio to release reference to
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index fc9c848..d0cebd2 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -151,6 +151,7 @@ struct pool_features {
bool discard_enabled:1;
bool discard_passdown:1;
bool error_if_no_space:1;
+ bool skip_zero_writes:1;
};
struct thin_c;
@@ -1258,6 +1259,16 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
return;
}
+ /*
+ * Skip writes of all zeroes to unprovisioned blocks.
+ */
+
+ if (tc->pool->pf.skip_zero_writes && bio_is_zero_filled(bio) ) {
+ cell_defer_no_holder(tc, cell);
+ bio_endio(bio, 0);
+ return;
+ }
+
r = alloc_data_block(tc, &data_block);
switch (r) {
case 0:
@@ -2063,6 +2074,7 @@ static void pool_features_init(struct pool_features *pf)
pf->discard_enabled = true;
pf->discard_passdown = true;
pf->error_if_no_space = false;
+ pf->skip_zero_writes = true;
}
static void __pool_destroy(struct pool *pool)
@@ -2310,6 +2322,9 @@ static int parse_pool_features(struct dm_arg_set *as, struct pool_features *pf,
else if (!strcasecmp(arg_name, "error_if_no_space"))
pf->error_if_no_space = true;
+ else if (!strcasecmp(arg_name, "enable_zero_writes"))
+ pf->skip_zero_writes = false;
+
else {
ti->error = "Unrecognised pool feature requested";
r = -EINVAL;
@@ -2393,6 +2408,7 @@ static dm_block_t calc_metadata_threshold(struct pool_c *pt)
* no_discard_passdown: don't pass discards down to the data device
* read_only: Don't allow any changes to be made to the pool metadata.
* error_if_no_space: error IOs, instead of queueing, if no space.
+ * enable_zero_writes: Enable all-zero writes to unallocated blocks.
*/
static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
@@ -2511,6 +2527,9 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
}
ti->private = pt;
+ /* The pf skip_zero_writes is safe to change at any time */
+ pool->pf.skip_zero_writes = pf.skip_zero_writes;
+
r = dm_pool_register_metadata_threshold(pt->pool->pmd,
calc_metadata_threshold(pt),
metadata_low_callback,
@@ -2936,6 +2955,9 @@ static void emit_flags(struct pool_features *pf, char *result,
if (pf->error_if_no_space)
DMEMIT("error_if_no_space ");
+
+ if (!pf->skip_zero_writes)
+ DMEMIT("enable_zero_writes ");
}
/*
@@ -3043,6 +3065,11 @@ static void pool_status(struct dm_target *ti, status_type_t type,
else
DMEMIT("queue_if_no_space ");
+ if (!pool->pf.skip_zero_writes)
+ DMEMIT("enable_zero_writes ");
+ else
+ DMEMIT("skip_zero_writes ");
+
break;
case STATUSTYPE_TABLE:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5a64576..abb46f7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -419,6 +419,7 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,
int, int, gfp_t);
extern int bio_uncopy_user(struct bio *);
void zero_fill_bio(struct bio *bio);
+bool bio_is_zero_filled(struct bio *bio);
extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
extern unsigned int bvec_nr_vecs(unsigned short idx);
--
1.7.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [lvm-devel] dm thin: optimize away writing all zeroes to unprovisioned blocks
2015-01-26 2:53 ` Eric Wheeler
@ 2015-02-15 0:31 ` Eric Wheeler
-1 siblings, 0 replies; 46+ messages in thread
From: Eric Wheeler @ 2015-02-15 0:31 UTC (permalink / raw)
To: LVM2 development; +Cc: Jens Axboe, dm-devel, Marian Csontos, ejt, Mike Snitzer
Hey Jens / Joe / Mike / Marian:
I'm checking in for additional feedback on the patch below (intact in
previous email of this thread) in the hope of meeting the merge window for
3.20.
I have implemented the following based on your suggestions:
* pipelining for performance
* showed that memcmp against zero page is far slower than
directly checking for zero
* minimized cache effects of nonzero data with early minimal nonzero
testing before hitting the pipeline
* implemented configurability via table reload (pf->skip_zero_writes)
* implemented status via dmsetup status
Do you have any additional considerations that should be addressed before
pushing this upstream?
-Eric
On Sun, 25 Jan 2015, Eric Wheeler wrote:
> Introduce bio_is_zero_filled() and use it to optimize away writing all
> zeroes to unprovisioned blocks. Subsequent reads to the associated
> unprovisioned blocks will be zero filled.
>
> Single-thread performance evaluates zero bios at ~8137 MB/s under KVM on a
> Xeon E3-1230. Zero-checks attempt to minimize cache effects on non-zero
> data. bio_is_zero_filled() works with unaligned bvec data. (Using memcmp
> to compare against the zero page is ~5x slower, so this implementation
> was optimized to increase pipelined execution.)
>
> The pool_feature pf.skip_zero_writes is implemented and configurable at
> creation time or via table reload.
>
> To test, we prepare two dm-thinp volumes, test1 and test2, of equal size.
> We format test1 without the patch, mount it, extract the Linux source
> tree onto the test1 filesystem, and unmount. Finally, we activate
> skip_zero_writes, dd test1 over test2, and verify checksums:
> b210f032a6465178103317f3c40ab59f /dev/test/test1
> b210f032a6465178103317f3c40ab59f /dev/test/test2
>
> Notice the allocation difference for thin volumes test1 and test2 (after
> dd if=test1 of=test2), even though they have the same md5sum:
> LV VG Attr LSize Pool Origin Data%
> test1 test Vwi-a-tz-- 4.00g thinp 22.04
> test2 test Vwi-a-tz-- 4.00g thinp 18.33
>
> The patch saved an additional 3.71% of space, along with the ~150MB of
> (possibly random) IOs that would otherwise have hit disk, not to mention
> reads that now bypass the disk entirely because the blocks remain
> unallocated. We also save the metadata overhead of ~2400
> provision_block() allocations.
>
>
> Signed-off-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: ejt@redhat.com
>
> ---
> Documentation/device-mapper/thin-provisioning.txt | 2 +
> block/bio.c | 103 +++++++++++++++++++++
> drivers/md/dm-thin.c | 27 ++++++
> include/linux/bio.h | 1 +
> 4 files changed, 133 insertions(+), 0 deletions(-)
> http://www.globallinuxsecurity.pro/
>
> diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt
> index 2f51735..7304ccf 100644
> --- a/Documentation/device-mapper/thin-provisioning.txt
> +++ b/Documentation/device-mapper/thin-provisioning.txt
> @@ -266,6 +266,8 @@ i) Constructor
>
> error_if_no_space: Error IOs, instead of queueing, if no space.
>
> + enable_zero_writes: Enable all-zero writes to unallocated blocks.
> +
> Data block size must be between 64KB (128 sectors) and 1GB
> (2097152 sectors) inclusive.
>
> diff --git a/block/bio.c b/block/bio.c
> index 8c2e55e..7e372b7 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -511,6 +511,109 @@ void zero_fill_bio(struct bio *bio)
> }
> EXPORT_SYMBOL(zero_fill_bio);
>
> +__attribute__((optimize("unroll-loops")))
> +bool bio_is_zero_filled(struct bio *bio)
> +{
> + struct bio_vec bv;
> + struct bvec_iter iter;
> + char *data, *p;
> + __uint128_t *p128;
> +
> + unsigned i, count, left;
> + unsigned long flags;
> +
> + /* Align to the data chunk size */
> + const int align_to = sizeof(__uint128_t);
> +
> + p128 = NULL;
> + bio_for_each_segment(bv, bio, iter) {
> + data = bvec_kmap_irq(&bv, &flags);
> + p = data;
> + left = bv.bv_len;
> +
> + if (unlikely( data == NULL ))
> + continue;
> +
> + /* check unaligned bytes at the beginning of p, if any */
> + if (unlikely( ( (uintptr_t)p & (align_to-1) ) != 0 )) {
> + count = align_to - ( (uintptr_t)p & (align_to-1) );
> + for (i = 0; i < count; i++) {
> + if (*p)
> + break;
> + p++;
> + }
> + if (i < count)
> + goto out_false;
> + left -= count;
> + }
> +
> + /* we should be align_to-byte aligned now */
> + BUG_ON(unlikely( ((uintptr_t)p & (align_to-1) ) != 0 ));
> +
> + p128 = (__uint128_t*)p;
> + count = left >> 9; /* count = left / (32*16) */
> +
> + /* Quickly sample first, middle, and last values
> + * and exit early without hitting the loop if nonzero.
> + * This reduces cache effects for non-zero data and has almost
> + * no effective penalty since these values will probably be in
> + * cache when we test them below.
> + */
> + if (likely(count) &&
> + (p128[0] || p128[left>>5] || p128[(left>>4)-1]))
> + goto out_false;
> +
> + /* We are using __uint128_t here, so 32*16=512 bytes per loop
> + * Writing out the 32 operations helps with pipelining. */
> + for (i = 0; i < count; i++) {
> + if (p128[ 0] | p128[ 1] | p128[ 2] | p128[ 3] |
> + p128[ 4] | p128[ 5] | p128[ 6] | p128[ 7] |
> + p128[ 8] | p128[ 9] | p128[10] | p128[11] |
> + p128[12] | p128[13] | p128[14] | p128[15] |
> + p128[16] | p128[17] | p128[18] | p128[19] |
> + p128[20] | p128[21] | p128[22] | p128[23] |
> + p128[24] | p128[25] | p128[26] | p128[27] |
> + p128[28] | p128[29] | p128[30] | p128[31])
> + break;
> +
> + p128 += 32;
> + }
> + if (i < count)
> + goto out_false;
> +
> + left -= count << 9; /* left -= count * 32*16 */
> +
> + /* check remaining unaligned values at the end, if any */
> + if (unlikely(left > 0)) {
> + p = (char*)p128;
> + for (i = 0; i < left; i++) {
> + if (*p)
> + break;
> + p++;
> + }
> + if (i < left)
> + goto out_false;
> +
> + /* Fixup values for the BUG_ON checks below */
> + p128 = (__uint128_t*)p;
> + left = 0;
> + }
> +
> + bvec_kunmap_irq(data, &flags);
> + BUG_ON(unlikely( left > 0 ));
> + BUG_ON(unlikely( data+bv.bv_len != (char*)p128 ));
> + }
> +
> + return true;
> +
> +out_false:
> + if (likely(data != NULL))
> + bvec_kunmap_irq(data, &flags);
> +
> + return false;
> +}
> +EXPORT_SYMBOL(bio_is_zero_filled);
> +
> /**
> * bio_put - release a reference to a bio
> * @bio: bio to release reference to
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index fc9c848..d0cebd2 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -151,6 +151,7 @@ struct pool_features {
> bool discard_enabled:1;
> bool discard_passdown:1;
> bool error_if_no_space:1;
> + bool skip_zero_writes:1;
> };
>
> struct thin_c;
> @@ -1258,6 +1259,16 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
> return;
> }
>
> + /*
> + * Skip writes of all zeroes to unprovisioned blocks.
> + */
> +
> + if (tc->pool->pf.skip_zero_writes && bio_is_zero_filled(bio)) {
> + cell_defer_no_holder(tc, cell);
> + bio_endio(bio, 0);
> + return;
> + }
> +
> r = alloc_data_block(tc, &data_block);
> switch (r) {
> case 0:
> @@ -2063,6 +2074,7 @@ static void pool_features_init(struct pool_features *pf)
> pf->discard_enabled = true;
> pf->discard_passdown = true;
> pf->error_if_no_space = false;
> + pf->skip_zero_writes = true;
> }
>
> static void __pool_destroy(struct pool *pool)
> @@ -2310,6 +2322,9 @@ static int parse_pool_features(struct dm_arg_set *as, struct pool_features *pf,
> else if (!strcasecmp(arg_name, "error_if_no_space"))
> pf->error_if_no_space = true;
>
> + else if (!strcasecmp(arg_name, "enable_zero_writes"))
> + pf->skip_zero_writes = false;
> +
> else {
> ti->error = "Unrecognised pool feature requested";
> r = -EINVAL;
> @@ -2393,6 +2408,7 @@ static dm_block_t calc_metadata_threshold(struct pool_c *pt)
> * no_discard_passdown: don't pass discards down to the data device
> * read_only: Don't allow any changes to be made to the pool metadata.
> * error_if_no_space: error IOs, instead of queueing, if no space.
> + * enable_zero_writes: Enable all-zero writes to unallocated blocks.
> */
> static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
> {
> @@ -2511,6 +2527,9 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
> }
> ti->private = pt;
>
> + /* The pf skip_zero_writes is safe to change at any time */
> + pool->pf.skip_zero_writes = pf.skip_zero_writes;
> +
> r = dm_pool_register_metadata_threshold(pt->pool->pmd,
> calc_metadata_threshold(pt),
> metadata_low_callback,
> @@ -2936,6 +2955,9 @@ static void emit_flags(struct pool_features *pf, char *result,
>
> if (pf->error_if_no_space)
> DMEMIT("error_if_no_space ");
> +
> + if (!pf->skip_zero_writes)
> + DMEMIT("enable_zero_writes ");
> }
>
> /*
> @@ -3043,6 +3065,11 @@ static void pool_status(struct dm_target *ti, status_type_t type,
> else
> DMEMIT("queue_if_no_space ");
>
> + if (!pool->pf.skip_zero_writes)
> + DMEMIT("enable_zero_writes ");
> + else
> + DMEMIT("skip_zero_writes ");
> +
> break;
>
> case STATUSTYPE_TABLE:
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 5a64576..abb46f7 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -419,6 +419,7 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,
> int, int, gfp_t);
> extern int bio_uncopy_user(struct bio *);
> void zero_fill_bio(struct bio *bio);
> +bool bio_is_zero_filled(struct bio *bio);
> extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
> extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
> extern unsigned int bvec_nr_vecs(unsigned short idx);
> --
> 1.7.1
>
> --
> lvm-devel mailing list
> lvm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/lvm-devel
>
^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2015-02-15 0:31 UTC | newest]
Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-30 18:38 dm-thinp feature request: skip allocation on writes of all zeroes Eric Wheeler
2014-09-30 19:30 ` Zdenek Kabelac
2014-09-30 20:13 ` Mike Snitzer
2014-09-30 22:38 ` Eric Wheeler
2014-12-04 7:05 ` [PATCH] dm-thinp: skip allocation on writes of all zeros to unallocated blocks Eric Wheeler
2014-12-04 7:25 ` Eric Wheeler
2014-12-04 15:33 ` [PATCH] dm thin: optimize away writing all zeroes to unprovisioned blocks Mike Snitzer
2014-12-04 15:43 ` Mike Snitzer
2014-12-06 22:33 ` Eric Wheeler
2014-12-05 14:47 ` Mike Snitzer
2014-12-06 22:36 ` Eric Wheeler
2014-12-05 17:27 ` [PATCH] " Jens Axboe
2014-12-05 18:33 ` Mike Snitzer
2014-12-06 22:40 ` Eric Wheeler
2014-12-07 1:41 ` [lvm-devel] " Jens Axboe
2014-12-07 6:30 ` Eric Wheeler
2014-12-07 6:45 ` Eric Wheeler
2014-12-08 16:57 ` [lvm-devel] " Jens Axboe
2014-12-09 8:02 ` Eric Wheeler
2014-12-09 15:31 ` [lvm-devel] " Jens Axboe
2014-12-09 15:41 ` [lvm-devel] " Jens Axboe
2014-12-10 2:52 ` [PATCH] " Eric Wheeler
2015-01-26 2:53 ` Eric Wheeler
2015-02-15 0:31 ` [lvm-devel] " Eric Wheeler
2014-12-09 14:38 ` Marian Csontos
2014-12-07 1:36 ` Jens Axboe