From: Yarden Maymon <yarden.maymon@volumez.com>
To: dm-devel@redhat.com
Cc: thornber@redhat.com, Yarden Maymon <yarden.maymon@volumez.com>,
	snitzer@kernel.org, agk@redhat.com
Subject: [PATCH] dm-thin: Improve performance of O_SYNC IOs to mapped data
Date: Sun, 29 Oct 2023 18:17:56 +0200
Message-ID: <20231029161756.27025-1-yarden.maymon@volumez.com>

Running random-write fio benchmarks on dm-thin with mapped data shows a
50% degradation when using O_SYNC:
* dm-thin without O_SYNC - 438k IOPS
* dm-thin with O_SYNC on mapped data - 204k IOPS
* directly on the underlying disk with O_SYNC - 451k IOPS, showing the
  problem is not the disk.

Since the data is already mapped, the same results are expected with O_SYNC.

Currently, all O_SYNC IOs are routed to the slower, deferred path.
This happens early in thin_bio_map(), before checking for conflicting
in-flight IOs or whether the block is already mapped.
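
O_SYNC direct writes are caught by this early test because the block
layer typically submits them with REQ_FUA set (and/or preceded by a
preflush), and op_is_flush() matches exactly those flags. For
reference, a paraphrase of op_is_flush() from include/linux/blk_types.h:

    /* Paraphrased from include/linux/blk_types.h: true for any bio that
     * carries flush semantics, i.e. a preflush or a forced-unit-access
     * (FUA) write - which is how O_SYNC direct writes usually arrive. */
    static inline bool op_is_flush(blk_opf_t op)
    {
        return op & (REQ_FUA | REQ_PREFLUSH);
    }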

Remove the early flush test and route O_SYNC IOs through the regular
data path. An O_SYNC IO to mapped space that does not conflict with
other in-flight IO will be remapped and sent down the faster path.
The behavior of all other O_SYNC IOs is unchanged (they are still
deferred); a condensed sketch of the resulting flow follows the list
below.

An O_SYNC IO will still be deferred if:

* It is not mapped - dm_thin_find_block() returns -ENODATA and the cell
  is deferred.

* There is an in-flight IO to the same virtual key - bio_detain() adds
  the IO to a cell and defers it.

    build_virtual_key(tc->td, block, &key);
    if (bio_detain(tc->pool, &key, bio, &virt_cell))
        return DM_MAPIO_SUBMITTED;

* There is an in-flight IO to the same physical key - bio_detain() adds
  the IO to a cell and defers it.

    build_data_key(tc->td, result.block, &key);
    if (bio_detain(tc->pool, &key, bio, &data_cell)) {
        cell_defer_no_holder(tc, virt_cell);
        return DM_MAPIO_SUBMITTED;
    }
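
For context, here is a condensed, commented sketch of the relevant
thin_bio_map() flow after the change. It is paraphrased from
drivers/md/dm-thin.c; the thin_defer_cell(), inc_all_io_entry() and
remap() calls are reproduced from memory and may differ slightly from
the exact source:

    /* Condensed sketch of thin_bio_map() after the change; pool-mode and
     * requeue handling omitted. Only discards are still deferred up front. */
    if (bio_op(bio) == REQ_OP_DISCARD) {
        thin_defer_bio_with_throttle(tc, bio);
        return DM_MAPIO_SUBMITTED;
    }

    /* Detain against other in-flight IO to the same virtual block. */
    build_virtual_key(tc->td, block, &key);
    if (bio_detain(tc->pool, &key, bio, &virt_cell))
        return DM_MAPIO_SUBMITTED;              /* conflict: deferred */

    r = dm_thin_find_block(td, block, 0, &result);
    switch (r) {
    case 0:
        if (unlikely(result.shared)) {          /* shared block: slow path */
            thin_defer_cell(tc, virt_cell);
            return DM_MAPIO_SUBMITTED;
        }

        /* Detain against other in-flight IO to the same physical block. */
        build_data_key(tc->td, result.block, &key);
        if (bio_detain(tc->pool, &key, bio, &data_cell)) {
            cell_defer_no_holder(tc, virt_cell);
            return DM_MAPIO_SUBMITTED;          /* conflict: deferred */
        }

        inc_all_io_entry(tc->pool, bio);
        cell_defer_no_holder(tc, data_cell);
        cell_defer_no_holder(tc, virt_cell);

        remap(tc, bio, result.block);           /* fast path, now O_SYNC too */
        return DM_MAPIO_REMAPPED;

    case -ENODATA:
    case -EWOULDBLOCK:
        thin_defer_cell(tc, virt_cell);         /* unmapped: slow path */
        return DM_MAPIO_SUBMITTED;

    default:
        /* error handling omitted */
        return DM_MAPIO_SUBMITTED;
    }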

-----------------------------------------------------

Benchmark results:

The benchmarks were run on Ubuntu's 6.2.0-1008 kernel with commit
450e8dee51aa ("dm bufio: improve concurrent IO performance") backported.

fio params: --bs=4k --direct=1 --iodepth=32 --numjobs=8 --time_based
--runtime=5m.
The dm-thin chunk size is 128k and allocation/thin_pool_zero=0 is set.
The results are in IOPs and represented as: avg_iops (max_iops).
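
For reference, one plausible full invocation for the random write sync
case (the device path, job name and --sync=1 flag are assumptions; only
the parameters above are taken from the actual runs):

    fio --name=randwrite-sync --filename=/dev/mapper/pool-thin \
        --rw=randwrite --sync=1 --bs=4k --direct=1 --iodepth=32 \
        --numjobs=8 --time_based --runtime=5m --group_reporting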

Performance test on the underlying nvme device for baseline:
+-------------------+-----------------------+
| randwrite         | 446k (455k)           |
| randwrite sync    | 451k (455k)           |
| randrw 50/50      | 227k/227k (300k/300k) |
| randrw sync 50/50 | 227k/227k (300k/300k) |
| randread          | 773k (866k)           |
| randread sync     | 773k (861k)           |
+-------------------+-----------------------+

dm-thin blkdev with all data allocated (16GiB):
+-------------------+-----------------------+-----------------------+
|                   | Pre Patch             | Post Patch            |
+-------------------+-----------------------+-----------------------+
| randwrite         | 438k (442k)           | 450k (453k)           |
| randwrite sync    | 204k (228k)           | 450k (454k)           |
| randrw 50/50      | 224k/224k (236k/235k) | 225k/225k (234k/234k) |
| randrw sync 50/50 | 191k/191k (199k/197k) | 225k/225k (235k/235k) |
| randread          | 650k (703k)           | 661k (705k)           |
| randread sync     | 659k (705k)           | 661k (707k)           |
+-------------------+-----------------------+-----------------------+
There is a notable improvement in random write performance with sync
compared to the pre-patch results. In the 50/50 sync test there is also
a boost in random read, since resources are freed up for reading. No
other workloads appear to be affected.

dm-thin blkdev without allocated data, with a capacity of 1.6TB (to
increase the chance of randomly hitting a non-allocated block):
+-------------------+-------------------------+------------------------+
|                   | Pre Patch               | Post Patch             |
+-------------------+-------------------------+------------------------+
| randwrite         | 116k (253k)             | 112k (240k)            |
| randwrite sync    | 100k (121k)             | 182k (266k)            |
| randrw 50/50      | 66.7k/66.7k (109k/109k) | 67k/67k (109k/109k)    |
| randrw sync 50/50 | 76.9k/76.8k (101k/101k) | 77.6k/77.6k (122k/122k)|
| randread          | 336k (349k)             | 335k (352k)            |
| randread sync     | 334k (351k)             | 336k (348k)            |
+-------------------+-------------------------+------------------------+
In this case there is no marked difference, with the exception of
random write sync, since the unmapped data path is unchanged. The
improvement in random write sync can be explained by random IOs hitting
the same blocks twice within the test run (the second time they are
already mapped).

-----------------------------------------------------

Tests:
I ran the thin tests from https://github.com/jthornber/dmtest-python.
I ran xfstests (https://github.com/kdave/xfstests) on top of a thin LVM
volume.

I also conducted a manual data integrity test:
* Constructed a layout with nvme target -> dm-thin -> nvme device.
* Used vdbench from an initiator host to write to this remote nvme
  device, journaling to a local drive.
* Initiated a reboot on the media host.
* Verified the data using vdbench once the reboot process finished.

Signed-off-by: Yarden Maymon <yarden.maymon@volumez.com>
---
 drivers/md/dm-thin.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 07c7f9795b10..ecd429260bee 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2743,7 +2743,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
 		return DM_MAPIO_SUBMITTED;
 	}
 
-	if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
+	if (bio_op(bio) == REQ_OP_DISCARD) {
 		thin_defer_bio_with_throttle(tc, bio);
 		return DM_MAPIO_SUBMITTED;
 	}
-- 
2.25.1

