On Tue, 8 Mar 2022, Nikos Tsironis wrote:

> My work focuses mainly on improving the IOPs and latency of the
> dm-snapshot target, in order to bring the performance of short-lived
> snapshots as close as possible to bare-metal performance.
>
> My initial performance evaluation of dm-snapshot had revealed a big
> performance drop, while the snapshot is active; a drop which is not
> justified by COW alone.
>
> Using fio with blktrace I had noticed that the per-CPU I/O distribution
> was uneven. Although many threads were doing I/O, only a couple of the
> CPUs ended up submitting I/O requests to the underlying device.
>
> The same issue also affects dm-clone, when doing I/O with sizes smaller
> than the target's region size, where kcopyd is used for COW.
>
> The bottleneck here is kcopyd serializing all I/O. Users of kcopyd, such
> as dm-snapshot and dm-clone, cannot take advantage of the increased I/O
> parallelism that comes with using blk-mq in modern multi-core systems,
> because I/Os are issued only by a single CPU at a time, the one on which
> kcopyd’s thread happens to be running.
>
> So, I experimented redesigning kcopyd to prevent I/O serialization by
> respecting thread locality for I/Os and their completions. This made the
> distribution of I/O processing uniform across CPUs.
>
> My measurements had shown that scaling kcopyd, in combination with
> scaling dm-snapshot itself [1] [2], can lead to an eventual performance
> improvement of ~300% increase in sustained throughput and ~80% decrease
> in I/O latency for transient snapshots, over the null_blk device.
>
> The work for scaling dm-snapshot has been merged [1], but,
> unfortunately, I haven't been able to send upstream my work on kcopyd
> yet, because I have been really busy with other things the last couple
> of years.
>
> I haven't looked into the details of copy offload yet, but it would be
> really interesting to see how it affects the performance of random and
> sequential workloads, and to check how, and if, scaling kcopyd affects
> the performance, in combination with copy offload.
>
> Nikos

Hi

Note that you must submit kcopyd callbacks from a single thread, otherwise
there's a race condition in snapshot.

The snapshot code doesn't take locks in the copy_callback and it expects
that the callbacks are serialized.

Maybe adding locks to copy_callback would solve it.

Mikulas
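
To illustrate the locking idea, here is a minimal userspace sketch using
pthreads. It is only an analogy and not the dm-snapshot code: the names
demo_snapshot, demo_copy_callback, demo_worker and demo_run are hypothetical,
and the shared counter merely stands in for whatever state a completion
callback updates. The point is that once the callback takes a lock around its
shared state, it no longer depends on callbacks being delivered from a single
thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_MAX_THREADS 16

/* Hypothetical stand-in for state that a completion callback updates. */
struct demo_snapshot {
	pthread_mutex_t lock;     /* protects 'completed' */
	unsigned long completed;  /* number of callbacks that have run */
};

/*
 * Hypothetical completion callback. Without the lock, the increment is a
 * read-modify-write that is only safe if callers are serialized. With the
 * lock, it may be invoked from several threads concurrently.
 */
static void demo_copy_callback(struct demo_snapshot *s)
{
	pthread_mutex_lock(&s->lock);
	s->completed++;
	pthread_mutex_unlock(&s->lock);
}

struct demo_worker_arg {
	struct demo_snapshot *snap;
	unsigned long ncallbacks;
};

/* Each worker plays the role of one thread delivering completions. */
static void *demo_worker(void *p)
{
	struct demo_worker_arg *arg = p;
	unsigned long i;

	for (i = 0; i < arg->ncallbacks; i++)
		demo_copy_callback(arg->snap);
	return NULL;
}

/*
 * Deliver 'ncallbacks' callbacks from each of 'nthreads' threads and
 * return the final counter value.
 */
unsigned long demo_run(unsigned int nthreads, unsigned long ncallbacks)
{
	struct demo_snapshot snap = { .completed = 0 };
	struct demo_worker_arg arg = { &snap, ncallbacks };
	pthread_t tids[DEMO_MAX_THREADS];
	unsigned int i;

	assert(nthreads <= DEMO_MAX_THREADS);
	pthread_mutex_init(&snap.lock, NULL);

	for (i = 0; i < nthreads; i++) {
		int rc = pthread_create(&tids[i], NULL, demo_worker, &arg);
		assert(rc == 0);
		(void)rc;
	}
	for (i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	pthread_mutex_destroy(&snap.lock);
	return snap.completed;
}
```

With the lock in place, demo_run(4, 10000) returns 40000 regardless of how the
threads interleave. Whether a lock of this kind would be enough in the real
copy_callback, and what it would cost in the completion path, would need to be
checked against the actual snapshot code.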