All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nikos Tsironis <ntsironis@arrikto.com>
To: Mike Snitzer <snitzer@redhat.com>, Mikulas Patocka <mpatocka@redhat.com>
Cc: dm-devel@redhat.com, agk@redhat.com, iliastsi@arrikto.com
Subject: Re: [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme
Date: Tue, 19 Feb 2019 13:04:51 +0200	[thread overview]
Message-ID: <425d7efe-ab3f-67be-264e-9c3b6db229bc@arrikto.com> (raw)
In-Reply-To: <20190218151530.GA7852@redhat.com>

On 2/18/19 5:15 PM, Mike Snitzer wrote:
> On Mon, Feb 18 2019 at  9:37am -0500,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
>>
>>
>> On Mon, 18 Feb 2019, Nikos Tsironis wrote:
>>
>>> Hello,
>>>
>>> This is a kind reminder for this patch set. I'm bumping this thread to
>>> solicit your feedback. Please let me know what else I may need to do for
>>> these patches to land in the upcoming merge window for 5.1.
>>>
>>> Looking forward to your feedback,
>>>
>>> Nikos
>>
>> I don't know what to do with those patches.
>>
>> The patches make it more complicated, so I can't really say if they are 
>> correct or not.
> 
> As I mentioned to you before: please do the best you can reasoning
> through (and even quantifying on your own) the performance reward these
> patches provide.
> 
> Then we'll need to make a call on risk vs reward.
> 
> Nikos mentioned in another mail on Friday that this dm-snapshot patchset
> is a prereq for more dm-snapshot changes he has.  Nikos, if you could
> forecast what those additional changes to dm-snapshot are that could
> help inform the review process for this initial patchset.

Hello Mike, hello Mikulas,

Thank you both for your feedback.

I understand your desire to ensure the patches continue to maintain the
high level of quality in the kernel and I agree completely. As part of
this effort, I have been contributing to the device mapper test suite,
extending it with a test suite for dm-snapshot. Please see the relevant
Pull Requests [1], [2]. Actually, it was this process of me trying to
understand the current functioning of the code in depth and improve its
testing, which helped me identify some minor dm-snapshot bugs previously
[3], [4].

So, I definitely agree with your effort to maintain high quality for
kernel code;  extending the device mapper test suite was my very first
step in facilitating my own coding process for these patches, and I
would be happy to implement any further suggestions for testing you
may have.

The long-term goal of my current and upcoming patches is to make
dm-snapshot usable with modern, low-latency NVMe devices, make it
scalable on multi-core machines, and reduce its overhead significantly.

I understand a lot of effort has been put to make dm-thin performant,
but we are particularly interested in improving the performance of
dm-snapshot and its usability on modern hardware, because it has distinct
advantages over using dm-thin for our specific use cases:

  * Short-lived snapshots, for consistency, where I/O performance to the
    original device degrades only while the snapshot is active, and
    returns back to its original level after deleting the snapshot

  * Taking a consistent snapshot of any volume, while still being able
    to write to the origin volume, i.e., maintaining the snapshot in an
    external device, independently of the origin, without the need to
    make the original volume a dm-thin device (versus dm-thin’s external
    origin snapshots)

Actually, we are using dm-snapshot heavily right now for maintaining
ephemeral transient snapshot for backups. This is the only way to have
close to bare metal performance, without dm-thin’s permanent degradation
(at least until the origin volume is completely overwritten).

So, to give you an overall idea of the end-to-end story and where the
current patch set fits:

My initial performance evaluation of dm-snapshot revealed a big
performance drop, while the snapshot is active; a drop which is not
justified by COW alone. Using fio with blktrace I noticed that the
per-CPU I/O distribution was uneven. Although many threads were doing
I/O, only a couple of the CPUs ended up submitting I/O requests to the
underlying device.

The bottleneck here is kcopyd serializing all I/O. The recent work with
blk-mq towards increasing parallelism of I/O processing in modern
multi-cores is essentially going to waste for any users of kcopyd,
including dm-snapshot, because I/Os are issued only by a single CPU at a
time, the one on which kcopyd’s thread happens to be running.

So, the first step I took was to redesign kcopyd to prevent I/O
serialization by respecting thread locality for I/Os and their
completion. This made distribution of I/O processing uniform across
CPUs, but then dm-snapshot’s single mutex prevented it from exploiting
kcopyd’s improved scalability.

Further investigation revealed that dm-snapshot itself has significant
CPU overhead that increases I/O latency far beyond that of an NVMe
device. As an example, working with an NVMe array capable of 700K write
IOPS with an average latency of ~350us per request, I was seeing 65K
write IOPS with an average latency of ~4ms per request with dm-snapshot.
This was due to lock contention in dm-snapshot itself, even before it
hit kcopyd.

This is where the current patch set fits. Replacing the single mutex
with a fine-grained locking scheme reduces lock contention and enables
dm-snaphsot to scale as the number of threads increases. For all the
details, please see the measurements I include as part of PATCH 3/3.
Also, replacing dm-snapshot's single mutex with fine-grained locking
makes it ready to exploit dm-kcopyd’s improved design, which I will
introduce in follow-up patches.

Finally, after these two patches, I would like to improve performance
specifically for transient snapshots. In the case of transient snapshots,
there is no need to complete exceptions in order
(commit 230c83afdd9cd - "dm snapshot: avoid snapshot space leak on crash")
or even sequentially. By taking advantage of the new kcopyd design, we
can complete them out-of-order and in parallel. This will also increase
IOPS significantly and slash I/O latency.

So, all in all, my end goal is to bring performance of short-lived
snapshots based on dm-snapshot as close as possible to bare-metal
performance.

To summarize, I see the following progression of patches:

  * Scale dm-snapshot (current patch set)
  * Scale kcopyd
  * Further improve transient snapshots by lifting the limitation of
    in-order and sequential completion

My current measurements show that these patch series lead to an eventual
performance improvement of ~300% increase in sustained throughput and
~80% decrease in I/O latency for transient snapshots, over the null_blk
device. 

Especially for this patch set, please find detailed measurements as part
of PATCH 3/3.

Mike, it would be great, as you say, if you could run the test suite
yourselves, and reproduce the results.

This was a rather long email, but I hope it makes the end-to-end purpose
of these patches clearer, and quantifies the reward in increased
throughput, decreased latency, and improved efficiency of dm-snapshot on
modern hardware.

I would be more than happy to continue the conversation and focus on any
other questions you may have, as you continue with reviewing these
patches.

Thanks,
Nikos.

[1] https://github.com/jthornber/device-mapper-test-suite/pull/51
[2] https://github.com/jthornber/device-mapper-test-suite/pull/52
[3] https://www.redhat.com/archives/dm-devel/2018-October/msg00460.html
[4] https://www.redhat.com/archives/dm-devel/2018-October/msg00461.html

      reply	other threads:[~2019-02-19 11:04 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-20 18:06 [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme Nikos Tsironis
2018-12-20 18:06 ` [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers Nikos Tsironis
2019-02-28 21:32   ` Mike Snitzer
2019-02-28 21:34     ` Mike Snitzer
2019-03-11 18:16     ` Christoph Hellwig
2019-03-11 22:13       ` Paul E. McKenney
2019-03-11 22:43         ` Mike Snitzer
2019-03-14  0:25           ` Paul E. McKenney
2019-03-13 23:48     ` Paul E. McKenney
2019-03-14  0:30       ` Mike Snitzer
2019-03-14 13:28         ` [dm-devel] " Nikos Tsironis
2019-03-14 13:28           ` Nikos Tsironis
2019-03-14 14:07           ` [dm-devel] " Paul E. McKenney
2019-03-14 15:03             ` Paul E. McKenney
2019-03-17 11:52               ` Nikos Tsironis
2019-03-18 17:16                 ` Paul E. McKenney
2019-03-20 20:25                   ` Nikos Tsironis
2019-03-14 17:01             ` Nikos Tsironis
2018-12-20 18:06 ` [PATCH 2/3] dm snapshot: Don't sleep holding the snapshot lock Nikos Tsironis
2018-12-20 18:06 ` [PATCH 3/3] dm snapshot: Use fine-grained locking scheme Nikos Tsironis
2019-02-28 21:37   ` Mikulas Patocka
2019-03-01  9:15     ` Nikos Tsironis
2019-02-18 14:22 ` [PATCH 0/3] dm snapshot: Improve performance using a more " Nikos Tsironis
2019-02-18 14:37   ` Mikulas Patocka
2019-02-18 15:15     ` Mike Snitzer
2019-02-19 11:04       ` Nikos Tsironis [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=425d7efe-ab3f-67be-264e-9c3b6db229bc@arrikto.com \
    --to=ntsironis@arrikto.com \
    --cc=agk@redhat.com \
    --cc=dm-devel@redhat.com \
    --cc=iliastsi@arrikto.com \
    --cc=mpatocka@redhat.com \
    --cc=snitzer@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.