* Reg dm-cache-policy-smq
@ 2020-06-16 10:49 Lakshmi Narasimhan Sundararajan
  2020-06-17  9:18 ` Joe Thornber
  0 siblings, 1 reply; 4+ messages in thread
From: Lakshmi Narasimhan Sundararajan @ 2020-06-16 10:49 UTC (permalink / raw)
  To: lvm-devel

Hi!
I was browsing the dm-cache smq policy code and have a few observations;
can someone help me with them?

1/

static void update_promote_levels(struct smq_policy *mq)
{
        /*
         * If there are unused cache entries then we want to be really
         * eager to promote.
         */
        unsigned threshold_level = allocator_empty(&mq->cache_alloc) ?
                default_promote_level(mq) : (NR_HOTSPOT_LEVELS / 2u);

        threshold_level = max(threshold_level, NR_HOTSPOT_LEVELS);

^^^ threshold_level seems to always end up as NR_HOTSPOT_LEVELS; was min()
intended instead of max()?

Is this a bug or am I reading it wrong?
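
For illustration, here is a minimal user-space sketch of the arithmetic
(assuming NR_HOTSPOT_LEVELS is 64, as in the driver source; 8u below is
just a stand-in for a hypothetical default_promote_level() result):

#include <stdio.h>

#define NR_HOTSPOT_LEVELS 64u  /* assumed value from dm-cache-policy-smq.c */

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
        /* The two possible ternary results, both <= NR_HOTSPOT_LEVELS:
         * 8u stands in for a default_promote_level() return value,
         * 32u is NR_HOTSPOT_LEVELS / 2u.
         */
        unsigned inputs[] = { 8u, NR_HOTSPOT_LEVELS / 2u };
        unsigned i;

        for (i = 0; i < 2; i++)
                printf("input %2u: max() -> %u, min() -> %u\n",
                       inputs[i],
                       max(inputs[i], NR_HOTSPOT_LEVELS),  /* always 64 */
                       min(inputs[i], NR_HOTSPOT_LEVELS)); /* keeps the input */

        return 0;
}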

2/ Also, I see there aren't any tunables for smq. Probably that was
the original design goal. But I have been testing with cache drives of
sizes nearing 1TB on a server-class system running multiple containers.
I am seeing large IO latency, sometimes much worse than on the origin device.
Upon reading the code, I suspect it may be because an incoming
IO hits a block with an in-progress migration, thereby increasing IO
latency.

Would that be a possible scenario?

3/
As a rule of thumb, I am keeping the migration threshold at 100 times
the cache block size. So apart from controlling the cache block size, is
there any other way to control the IO latency on a cache miss?

Thanks




* Reg dm-cache-policy-smq
  2020-06-16 10:49 Reg dm-cache-policy-smq Lakshmi Narasimhan Sundararajan
@ 2020-06-17  9:18 ` Joe Thornber
  2020-06-19  7:50   ` Lakshmi Narasimhan Sundararajan
  0 siblings, 1 reply; 4+ messages in thread
From: Joe Thornber @ 2020-06-17  9:18 UTC (permalink / raw)
  To: lvm-devel

On Tue, Jun 16, 2020 at 04:19:29PM +0530, Lakshmi Narasimhan Sundararajan wrote:

> Is this a bug or am I reading it wrong?

I agree it looks strange; I'll do some benchmarking to see how setting
it to min() affects it.

> 2/ Also, I see there aren't any tunables for smq. Probably that was
> the original design goal. But I have been testing with cache drives of
> sizes nearing 1TB on a server-class system running multiple containers.
> I am seeing large IO latency, sometimes much worse than on the origin device.
> Upon reading the code, I suspect it may be because an incoming
> IO hits a block with an in-progress migration, thereby increasing IO
> latency.
> 
> Would that be a possible scenario?

Yes, this is very likely what is happening.  It sounds like your migration_threshold may
be set very high.  dm-cache is meant to be slow moving so I typically have it as a small
multiple of the block size (eg, 8).
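
To put rough numbers on that (assuming migration_threshold is counted in
512-byte sectors): with a 32k cache block (64 sectors), a multiple of 8 caps
in-flight copy IO at 512 sectors (256k), whereas 100 times the block size
would allow 6400 sectors (~3.1M) of copies to compete with foreground IO at
any moment.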

> 3/
> As a rule of thumb, I am keeping the migration threshold at 100 times
> the cache block size. So apart from controlling the cache block size, is
> there any other way to control the IO latency on a cache miss?

That seems v. high.

Depending on your IO load you may find dm-writeboost gives you better latency.

- Joe




* Reg dm-cache-policy-smq
  2020-06-17  9:18 ` Joe Thornber
@ 2020-06-19  7:50   ` Lakshmi Narasimhan Sundararajan
  2020-06-19 10:06     ` Joe Thornber
  0 siblings, 1 reply; 4+ messages in thread
From: Lakshmi Narasimhan Sundararajan @ 2020-06-19  7:50 UTC (permalink / raw)
  To: lvm-devel

Hi Joe,
Thank you for your reply.

I have a few followup questions, please do help me with my understanding.
1/ Does the configured migration threshold account for active IO migration
(writeback) of dirty cache blocks in addition to cache block migration
to/from the cache device?
My understanding is that the migration threshold only controls promotion and
demotion IO, and does not affect dirty IO writeback.
Although all of these get queued to the background worker thread, which
can only have 4K requests active at most, so there is a maximum limit on
the migration bandwidth from the origin device at any point in time.

2/ Reading the smq caching policy, I see that the cache policy is slow
to cache and has no mechanism to track sequential versus random traffic.
So the initial IO may never be cached. But does one rely on the cache hit
ratio being poor, so that the threshold for promotion is lowered, thereby
enabling hotspots to be promoted faster even under random access?
Do you have any simulation results over dm-cache-smq you can share with
me to help me understand smq behavior for random/sequential traffic
patterns?

3/ How does dm-writeboost compare for stability? I do not see it
integrated into mainline yet. How does LVM support it?

4/ There is also dm-writecache; is it stable? Is LVM ready to
use dm-writecache? Any idea which distro has it integrated and
available for use?
I see RHEL still reports it as experimental. I would love to hear your opinion.

Regards
LN








* Reg dm-cache-policy-smq
  2020-06-19  7:50   ` Lakshmi Narasimhan Sundararajan
@ 2020-06-19 10:06     ` Joe Thornber
  0 siblings, 0 replies; 4+ messages in thread
From: Joe Thornber @ 2020-06-19 10:06 UTC (permalink / raw)
  To: lvm-devel

On Fri, Jun 19, 2020 at 01:20:42PM +0530, Lakshmi Narasimhan Sundararajan wrote:
> Hi Joe,
> Thank you for your reply.
> 
> I have a few followup questions, please do help me with my understanding.
> 1/ Does the configured migration threshold account for active IO migration
> (writeback) of dirty cache blocks in addition to cache block migration
> to/from the cache device?
> My understanding is that the migration threshold only controls promotion and
> demotion IO, and does not affect dirty IO writeback.

Yes, looking at the code this seems to be the case.

> Although all of these get queued to the background worker thread, which
> can only have 4K requests active at most, so there is a maximum limit on
> the migration bandwidth from the origin device at any point in time.

One confusing aspect of the migration threshold is that it refers to the
maximum queued migration IO at any particular time, _not_ IO per second.
I think this makes it very unintuitive for sysadmins to set.  If I ever
do any more work on dm-cache then removing migration_threshold would be
my priority.
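
As a concrete illustration (with hypothetical numbers): migration_threshold
set to 2048 sectors with a 32k (64-sector) block size allows at most
2048/64 = 32 block copies to be queued at once; if those copies complete
quickly the policy can immediately queue more, so the setting bounds the
volume of in-flight copy IO, not the copy bandwidth per second.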

> 
> 2/ Reading the smq caching policy, I see that the cache policy is slow
> to cache and has no mechanism to track sequential versus random traffic.
> So the initial IO may never be cached. But does one rely on the cache hit
> ratio being poor, so that the threshold for promotion is lowered, thereby
> enabling hotspots to be promoted faster even under random access?
> Do you have any simulation results over dm-cache-smq you can share with
> me to help me understand smq behavior for random/sequential traffic
> patterns?

See below; in particular, the FIO tests are essentially random IO.
dm-cache used to have an io-tracker component that was used to assess
how sequential or random the io was and weight the promotion chances based
on that (spindles being good at sequential io).  But I took it out in the
end; benchmarks didn't show a particular benefit.


> 
> 3/ How does dm-writeboost compare for stability? I do not see it
> integrated into mainline yet. How does LVM support it?


Sorry, I meant writecache, there have been so many similarly named targets
over the years.  See below.

> 4/ There is also dm-writecache; is it stable? Is LVM ready to
> use dm-writecache? Any idea which distro has it integrated and
> available for use?

I believe LVM support will be in the next release of RHEL8.  It's coming
out of experimental state.  I did some benchmarking a few months ago
comparing it with dm-cache (see below).  My impressions are that it's
a solid implementation, and a lot simpler than dm-cache (so possibly
more predictable).  Its main drawback is that it is focussed on writes only.
I think there are still some features lacking in the LVM support compared
to dm-cache (Dave Teigland can give more info).


- Joe


Here's an internal email discussing benchmark results from Feb 2020:




More test results for writecache and dm-cache.

I'd hoped that we'd be able to give clear advice to our customers
about how to choose which cache to use.  But the results are mixed;
more discussion at the end of the email.

Git extract test
================

A simple test that completely killed the previous third party attempts
to write a 'writecache' target.

It creates a new fs on the cached device.  No discard is used by the mkfs,
because dm-cache tracks discarded regions and can get more performance
when writing data to a discarded region, which I feel is not indicative
of general performance.

Then a v. large git repo is cloned to the cached device.  This part is
purely write based (as far as the cache is concerned).

Then 20 different tags are checked out in the git repo.  This part is mixed
read/write load.  All reads are to areas that have been written to earlier 
in the test.

I like to repeat the same test with a range of different 'fast' device
sizes (given in meg), starting well below the working set for the task
and ending up larger.

fast dev (M)   writecache           dm-cache
               clone    checkout    clone    checkout
64             31       366         37.2     359.6
256            33       353         36.2     339.8
512            34       291         35       351.1
1024           30       244         30.9     212.6
1536           28       242         26.6     147.4
2048           25       240         23.7     118.1
4096           21       110         20.8     79.6
8192           22       88
16384          21       90

               clone    checkout
raw NVMe       23       76


The dm-cache results are as I would expect.  If the fast device is tiny
compared to the working set then we get poor performance (which could
be tweaked by reducing the migration_threshold tunable).  But as the
available fast device goes up we see real value.

I'd expected writecache to do better here, since we only ever read what's
just been written.  But I think the volume of writes is such that the fast
device is filling up and forcing writecache to writeback before it can cache
any more writes.  It's rare (artificial) for writecache to need more space
than dm-thin. 



Git extract only
================

Like the previous test except the mkfs and git clone are performed on the
origin, and then the caches are attached.  This means the reads are generally
not to areas that have previously been written to.

I've run the checkout part twice to see how the caches adapt (dm-cache is a
slow moving cache after all).


fast dev (M)   writecache           dm-cache
               Pass 1   Pass 2      Pass 1   Pass 2
256            355      365         335.8    351.1
512            290      305         320.8    345.4
1024           242      254         190      170.4
1536           241      242         150.6    98.6
2048           240      238         150.1    100.1
4096           240      239         154.5    101.1

You can see dm-cache adapting nicely here.



FIO benchmarks
==============

I also have some standard FIO tests that I run.  One profile was given
to me by the perf team and is meant to simulate a database workload
(random 8k io, biased to some regions).

dm-cache uses a 32k block size, so the 8k ios will force a full copy when
a block is promoted to the fast device.
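
In other words, each 8k io that triggers a promotion costs a 32k read from
the origin plus a 32k write to the fast device, roughly 8x the foreground io
size, before that block can be served from the fast device.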

I run fio twice to see how the caches warm up.


100% read
---------

fast dev (M)   writecache (s)       dm-cache (s)
               Pass 1   Pass 2      Pass 1   Pass 2
128            241      230         190      162
256            239      230         169      146
512            230      230         159      111
1024           230      230         110      13.4
2048           230      230         103      4.8
4096           230      230         103      4.4
8192           230      230         104      4.7

Obviously this is totally unfair to writecache.


50% read/write
---------------

fast dev (M)   writecache (s)       dm-cache (s)
               Pass 1   Pass 2      Pass 1   Pass 2
128            127      131         213      181
256            101      108         211      189
512            71       71          173      108
1024           62       46          130      19
2048           62       46          111      6
4096           62       46          109      5.8
8192           62       46          110      6.1

writecache wins on the first pass while dm-cache has been frantically
promoting blocks to the fast device.  dm-cache gets its payoff
on the second pass.


100% write
----------

fast dev (M)   writecache (s)       dm-cache (s)
               Pass 1   Pass 2      Pass 1   Pass 2
128            88.7     107         232      201
256            59       96          225      209
512            9.6      72          185      112
1024           2.3      2.5         127      24
2048           2.6      2.4         113      2.7
4096           2.4      2.4         113      2.6
8192           2.4      2.6         114      2.7

writecache's time to shine.


How do you decide which cache to use?
=====================================

This isn't easy to answer.  Let's play 20 questions instead (questions
should be answered in order).


1. Do you need writethrough mode?   --- Yes --->    Use dm-cache

2. Do you repeatedly do IO to the same parts of the disk?   --- Yes --->   Use dm-cache

  For instance your server may be constantly hitting the same database
  tables.

  Hot spots are really dm-cache's thing.  For instance, I set up a
  cache with an 8G NVMe fast device and a 16G origin and then repeatedly
  zeroed the first 1G of the cached device.  You'd think this plays to
  writecache's strengths, but the timings on JT's machine are:

    writecache: 0.88, 1.37, 1.37, 1.37 ...
    dm-cache:   0.91, 0.86, 0.86, 0.87 ...

  writecache is doing great here (spindle would be ~5 seconds).  But it can't
  compete with dm-cache which has just moved the first gig to the fast dev.

3. Is the READ working set small enough to fit in the page cache?  --- Yes --->   Use writecache  

  writecache and the page cache work together.  If the page cache is supplying all your
  read caching needs then you're just left with write io.


Other things to consider:

- Do you use applications that skip the page cache?

  For instance databases often use O_DIRECT, libaio and manage their own read
  caches.




