* To add, or not to add, a bio REQ_ROTATIONAL flag
@ 2016-07-29 0:50 Eric Wheeler
2016-07-29 1:04 ` Wols Lists
2016-07-29 1:16 ` Martin K. Petersen
0 siblings, 2 replies; 4+ messages in thread
From: Eric Wheeler @ 2016-07-29 0:50 UTC (permalink / raw)
To: linux-block; +Cc: dm-devel, linux-raid, linux-kernel, linux-bcache
Hello all,
With the many SSD caching layers being developed (bcache, dm-cache,
dm-writeboost, etc), how could we flag a bio from userspace to indicate
that it should prefer to hit spinning disks instead of an SSD?
Unnecessary promotions, evictions, and writeback increase the write burden
on the caching layer and burn out SSDs too fast (TBW), thus requiring
equipment replacement.
Is there already a mechanism for this that could be added to the various
caching mechanisms' promote/demote/bypass logic?
For example, I would like to prevent backups from influencing the cache
eviction logic. Neither do I wish to evict cache due to a bio from a
backup process, nor do I wish a bio from the backup process to be cached
on the SSD.
We would want to bypass the cache for IO that is somehow flagged to bypass
block-layer caches and use the rotational disk unless the referenced block
already exists on the SSD.
There might be two cases here that would be ideal to unify without
touching filesystem code:
1) open() of a block device
2) open() on a file such that a filesystem must flag the bio
I had considered writing something to detect FADV_SEQUENTIAL/FADV_NOREUSE
or `ionice -c3` on a process hitting bcache and modifying
check_should_bypass()/should_writeback() to behave as such.
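The userspace half of that experiment is already expressible today with
plain posix_fadvise(2). A minimal sketch of what a cache-friendly backup
reader might do (the function name and chunk size are my own; note that
today these hints stop at the page cache and are not propagated down to
the bio, which is exactly the gap under discussion):

```python
import os

def read_for_backup(path, chunk=1 << 20):
    """Read a file the way a cache-friendly backup process might.

    Caveat: these hints currently influence only the page cache; they
    are not propagated down to the bio.
    """
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        # Declare sequential access up front so readahead stays effective.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            data = os.read(fd, chunk)
            if not data:
                break
            total += len(data)
            # Drop the pages we just consumed; don't pollute the cache.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```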
However, just because FADV_SEQUENTIAL is flagged doesn't mean the cache
should bypass. Filesystems can fragment, and while the file being read
may be read sequentially, the blocks on which it resides may not be.
Same thing for higher-level block devices such as dm-thinp where one might
sequentially read a thin volume but its _tdata might not be in linear
order. This may imply that we need a new way to flag cache bypass from
userspace that is neither io-priority nor fadvise driven.
So what are our options? What might be the best way to do this?
If fadvise is the better option, how can a block device driver lookup the
fadvise advice from a given bio struct? Can we add an FADV_NOSSD flag
since FADV_SEQUENTIAL may be insufficient? Are FADV_NOREUSE/FADV_DONTNEED
reasonable candidates?
Perhaps ionice could be used, but the concept of "priority"
doesn't exactly encompass the concept of cache-bypass---so is something
else needed?
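For completeness, this is what the ionice route looks like from userspace
without the ionice binary, via ioprio_set(2). A sketch only: the syscall
number 251 is x86-64 specific (other architectures differ), and the real
question of how the class reaches the bio remains open:

```python
import ctypes
import os

# Constants from linux/ioprio.h.
IOPRIO_WHO_PROCESS = 1
IOPRIO_CLASS_IDLE = 3
IOPRIO_CLASS_SHIFT = 13

def ioprio_value(ioprio_class, data=0):
    # Pack the class and per-class data the way the kernel expects.
    return (ioprio_class << IOPRIO_CLASS_SHIFT) | data

def set_idle_ioprio(pid=0):
    """Put a process in the idle I/O class, like `ionice -c3 -p PID`.

    Assumption: 251 is ioprio_set(2) on x86-64 only.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.syscall(251, IOPRIO_WHO_PROCESS, pid,
                       ioprio_value(IOPRIO_CLASS_IDLE))
    if ret == -1:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
```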
Other ideas?
--
Eric Wheeler
* Re: To add, or not to add, a bio REQ_ROTATIONAL flag
From: Wols Lists @ 2016-07-29 1:04 UTC (permalink / raw)
To: Eric Wheeler, linux-block
Cc: dm-devel, linux-raid, linux-kernel, linux-bcache
On 29/07/16 01:50, Eric Wheeler wrote:
> Hello all,
>
> With the many SSD caching layers being developed (bcache, dm-cache,
> dm-writeboost, etc), how could we flag a bio from userspace to indicate
> that it should prefer to hit spinning disks instead of an SSD?
>
> Unnecessary promotions, evictions, and writeback increase the write burden
> on the caching layer and burn out SSDs too fast (TBW), thus requiring
> equipment replacement.
What's the spec of these devices? How long are they expected to last?
Other recent posts on this (linux-raid) mailing list refer to tests on
SSDs that indicate their typical life is way beyond their nominal life,
and that in normal usage they are actually likely to outlive "spinning
rust".
http://techreport.com/review/24841/introducing-the-ssd-endurance-experiment
http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead/3
Looking at the results, the FIRST drives only started failing once
they'd written some 700 Terabytes. How long is it going to take you to
write that much data over a SATA3 link?
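Back of the envelope (the link rate and the "typical" sustained average
are my assumptions, not measurements):

```python
def days_to_write(total_bytes, bytes_per_second):
    # Time to push a given volume of writes at a sustained rate.
    return total_bytes / bytes_per_second / 86400

# 700 TB at a saturated SATA3 link (~550 MB/s sustained):
saturated = days_to_write(700e12, 550e6)   # roughly two weeks, flat out

# At an assumed 10 MB/s sustained average for a mixed workload:
typical = days_to_write(700e12, 10e6)      # on the order of years
```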
Cheers,
Wol
* Re: To add, or not to add, a bio REQ_ROTATIONAL flag
From: Martin K. Petersen @ 2016-07-29 1:16 UTC (permalink / raw)
To: Eric Wheeler
Cc: linux-block, dm-devel, linux-raid, linux-kernel, linux-bcache
>>>>> "Eric" == Eric Wheeler <bcache@lists.ewheeler.net> writes:
Eric,
Eric> However, just because FADV_SEQUENTIAL is flagged doesn't mean the
Eric> cache should bypass. Filesystems can fragment, and while the file
Eric> being read may be read sequentially, the blocks on which it
Eric> resides may not be. Same thing for higher-level block devices
Eric> such as dm-thinp where one might sequentially read a thin volume
Eric> but its _tdata might not be in linear order. This may imply that
Eric> we need a new way to flag cache bypass from userspace that is
Eric> neither io-priority nor fadvise driven.
Why conflate the two? Something being a background task is orthogonal to
whether it is being read sequentially or not.
Eric> So what are our options? What might be the best way to do this?
For the SCSI I/O hints I use the idle I/O priority to classify
backups. Works fine.
Eric> Are FADV_NOREUSE/FADV_DONTNEED reasonable candidates?
FADV_DONTNEED was intended for this. There have been patches posted in
the past that tied the loop between the fadvise flags and the bio. I
would like to see those revived.
Eric> Perhaps ionice could be used, but the concept of "priority"
Eric> doesn't exactly encompass the concept of cache-bypass---so is
Eric> something else needed?
The idle class explicitly does not have a priority.
--
Martin K. Petersen Oracle Linux Engineering
* Re: To add, or not to add, a bio REQ_ROTATIONAL flag
From: Eric Wheeler @ 2016-08-01 2:58 UTC (permalink / raw)
To: Martin K. Petersen
Cc: linux-block, dm-devel, linux-raid, linux-kernel, linux-bcache,
hurikhan77, antlists, Dan Williams, Jason B. Akers, Kapil Karkra,
Jens Axboe, Jeff Moyer, david
[+cc from "Enable use of Solid State Hybrid Drives"
https://lkml.org/lkml/2014/10/29/698 ]
On Thu, 28 Jul 2016, Martin K. Petersen wrote:
> >>>>> "Eric" == Eric Wheeler <bcache@lists.ewheeler.net> writes:
> Eric> [...] This may imply that
> Eric> we need a new way to flag cache bypass from userspace [...]
> Eric> So what are our options? What might be the best way to do this?
[...]
> Eric> Are FADV_NOREUSE/FADV_DONTNEED reasonable candidates?
>
> FADV_DONTNEED was intended for this. There have been patches posted in
> the past that tied the loop between the fadvise flags and the bio. I
> would like to see those revived.
That sounds like a good start; these look about right from 2014:
https://lkml.org/lkml/2014/10/29/698
https://lwn.net/Articles/619058/
I read through the thread and have summarized the relevant parts here
with additional commentary below the summary:
/* Summary
They were seeking to do basically the same thing in 2014 that we want with
stacked block caching drivers today: hint to the IO layer so the (ATA 3.2)
driver can decide whether a block should hit the cache or the spinning disk.
This was done by adding bitflags to ioprio for IOPRIO_ADV_ advice.
There are two arguments throughout the thread: one that the cache hint
should be per-process (ionice) and the other, that hints should be per
inode via fadvise (and maybe madvise). Dan Williams noted with respect to
fadvise for their implementation that "It's straightforward to add, but I
think "80%" of the benefit can be had by just having a per-thread cache
priority."
Kapil Karkra extended the page flags so the ioprio advice bits can be
copied into bio->bi_rw, to which Jens said "is a bit...icky. I see why
it's done, though, it requires the least amount of plumbing."
Martin K. Petersen provides a matrix of desires for actual use cases here:
https://lkml.org/lkml/2014/10/29/1014
and asks "Are there actually people asking for sub-file granularity? I
didn't get any requests for that in the survey I did this summer. [...] In
any case I thought it was interesting that pretty much every use case that
people came up with could be adequately described by a handful of I/O
classes."
Further, Jens notes that "I think we've needed a proper API for passing in
appropriate hints on a per-io basis for a LONG time. [...] We've tried
(and failed) in the past to define a set of hints that make sense. It'd be
a shame to add something that's specific to a given transport/technology.
That said, this set of hints do seem pretty basic and would not
necessarily be a bad place to start. But they are still very specific to
this use case."
*/
So, perhaps it is time to plan the hint API and figure out how to plumb
it. These are some design considerations based on the thread:
a. People want per-process cache hinting (ionice, or some other tool).
b. Per inode+range hinting would be useful to some (fadvise, ioctl, etc)
c. Don't use page flags to convey cache hints---or find a clean way to do so.
d. Per IO hints would be useful to both stacking and hardware drivers.
e. Cache layers will implement their own device assignment choice based
on the caching hint; for example, an IO flagged to miss the cache might
hit if already in cache due to unrelated IO and such a determination would
be made per-cache-implementation.
I can see this go two ways:
1. A dedicated implementation for cache hinting.
2. An API for generalized hinting, upon which cache hinting may be
implemented.
To consider #2, what hinting is wanted from processes and inodes down to
bio's? Does it justify an entire API for generalized hinting, or do we
just need a cache hinting implementation? If we do want #2, then what are
all of the features wanted by the community so it can be designed as such?
If #1 is sufficient, then what is the preferred mechanism and
implementation for cache hinting?
In either direction, how can those hints pass down to bio's in an
appropriate way (ie, not page flags)?
With the interest of a cache hinting implementation independent of
transport/technology, I have been playing with an idea to use two per-IO
"TTL" counters, both of which tend toward zero; I've not yet started an
implementation:
cacheskip:
Decrement until zero to skip cache layers (slow medium)
Ignore cachedepth until cacheskip==0.
cachedepth:
Initialize to positive, negative, or zero value. Once zero, no
special treatment is given to the IO. When less than zero, prefer the
slower medium. When greater than zero, prefer the faster medium.
Inc/decrement toward zero each time the IO passes through a
caching layer.
Independent of how we might apply these counters to a pid/inode, the cache
layers might look something like this:
cachedepth   description
    0        direct IO
   +-1       pagecache
   +-2       some arbitrary
   +-3       caching
   +-4       driver
   +-n       ...
Layers beyond the pagecache are assigned arbitrarily by the driver
stacking order implemented by the end user. For example, if passing
through dm-cache, then dm-cache would use its own preference logic to
decide whether it should cache or not if cachedepth is zero. If nonzero,
it would cache/bypass appropriately and then inc/decrement cachedepth
toward zero after making its decision. Understandably, extenuating
circumstances may require a layer to ignore the hint---such as a
bypass-hinted IO that gets cached because it is already hot.
Consider the following scenarios for this contrived cache stack:
1. pagecache
2. dm-cache
3. bcache
4. HBA supporting cache hints (ATA 3.2, perhaps)
cacheskip  cachedepth  description
---------------------------------------------------------------
    0          0       use pagecache; lower layers do what they want
    1          0       skip pagecache (direct IO); lower layers do what they want
    0         -1       same as previous
    2          1       skip pagecache, dmcache; prefer bcache-ssd
    0         -3       skip pagecache; dmcache bypass; bcache bypass
    1          2       skip pagecache; prefer dmcache-ssd, prefer bcache-ssd
    3          1       hint to prefer HBA cache only
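As a sanity check on those semantics, a small sketch (names and the exact
decision strings are my own) that walks the two counters through the
contrived stack:

```python
def route(cacheskip, cachedepth, layers):
    """Walk an IO's hint counters through a stack of caching layers.

    Returns one (layer, decision) pair per layer: while cacheskip > 0
    the layer is skipped and cachedepth is ignored; afterwards a
    positive cachedepth prefers the fast medium, a negative one prefers
    the slow medium, and zero leaves the layer to its own
    promote/bypass logic. Both counters step toward zero as they go.
    """
    decisions = []
    for layer in layers:
        if cacheskip > 0:
            decisions.append((layer, "skip"))
            cacheskip -= 1
        elif cachedepth > 0:
            decisions.append((layer, "prefer-fast"))
            cachedepth -= 1
        elif cachedepth < 0:
            decisions.append((layer, "prefer-slow"))
            cachedepth += 1
        else:
            decisions.append((layer, "default"))
    return decisions

STACK = ["pagecache", "dm-cache", "bcache", "hba"]
```

For example, route(2, 1, STACK) skips the first two layers, prefers the
bcache SSD, and leaves the HBA to its default logic, matching the
"skip pagecache, dmcache; prefer bcache-ssd" row.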
This would empower the user to decide where caching should begin, and for
how many layers caching should hint for slow(-) or fast(+) backing devices
before letting the IO stack make its own hintless choice. Hopefully this
lets each layer make their own choices that best fit their implementation.
Note that this would not support multi-device tiering as written. If a
layer supports more than two IO performance tiers, then this hinting
algorithm is insufficient unless a
cache-layer-specific datastructure could be passed with the IO hinting
request. Also, an eviction hint is not supported by this model.
Please comment with your thoughts. I look forward to feedback and
implementation ideas for what would be the best way to plumb cache hinting
for whatever implementation is chosen.
--
Eric Wheeler