* To add, or not to add, a bio REQ_ROTATIONAL flag
@ 2016-07-29  0:50 Eric Wheeler
  2016-07-29  1:04 ` Wols Lists
  2016-07-29  1:16 ` Martin K. Petersen
  0 siblings, 2 replies; 4+ messages in thread
From: Eric Wheeler @ 2016-07-29  0:50 UTC (permalink / raw)
  To: linux-block; +Cc: dm-devel, linux-raid, linux-kernel, linux-bcache

Hello all,

With the many SSD caching layers being developed (bcache, dm-cache,
dm-writeboost, etc), how could we flag a bio from userspace to indicate
that the bio should preferentially hit spinning disks instead of an SSD?

Unnecessary promotions, evictions, and writeback increase the write burden
on the caching layer and burn out SSDs too fast (TBW), thus requiring
equipment replacement.

Is there already a mechanism for this that could be added to the various 
caching mechanisms' promote/demote/bypass logic?

For example, I would like to prevent backups from influencing the cache 
eviction logic. Neither do I wish to evict cache due to a bio from a 
backup process, nor do I wish a bio from the backup process to be cached 
on the SSD.  


We would want IO that is somehow flagged to bypass block-layer caches to go
straight to the rotational disk unless the referenced block already exists
on the SSD.

There might be two cases here that would be ideal to unify without 
touching filesystem code:

  1) open() of a block device

  2) open() on a file such that a filesystem must flag the bio

I had considered writing something to detect FADV_SEQUENTIAL/FADV_NOREUSE
or `ionice -c3` on a process hitting bcache and modifying
check_should_bypass()/should_writeback() to bypass or skip writeback
accordingly.
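
As a rough illustration (not a patch), something like the following could
gate those functions, assuming the submitter's I/O priority were already
plumbed into the bio; the bi_ioprio field below is hypothetical here and is
exactly the plumbing in question:

#include <linux/bio.h>
#include <linux/ioprio.h>

/*
 * Sketch only: treat idle-class (ionice -c3) IO as "do not pollute the
 * cache".  Assumes the submitter's ioprio is carried in a bi_ioprio
 * field, which is itself part of what would need to be added.
 */
static bool bio_prefers_rotational(struct bio *bio)
{
	return IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_IDLE;
}

check_should_bypass()/should_writeback() could then consult this helper
before running their normal heuristics.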

However, just because FADV_SEQUENTIAL is flagged doesn't mean the cache 
should bypass.  Filesystems can fragment, and while the file being read 
may be read sequentially, the blocks on which it resides may not be.  
Same thing for higher-level block devices such as dm-thinp where one might 
sequentially read a thin volume but its _tdata might not be in linear 
order.  This may imply that we need a new way to flag cache bypass from 
userspace that is neither io-priority nor fadvise driven.

So what are our options?  What might be the best way to do this?

If fadvise is the better option, how can a block device driver look up the
fadvise advice from a given bio struct?  Can we add an FADV_NOSSD flag,
since FADV_SEQUENTIAL may be insufficient?  Are FADV_NOREUSE/FADV_DONTNEED
reasonable candidates?

Perhaps ionice could be used, but the concept of "priority" doesn't
exactly encompass the concept of cache bypass---so is something else
needed?

Other ideas?  


--
Eric Wheeler


* Re: To add, or not to add, a bio REQ_ROTATIONAL flag
  2016-07-29  0:50 To add, or not to add, a bio REQ_ROTATIONAL flag Eric Wheeler
@ 2016-07-29  1:04 ` Wols Lists
  2016-07-29  1:16 ` Martin K. Petersen
  1 sibling, 0 replies; 4+ messages in thread
From: Wols Lists @ 2016-07-29  1:04 UTC (permalink / raw)
  To: Eric Wheeler, linux-block
  Cc: dm-devel, linux-raid, linux-kernel, linux-bcache

On 29/07/16 01:50, Eric Wheeler wrote:
> Hello all,
> 
> With the many SSD caching layers being developed (bcache, dm-cache,
> dm-writeboost, etc), how could we flag a bio from userspace to indicate
> that the bio should preferentially hit spinning disks instead of an SSD?
> 
> Unnecessary promotions, evictions, and writeback increase the write burden
> on the caching layer and burn out SSDs too fast (TBW), thus requiring
> equipment replacement.

What's the spec of these devices? How long are they expected to last?

Other recent posts on this (linux-raid) mailing list refer to tests on
SSDs that indicate their typical life is way beyond their nominal life,
and that in normal usage they are actually likely to outlive "spinning
rust".

http://techreport.com/review/24841/introducing-the-ssd-endurance-experiment

http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead/3

Looking at the results, the FIRST drives only started failing once
they'd written some 700 Terabytes. How long is it going to take you to
write that much data over a SATA3 link?

Cheers,
Wol


* Re: To add, or not to add, a bio REQ_ROTATIONAL flag
  2016-07-29  0:50 To add, or not to add, a bio REQ_ROTATIONAL flag Eric Wheeler
  2016-07-29  1:04 ` Wols Lists
@ 2016-07-29  1:16 ` Martin K. Petersen
  2016-08-01  2:58   ` Eric Wheeler
  1 sibling, 1 reply; 4+ messages in thread
From: Martin K. Petersen @ 2016-07-29  1:16 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: linux-block, dm-devel, linux-raid, linux-kernel, linux-bcache

>>>>> "Eric" == Eric Wheeler <bcache@lists.ewheeler.net> writes:

Eric,

Eric> However, just because FADV_SEQUENTIAL is flagged doesn't mean the
Eric> cache should bypass.  Filesystems can fragment, and while the file
Eric> being read may be read sequentially, the blocks on which it
Eric> resides may not be.  Same thing for higher-level block devices
Eric> such as dm-thinp where one might sequentially read a thin volume
Eric> but its _tdata might not be in linear order.  This may imply that
Eric> we need a new way to flag cache bypass from userspace that is
Eric> neither io-priority nor fadvise driven.

Why conflate the two? Something being a background task is orthogonal to
whether it is being read sequentially or not.

Eric> So what are our options?  What might be the best way to do this?

For the SCSI I/O hints I use the idle I/O priority to classify
backups. Works fine.
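
For reference, the idle class is what `ionice -c3` sets; from C it is the
ioprio_set() syscall with IOPRIO_CLASS_IDLE.  A minimal userspace sketch
(the constants mirror the kernel's ioprio.h, since glibc provides no
wrapper or header for them):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_IDLE	3
#define IOPRIO_WHO_PROCESS	1

int main(void)
{
	int prio = IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT;

	/* Mark the current process idle-class, then run the backup. */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio) < 0)
		return 1;

	/* ... exec the backup job here ... */
	return 0;
}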

Eric> Are FADV_NOREUSE/FADV_DONTNEED reasonable candidates?

FADV_DONTNEED was intended for this. There have been patches posted in
the past that tied the loop between the fadvise flags and the bio. I
would like to see those revived.

Eric> Perhaps ionice could be used, but the concept of "priority"
Eric> doesn't exactly encompass the concept of cache-bypass---so is
Eric> something else needed?

The idle class explicitly does not have a priority.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: To add, or not to add, a bio REQ_ROTATIONAL flag
  2016-07-29  1:16 ` Martin K. Petersen
@ 2016-08-01  2:58   ` Eric Wheeler
  0 siblings, 0 replies; 4+ messages in thread
From: Eric Wheeler @ 2016-08-01  2:58 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: linux-block, dm-devel, linux-raid, linux-kernel, linux-bcache,
	hurikhan77, antlists, Dan Williams, Jason B. Akers, Kapil Karkra,
	Jens Axboe, Jeff Moyer, david

[+cc from "Enable use of Solid State Hybrid Drives"
	https://lkml.org/lkml/2014/10/29/698 ]

On Thu, 28 Jul 2016, Martin K. Petersen wrote:
> >>>>> "Eric" == Eric Wheeler <bcache@lists.ewheeler.net> writes:
> Eric> [...]  This may imply that
> Eric> we need a new way to flag cache bypass from userspace [...]
> Eric> So what are our options?  What might be the best way to do this?
[...] 
> Eric> Are FADV_NOREUSE/FADV_DONTNEED reasonable candidates?
> 
> FADV_DONTNEED was intended for this. There have been patches posted in
> the past that tied the loop between the fadvise flags and the bio. I
> would like to see those revived.

That sounds like a good start, this looks about right from 2014:
	https://lkml.org/lkml/2014/10/29/698
	https://lwn.net/Articles/619058/

I read through the thread and have summarized the relevant parts here 
with additional commentary below the summary:

/* Summary 

They were seeking to do basically the same thing in 2014 that we want with
stacked block caching drivers today: hint to the IO layer so the (ATA 3.2)
driver can decide whether a block should hit the cache or the spinning
disk.  This was done by adding bitflags to ioprio for IOPRIO_ADV_ advice.

There are two arguments throughout the thread: one that the cache hint 
should be per-process (ionice) and the other, that hints should be per 
inode via fadvise (and maybe madvise).  Dan Williams noted with respect to 
fadvise for their implementation that "It's straightforward to add, but I 
think "80%" of the benefit can be had by just having a per-thread cache 
priority."

Kapil Karkra extended the page flags so the ioprio advice bits can be
copied into bio->bi_rw, about which Jens said it "is a bit...icky. I see
why it's done, though, it requires the least amount of plumbing."

Martin K. Petersen provides a matrix of desires for actual use cases here:
	https://lkml.org/lkml/2014/10/29/1014 
and asks "Are there actually people asking for sub-file granularity? I 
didn't get any requests for that in the survey I did this summer. [...] In 
any case I thought it was interesting that pretty much every use case that 
people came up with could be adequately described by a handful of I/O 
classes."

Further, Jens notes that "I think we've needed a proper API for passing in 
appropriate hints on a per-io basis for a LONG time. [...] We've tried 
(and failed) in the past to define a set of hints that make sense. It'd be 
a shame to add something that's specific to a given transport/technology. 
That said, this set of hints do seem pretty basic and would not 
necessarily be a bad place to start. But they are still very specific to 
this use case."
*/


So, perhaps it is time to plan the hint API and figure out how to plumb 
it.  These are some design considerations based on the thread:

a. People want per-process cache hinting (ionice, or some other tool).
b. Per inode+range hinting would be useful to some (fadvise, ioctl, etc)
c. Don't use page flags to convey cache hints---or find a clean way to do so.
d. Per IO hints would be useful to both stacking and hardware drivers.
e. Cache layers will make their own device-assignment choice based on the
caching hint; for example, an IO flagged to miss the cache might still hit
if the block is already cached due to unrelated IO.  Such determinations
would be made per cache implementation (a strawman per-IO hint enum
follows this list).
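
As a strawman for (d), a per-IO hint might be no more than a small enum
carried alongside the bio.  The names below are purely illustrative; no
such flags exist today:

/* Illustrative only -- not an existing kernel API. */
enum bio_cache_hint {
	BIO_CACHE_HINT_NONE = 0,	/* no preference; each layer decides */
	BIO_CACHE_HINT_BYPASS,		/* prefer the slow/rotational medium */
	BIO_CACHE_HINT_PREFER_FAST,	/* prefer the SSD/cache medium */
	BIO_CACHE_HINT_EVICT,		/* drop the block from cache after IO */
};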


I can see this going two ways:

1. A dedicated implementation for cache hinting.
2. An API for generalized hinting, upon which cache hinting may be 
implemented.

To consider #2, what hinting is wanted from processes and inodes down to
bios?  Does it justify an entire API for generalized hinting, or do we
just need a cache hinting implementation?  If we do want #2, then what are
all of the features wanted by the community so it can be designed
accordingly?

If #1 is sufficient, then what is the preferred mechanism and 
implementation for cache hinting?

In either direction, how can those hints pass down to bios in an
appropriate way (i.e., not page flags)?


With the interest of a cache hinting implementation independent of 
transport/technology, I have been playing with an idea to use two per-IO 
"TTL" counters, both of which tend toward zero; I've not yet started an 
implementation:

cacheskip: 
	Decrement until zero to skip cache layers (slow medium)
	Ignore cachedepth until cacheskip==0.
	
cachedepth:
	Initialize to positive, negative, or zero value.  Once zero, no 
	special treatment is given to the IO.  When less than zero, prefer the 
	slower medium.  When greater than zero, prefer the faster medium.  
	Inc/decrement toward zero each time the IO passes through a 
	caching layer.
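
Expressed as a structure, purely to pin down the semantics (field names
are hypothetical, not a proposed kernel struct):

struct cache_hint {
	u8 cacheskip;	/* cache layers still to be skipped outright */
	s8 cachedepth;	/* >0 prefer fast, <0 prefer slow, 0 no preference */
};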

Independent of how we might apply these counters to a pid/inode, the cache 
layers might look something like this:

cachedepth	description
  0		direct IO
+-1		pagecache
+-2 .. +-n	arbitrary caching layers, in driver stacking order

Layers beyond the pagecache are assigned arbitrarily by the driver 
stacking order implemented by the end user. For example, if passing 
through dm-cache, then dm-cache would use its own preference logic to 
decide whether it should cache or not if cachedepth is zero.  If nonzero,
it would cache/bypass appropriately and then increment/decrement cachedepth
toward zero after making its decision.  Understandably, extenuating
circumstances may require a layer to ignore the hint---such as a 
bypass-hinted IO that gets cached because it is already hot.
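
A caching layer's decision path might then look roughly like this, where
wants_to_cache() stands in for whatever promote/bypass heuristics the
layer already has (sketch only):

static bool layer_should_cache(struct cache_hint *hint, struct bio *bio)
{
	if (hint->cacheskip) {
		hint->cacheskip--;	/* still skipping layers outright */
		return false;
	}

	if (hint->cachedepth > 0) {
		hint->cachedepth--;	/* prefer the fast medium here */
		return true;
	}

	if (hint->cachedepth < 0) {
		hint->cachedepth++;	/* prefer the slow medium here */
		/* a layer may still override this if the block is already hot */
		return false;
	}

	return wants_to_cache(bio);	/* no hint left: layer decides */
}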

Consider the following scenarios for this contrived cache stack:

1. pagecache
2. dm-cache
3. bcache
4. HBA supporting cache hints (ATA 3.2, perhaps)

cacheskip	cachedepth	description
-------------------------------------------
	0		0	use pagecache; lower layers do what they want
	1		0	skip pagecache (direct IO); lower layers do what they want
	0		-1	same as previous
	2		1	skip pagecache, dmcache; prefer bcache-ssd
	0		-3	skip pagecache; dmcache bypass; bcache bypass
	1		2	skip pagecache; prefer dmcache-ssd, prefer bcache-ssd
	3		1	hint to prefer HBA cache only

This would empower the user to decide where caching should begin, and for 
how many layers caching should hint for slow(-) or fast(+) backing devices 
before letting the IO stack make its own hintless choice.  Hopefully this 
lets each layer make their own choices that best fit their implementation.

Note that this would not support multi-device tiering as written.  If some 
layer supports multiple IO performance tiers (more than 2) at the same 
layer, then this hinting algorithm is insufficient unless a 
cache-layer-specific data structure could be passed with the IO hinting 
request.  Also, an eviction hint is not supported by this model.


Please comment with your thoughts.  I look forward to feedback and 
implementation ideas for what would be the best way to plumb cache hinting 
for whatever implementation is chosen.

--
Eric Wheeler

