* Re: [PATCH 1/3] bcache: introduce bcache sysfs entries for ioprio-based bypass/writeback hints
       [not found]     ` <CAC2ZOYt+ZMep=PT5FbQKiqZ0EE1f4+JJn=oTJUtQjLwGvy=KfQ@mail.gmail.com>
@ 2020-10-05 19:41       ` Eric Wheeler
  2020-10-06 12:28         ` Nix
  2020-10-07 20:35         ` Eric Wheeler
  0 siblings, 2 replies; 8+ messages in thread
From: Eric Wheeler @ 2020-10-05 19:41 UTC (permalink / raw)
  To: Kai Krakow; +Cc: Nix, linux-bcache, linux-block

[+cc:bcache and blocklist]

On Sun, 4 Oct 2020, Kai Krakow wrote:

> Hey Nix!
> 
> Apparently, `git send-email` probably swallowed the patch 0/3 message for you.
> 
> It was about adding one additional patch which reduced boot time for
> me with idle mode active by a factor of 2.
> 
> You can look at it here:
> https://github.com/kakra/linux/pull/4
> 
> It's "bcache: Only skip data request in io_prio bypass mode" just if
> you're curious.
> 
> Regards,
> Kai
> 
> On Sun, 4 Oct 2020 at 15:19, Nix <nix@esperi.org.uk> wrote:
> >
> > On 3 Oct 2020, Kai Krakow spake thusly:
> >
> > > Having idle IOs bypass the cache can increase performance elsewhere
> > > since you probably don't care about their performance.  In addition,
> > > this prevents idle IOs from promoting into (polluting) your cache and
> > > evicting blocks that are more important elsewhere.
> >
> > FYI, stats from 20 days of uptime with this patch live in a stack with
> > XFS above it and md/RAID-6 below (20 days being the time since the last
> > reboot: I've been running this patch for years with older kernels
> > without incident):
> >
> > stats_total/bypassed: 282.2G
> > stats_total/cache_bypass_hits: 123808
> > stats_total/cache_bypass_misses: 400813
> > stats_total/cache_hit_ratio: 53
> > stats_total/cache_hits: 9284282
> > stats_total/cache_miss_collisions: 51582
> > stats_total/cache_misses: 8183822
> > stats_total/cache_readaheads: 0
> > written: 168.6G
> >
> > ... so it's still saving a lot of seeking. This is despite having
> > backups running every three hours (in idle mode), and the usual updatedb
> > runs, etc, plus, well, actual work which sometimes involves huge greps
> > etc: I also tend to do big cp -al's of transient stuff like build dirs
> > in idle mode to suppress caching, because the build dir will be deleted
> > long before it expires from the page cache.
> >
> > The SSD, which is an Intel DC S3510 and is thus read-biased rather than
> > write-biased (not ideal for this use-case: whoops, I misread the
> > datasheet), says
> >
> > EnduranceAnalyzer : 506.90 years
> >
> > despite also housing all the XFS journals. I am... not worried about the
> > SSD wearing out. It'll outlast everything else at this rate. It'll
> > probably outlast the machine's case and the floor the machine sits on.
> > It'll certainly outlast me (or at least last long enough to be discarded
> > by reason of being totally obsolete). Given that I really really don't
> > want to ever have to replace it (and no doubt screw up replacing it and
> > wreck the machine), this is excellent.
> >
> > (When I had to run without the ioprio patch, the expected SSD lifetime
> > and cache hit rate both plunged. It was still years, but enough years
> > that it could potentially have worn out before the rest of the machine
> > did. Using ioprio for this might be a bit of an abuse of ioprio, and
> > really some other mechanism might be better, but in the absence of such
> > a mechanism, ioprio *is*, at least for me, fairly tightly correlated
> > with whether I'm going to want to wait for I/O from the same block in
> > future.)
> 
From Nix on 10/03 at 5:39 AM PST
> I suppose. I'm not sure we don't want to skip even that for truly
> idle-time I/Os, though: booting is one thing, but do you want all the
> metadata associated with random deep directory trees you access once a
> year to be stored in your SSD's limited space, pushing out data you
> might actually use, because the idle-time backup traversed those trees?
> I know I don't. The whole point of idle-time I/O is that you don't care
> how fast it returns. If backing it up is speeding things up, I'd be
> interested in knowing why... what this is really saying is that metadata
> should be considered important even if the user says it isn't!
> 
> (I guess this is helping because of metadata that is read by idle I/Os
> first, but then non-idle ones later, in which case for anyone who runs
> backups this is just priming the cache with all metadata on the disk.
> Why not just run a non-idle-time cronjob to do that in the middle of the
> night if it's beneficial?)

(It did not look like this was being CC'd to the list so I have pasted the 
relevant bits of conversation. Kai, please resend your patch set and CC 
the list linux-bcache@vger.kernel.org)

I am glad that people are still making effective use of this patch!

It works great unless you are using mq-scsi (or perhaps mq-dm). For the 
multi-queue systems out there, ioprio does not seem to pass down through 
the stack into bcache, probably because it is passed through a worker 
thread for the submission or some other detail that I have not researched. 
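
For anyone who wants to try it on a non-mq stack, day-to-day use is just
ionice plus the sysfs hints. Roughly like this (the sysfs entry names below
are from memory, so treat them as placeholders and check the patch for the
real ones):
	# Hypothetical entry names; check the patch for the actual sysfs paths.
	# Bypass the cache for idle-class (class 3) IO:
	echo 3,0 > /sys/block/bcache0/bcache/ioprio_bypass
	# Writeback-cache anything at best-effort priority 0 or better:
	echo 2,0 > /sys/block/bcache0/bcache/ioprio_writeback

	# Then run bulk jobs at idle priority so they skip the cache:
	ionice -c 3 tar -cf /backup/home.tar /home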

Long ago others had concerns using ioprio as the mechanism for cache 
hinting, so what does everyone think about implementing cgroup inside of 
bcache? From what I can tell, cgroups have a stronger binding to an IO 
than ioprio hints. 

I think there are several per-cgroup tunables that could be useful. Here 
are the ones that I can think of, please chime in if anyone can think of 
others: 
 - should_bypass_write
 - should_bypass_read
 - should_bypass_meta
 - should_bypass_read_ahead
 - should_writeback
 - should_writeback_meta
 - should_cache_read
 - sequential_cutoff

Indeed, some of these could be combined into a single multi-valued cgroup 
option such as:
 - should_bypass = read,write,meta

 
--
Eric Wheeler

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] bcache: introduce bcache sysfs entries for ioprio-based bypass/writeback hints
  2020-10-05 19:41       ` [PATCH 1/3] bcache: introduce bcache sysfs entries for ioprio-based bypass/writeback hints Eric Wheeler
@ 2020-10-06 12:28         ` Nix
  2020-10-06 13:10           ` Kai Krakow
  2020-10-07 20:35         ` Eric Wheeler
  1 sibling, 1 reply; 8+ messages in thread
From: Nix @ 2020-10-06 12:28 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Kai Krakow, linux-bcache, linux-block

On 5 Oct 2020, Eric Wheeler verbalised:

> [+cc:bcache and blocklist]
>
> (It did not look like this was being CC'd to the list so I have pasted the 
> relevant bits of conversation. Kai, please resend your patch set and CC 
> the list linux-bcache@vger.kernel.org)

Oh sorry. I don't know what's been going on with the Cc:s here.

> I am glad that people are still making effective use of this patch!

:)

> It works great unless you are using mq-scsi (or perhaps mq-dm). For the 
> multi-queue systems out there, ioprio does not seem to pass down through 
> the stack into bcache, probably because it is passed through a worker 
> thread for the submission or some other detail that I have not researched. 

That sounds like a bug in the mq-scsi machinery: it surely should be
passing the ioprio off to the worker thread so that the worker thread
can reliably mimic the behaviour of the thread it's acting on behalf of.

> Long ago others had concerns using ioprio as the mechanism for cache 
> hinting, so what does everyone think about implementing cgroup inside of 
> bcache? From what I can tell, cgroups have a stronger binding to an IO 
> than ioprio hints. 

Nice idea, but... using cgroups would make this essentially unusable for
me, and probably for most other people, because on a systemd system the
cgroup hierarchy is more or less owned in fee simple by systemd, and it
won't let you use cgroups for something else, or even make your own
underneath the ones it's managing: it sometimes seems to work but they
can suddenly go away without warning and all the processes in them get
transferred out by systemd or even killed off.

(And as for making systemd set up suitable cgroups, that too would make
it unusable for me: I tend to run jobs ad-hoc with ionice, use ionice in
scripts etc to reduce caching when I know it won't be needed, and that
sort of thing is just not mature enough to be reliable in systemd yet.
It's rare for a systemd --user invocation to get everything so confused
that the entire system is rendered unusable, but it has happened to me
in the past, so unlike ionice I am now damn wary of using systemd --user
invocations for anything. They're a hell of a lot clunkier for ad-hoc
use than a simple ionice, too: you can't just say "run this command in a
--user", you have to set up a .service file etc.)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] bcache: introduce bcache sysfs entries for ioprio-based bypass/writeback hints
  2020-10-06 12:28         ` Nix
@ 2020-10-06 13:10           ` Kai Krakow
  2020-10-06 16:34             ` Nix
  0 siblings, 1 reply; 8+ messages in thread
From: Kai Krakow @ 2020-10-06 13:10 UTC (permalink / raw)
  To: Nix; +Cc: Eric Wheeler, linux-bcache, linux-block

On Tue, 6 Oct 2020 at 14:28, Nix <nix@esperi.org.uk> wrote:
>
> On 5 Oct 2020, Eric Wheeler verbalised:
>
> > [+cc:bcache and blocklist]
> >
> > (It did not look like this was being CC'd to the list so I have pasted the
> > relevant bits of conversation. Kai, please resend your patch set and CC
> > the list linux-bcache@vger.kernel.org)
>
> Oh sorry. I don't know what's been going on with the Cc:s here.
>
> > I am glad that people are still making effective use of this patch!
>
> :)
>
> > It works great unless you are using mq-scsi (or perhaps mq-dm). For the
> > multi-queue systems out there, ioprio does not seem to pass down through
> > the stack into bcache, probably because it is passed through a worker
> > thread for the submission or some other detail that I have not researched.
>
> That sounds like a bug in the mq-scsi machinery: it surely should be
> passing the ioprio off to the worker thread so that the worker thread
> can reliably mimic the behaviour of the thread it's acting on behalf of.

Maybe this was only an issue early in mq-scsi before it got more
schedulers than just iosched-none? It has bfq now, and it should work.
Depending on the filesystem, tho, that may still not fully apply...
e.g. btrfs doesn't use ioprio for delayed refs resulting from such io,
it will simply queue it up at the top of the io queue.

>
> > Long ago others had concerns using ioprio as the mechanism for cache
> > hinting, so what does everyone think about implementing cgroup inside of
> > bcache? From what I can tell, cgroups have a stronger binding to an IO
> > than ioprio hints.
>
> Nice idea, but...

Yeah, it would fit my use-case perfectly.

> using cgroups would make this essentially unusable for
> me, and probably for most other people, because on a systemd system the
> cgroup hierarchy is more or less owned in fee simple by systemd, and it
> won't let you use cgroups for something else,

That's probably not completely true, you can still define slices which
act as a cgroup container for all services and processes contained in
it, and you can use "systemctl edit myscope.slice" to change
scheduler, memory accounting, and IO params at runtime.
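
A quick sketch, assuming the slice already exists (the slice name is made
up, the properties are standard systemd cgroup knobs):
	# Adjust the slice's cgroup attributes on the fly; everything already
	# running in the slice picks the new limits up immediately:
	systemctl set-property --runtime background.slice IOWeight=20 CPUWeight=20 MemoryLow=512M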

> or even make your own
> underneath the ones it's managing: it sometimes seems to work but they
> can suddenly go away without warning and all the processes in them get
> transferred out by systemd or even killed off.

See above, use slices, don't try to sneak around systemd's cgroup
management - especially not in services.

> (And as for making systemd set up suitable cgroups, that too would make
> it unusable for me: I tend to run jobs ad-hoc with ionice, use ionice in
> scripts etc to reduce caching when I know it won't be needed, and that
> sort of thing is just not mature enough to be reliable in systemd yet.

You can still define a slice for such ad-hoc processes by using
systemd-run to make your process into a transient one-shot service.
It's not much different from prepending "ionice ... schedtool ...".
I'm using that to put some desktop programs in a resource jail to avoid
cache thrashing, e.g. by browsers which tend to dominate the cache:
https://github.com/kakra/gentoo-cgw (this will integrate with the
package manager to replace the original executable with a wrapper).
But that has some flaws, as in when running a browser from a Steam
container, it starts to act strange... But otherwise I'm using it
quite successfully.
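
In shell terms it's really just swapping one wrapper line for another
(again, the slice name is made up):
	# the old habit:
	ionice -c 3 cp -al build-dir build-snapshot
	# transient scope inside a pre-defined slice; the scope only applies
	# cgroup-level limits, so keep ionice if you also want the per-process
	# ioprio hint that bcache currently looks at:
	systemd-run --quiet --scope --slice=background.slice -- ionice -c 3 cp -al build-dir build-snapshot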

> It's rare for a systemd --user invocation to get everything so confused
> that the entire system is rendered unusable, but it has happened to me
> in the past, so unlike ionice I am now damn wary of using systemd --user
> invocations for anything. They're a hell of a lot clunkier for ad-hoc
> use than a simple ionice, too: you can't just say "run this command in a
> --user", you have to set up a .service file etc.)

Not sure what you did, I never experienced that. Usually that happens
when processes managed by a systemd service try to escape the current
session, i.e. by running "su -" or "sudo", so some uses of ionice may
experience similar results.

So my current situation is: I defined a slice for background jobs
(backup, maintenance jobs etc), one for games (boosting the CPU/IO/mem
share), one for browsers (limiting CPU to fight against run-away
javascripts), and some more. The trick is to define all slices with a
lower bound of memory below which the kernel won't reclaim memory from
it - I found that's one of the most important knobs to fight laggy
desktop usage. I usually look at the memory needed by the processes
when running, then add some amount of cache I think would be useful
for the processes, as cgroup memory accounting luckily counts both app allocations
AND cache memory. Actually, limiting memory with cgroups can have
quite an opposite effect (as processes tend to swap then, even with
plenty of RAM available).

Regards,
Kai

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] bcache: introduce bcache sysfs entries for ioprio-based bypass/writeback hints
  2020-10-06 13:10           ` Kai Krakow
@ 2020-10-06 16:34             ` Nix
  2020-10-07  0:41               ` Eric Wheeler
  0 siblings, 1 reply; 8+ messages in thread
From: Nix @ 2020-10-06 16:34 UTC (permalink / raw)
  To: Kai Krakow; +Cc: Eric Wheeler, linux-bcache, linux-block

On 6 Oct 2020, Kai Krakow verbalised:

> On Tue, 6 Oct 2020 at 14:28, Nix <nix@esperi.org.uk> wrote:
>> That sounds like a bug in the mq-scsi machinery: it surely should be
>> passing the ioprio off to the worker thread so that the worker thread
>> can reliably mimic the behaviour of the thread it's acting on behalf of.
>
> Maybe this was only an issue early in mq-scsi before it got more
> schedulers than just iosched-none? It has bfq now, and it should work.
> Depending on the filesystem, tho, that may still not fully apply...
> e.g. btrfs doesn't use ioprio for delayed refs resulting from such io,
> it will simply queue it up at the top of the io queue.

Yeah. FWIW I'm using bfq for all the underlying devices and everything
still seems to be working, idle I/O doesn't get bcached etc.
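
(Set per underlying device with the usual sysfs knob, e.g.

	echo bfq > /sys/block/sdb/queue/scheduler

with sdb standing in for whichever backing device.)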

>> using cgroups would make this essentially unusable for
>> me, and probably for most other people, because on a systemd system the
>> cgroup hierarchy is more or less owned in fee simple by systemd, and it
>> won't let you use cgroups for something else,
>
> That's probably not completely true, you can still define slices which
> act as a cgroup container for all services and processes contained in
> it, and you can use "systemctl edit myscope.slice" to change
> scheduler, memory accounting, and IO params at runtime.

That's... a lot clunkier than being able to say 'ionice -c 3 foo' to run
foo without caching. root has to prepare for it on a piece-by-piece
basis... not that ionice is the most pleasant of utilities to use
either.

>> (And as for making systemd set up suitable cgroups, that too would make
>> it unusable for me: I tend to run jobs ad-hoc with ionice, use ionice in
>> scripts etc to reduce caching when I know it won't be needed, and that
>> sort of thing is just not mature enough to be reliable in systemd yet.
>
> You can still define a slice for such ad-hoc processes by using
> systemd-run to make your process into a transient one-shot service.

That's one of the things that crashed my system when I tried it. I just
tried it again and it seems to work now. :) (Hm, does systemd-run wait
for return and hand back the exit code... yes, via --scope or --wait,
both of which seem to have elaborate constraints that I don't fully
understand and that makes me rather worried that using them might not be
reliable: but in this it is just like almost everything else in
systemd.)
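
(A quick test suggests the --scope form, at least, behaves like any other
wrapper, since the command runs in the foreground as a child of systemd-run
and the exit code passes straight through:

	systemd-run --quiet --scope -- false; echo $?    # prints 1

so for ad-hoc use it's not much worse than ionice.)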

>> It's rare for a systemd --user invocation to get everything so confused
>> that the entire system is rendered unusable, but it has happened to me
>> in the past, so unlike ionice I am now damn wary of using systemd --user
>> invocations for anything. They're a hell of a lot clunkier for ad-hoc
>> use than a simple ionice, too: you can't just say "run this command in a
>> --user", you have to set up a .service file etc.)
>
> Not sure what you did, I never experienced that. Usually that happens

It was early in the development of --user, so it may well have been a
bug that was fixed later on. In general I have found systemd to be too
tightly coupled and complex to be reliable: there seem to be all sorts
of ways to use local mounts and fs namespaces and the like to fubar PID
1 and force a reboot (which you can't do because PID 1 is too unhappy,
so it's /sbin/reboot -f time). Admittedly I do often do rather extreme
things with tens of thousands of mounts and the like, but y'know the
only thing they make unhappy is... systemd. :/

(I have used systemd enough to both rely on it and cordially loathe it
as an immensely overcomplicated monster with far too many edge cases and
far too much propensity to insist on your managing the system its way
(e.g. what it does with cgroups), and if I do anything but the simplest
stuff I'm likely to trip over one or more bugs in those edge cases. I'd
switch to something else simple enough to understand if only all the
things I might switch to were not also too simple to be able to do the
things I want to do. The usual software engineering dilemma...)

In general, though, the problem with cgroups is that courtesy of v2
having a unified hierarchy, if any one thing uses cgroups, nothing else
really can, because they all have to agree on the shape of the
hierarchy, which is most unlikely if they're using cgroups for different
purposes. So it is probably a mistake to use cgroups for *anything*
other than handing control of it to a single central thing (like
systemd) and then trying to forget that cgroups ever existed for any
other purpose because you'll never be able to use them yourself.

A shame. They could have been a powerful abstraction...

> and some more. The trick is to define all slices with a
> lower bound of memory below which the kernel won't reclaim memory from
> it - I found that's one of the most important knobs to fight laggy
> desktop usage.

I cheated and just got a desktop with 16GiB RAM and no moving parts and
a server with so much RAM that it never swaps, and 10GbE between the two
so the desktop can get stuff off the server as fast as its disks can do
contiguous reads. bcache cuts down seek time enough that I hardly ever
have to wait for it, and bingo :)

(But my approach is probably overkill: yours is more elegant.)

> I usually look at the memory needed by the processes when running,

I've not bothered with that for years: 16GiB seems to be enough that
Chrome plus even a fairly big desktop doesn't cause the remotest
shortage of memory, and the server, well, I can run multiple Emacsen and
20+ VMs on that without touching the sides. (Also... how do you look at
it? PSS is pretty good, but other than ps_mem almost nothing uses it,
not even the insanely overdesigned procps top.)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] bcache: introduce bcache sysfs entries for ioprio-based bypass/writeback hints
  2020-10-06 16:34             ` Nix
@ 2020-10-07  0:41               ` Eric Wheeler
  2020-10-07 12:43                 ` Nix
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Wheeler @ 2020-10-07  0:41 UTC (permalink / raw)
  To: Nix; +Cc: Kai Krakow, linux-bcache, linux-block

On Tue, 6 Oct 2020, Nix wrote:
> On 6 Oct 2020, Kai Krakow verbalised:
> 
> > On Tue, 6 Oct 2020 at 14:28, Nix <nix@esperi.org.uk> wrote:
> >> That sounds like a bug in the mq-scsi machinery: it surely should be
> >> passing the ioprio off to the worker thread so that the worker thread
> >> can reliably mimic the behaviour of the thread it's acting on behalf of.
> >
> > Maybe this was only an issue early in mq-scsi before it got more
> > schedulers than just iosched-none? It has bfq now, and it should work.
> > Depending on the filesystem, tho, that may still not fully apply...
> > e.g. btrfs doesn't use ioprio for delayed refs resulting from such io,
> > it will simply queue it up at the top of the io queue.
> 
> Yeah. FWIW I'm using bfq for all the underlying devices and everything
> still seems to be working, idle I/O doesn't get bcached etc.
> 
> >> using cgroups would make this essentially unusable for
> >> me, and probably for most other people, because on a systemd system the
> >> cgroup hierarchy is more or less owned in fee simple by systemd, and it
> >> won't let you use cgroups for something else,
> >
> > That's probably not completely true, you can still define slices which
> > act as a cgroup container for all services and processes contained in
> > it, and you can use "systemctl edit myscope.slice" to change
> > scheduler, memory accounting, and IO params at runtime.
> 
> That's... a lot clunkier than being able to say 'ionice -c 3 foo' to run
> foo without caching. root has to prepare for it on a piece-by-piece
> basis... not that ionice is the most pleasant of utilities to use
> either.

I always make my own cgroups with cgcreate, cgset, and cgexec.  We're 
using centos7 which is all systemd and I've never had a problem:

Something (hypothetically) like this:
	cgcreate -g blkio:/my_bcache_settings
	cgset -r blkio.bcache.bypass='read,write'    my_bcache_settings
	cgset -r blkio.bcache.writeback='write,meta' my_bcache_settings

Then all you need to do is run this which isn't all that different from an 
ionice invocation:
	cgexec -g blkio:my_bcache_settings /usr/local/bin/some-program


--
Eric Wheeler



> 
> >> (And as for making systemd set up suitable cgroups, that too would make
> >> it unusable for me: I tend to run jobs ad-hoc with ionice, use ionice in
> >> scripts etc to reduce caching when I know it won't be needed, and that
> >> sort of thing is just not mature enough to be reliable in systemd yet.
> >
> > You can still define a slice for such ad-hoc processes by using
> > systemd-run to make your process into a transient one-shot service.
> 
> That's one of the things that crashed my system when I tried it. I just
> tried it again and it seems to work now. :) (Hm, does systemd-run wait
> for return and hand back the exit code... yes, via --scope or --wait,
> both of which seem to have elaborate constraints that I don't fully
> understand and that makes me rather worried that using them might not be
> reliable: but in this it is just like almost everything else in
> systemd.)
> 
> >> It's rare for a systemd --user invocation to get everything so confused
> >> that the entire system is rendered unusable, but it has happened to me
> >> in the past, so unlike ionice I am now damn wary of using systemd --user
> >> invocations for anything. They're a hell of a lot clunkier for ad-hoc
> >> use than a simple ionice, too: you can't just say "run this command in a
> >> --user", you have to set up a .service file etc.)
> >
> > Not sure what you did, I never experienced that. Usually that happens
> 
> It was early in the development of --user, so it may well have been a
> bug that was fixed later on. In general I have found systemd to be too
> tightly coupled and complex to be reliable: there seem to be all sorts
> of ways to use local mounts and fs namespaces and the like to fubar PID
> 1 and force a reboot (which you can't do because PID 1 is too unhappy,
> so it's /sbin/reboot -f time). Admittedly I do often do rather extreme
> things with tens of thousands of mounts and the like, but y'know the
> only thing they make unhappy is... systemd. :/
> 
> (I have used systemd enough to both rely on it and cordially loathe it
> as an immensely overcomplicated monster with far too many edge cases and
> far too much propensity to insist on your managing the system its way
> (e.g. what it does with cgroups), and if I do anything but the simplest
> stuff I'm likely to trip over one or more bugs in those edge cases. I'd
> switch to something else simple enough to understand if only all the
> things I might switch to were not also too simple to be able to do the
> things I want to do. The usual software engineering dilemma...)
> 
> In general, though, the problem with cgroups is that courtesy of v2
> having a unified hierarchy, if any one thing uses cgroups, nothing else
> really can, because they all have to agree on the shape of the
> hierarchy, which is most unlikely if they're using cgroups for different
> purposes. So it is probably a mistake to use cgroups for *anything*
> other than handing control of it to a single central thing (like
> systemd) and then trying to forget that cgroups ever existed for any
> other purpose because you'll never be able to use them yourself.
> 
> A shame. They could have been a powerful abstraction...
> 
> > and some more. The trick is to define all slices with a
> > lower bound of memory below which the kernel won't reclaim memory from
> > it - I found that's one of the most important knobs to fight laggy
> > desktop usage.
> 
> I cheated and just got a desktop with 16GiB RAM and no moving parts and
> a server with so much RAM that it never swaps, and 10GbE between the two
> so the desktop can get stuff off the server as fast as its disks can do
> contiguous reads. bcache cuts down seek time enough that I hardly ever
> have to wait for it, and bingo :)
> 
> (But my approach is probably overkill: yours is more elegant.)
> 
> > I usually look at the memory needed by the processes when running,
> 
> I've not bothered with that for years: 16GiB seems to be enough that
> Chrome plus even a fairly big desktop doesn't cause the remotest
> shortage of memory, and the server, well, I can run multiple Emacsen and
> 20+ VMs on that without touching the sides. (Also... how do you look at
> it? PSS is pretty good, but other than ps_mem almost nothing uses it,
> not even the insanely overdesigned procps top.)
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] bcache: introduce bcache sysfs entries for ioprio-based bypass/writeback hints
  2020-10-07  0:41               ` Eric Wheeler
@ 2020-10-07 12:43                 ` Nix
  0 siblings, 0 replies; 8+ messages in thread
From: Nix @ 2020-10-07 12:43 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Kai Krakow, linux-bcache, linux-block

On 7 Oct 2020, Eric Wheeler said:

> I always make my own cgroups with cgcreate, cgset, and cgexec.  We're 
> using centos7 which is all systemd and I've never had a problem:

Oh, maybe I'm panicking about nothing as usual then. Maybe this is all
ancient systemd bugs that were fixed roughly when the Americas split off
from Europe but which I've been worrying over without retesting ever
since. :)

> Something (hypothetically) like this:
> 	cgcreate -g blkio:/my_bcache_settings
> 	cgset -r blkio.bcache.bypass='read,write'    my_bcache_settings
> 	cgset -r blkio.bcache.writeback='write,meta' my_bcache_settings
>
> Then all you need to do is run this which isn't all that different from an 
> ionice invocation:
> 	cgexec -g blkio:my_bcache_settings /usr/local/bin/some-program

... if it's that easy, I have no objections :) actually that looks
significantly more expressive than what we have now.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] bcache: introduce bcache sysfs entries for ioprio-based bypass/writeback hints
  2020-10-05 19:41       ` [PATCH 1/3] bcache: introduce bcache sysfs entries for ioprio-based bypass/writeback hints Eric Wheeler
  2020-10-06 12:28         ` Nix
@ 2020-10-07 20:35         ` Eric Wheeler
  2020-10-08 10:45           ` Coly Li
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Wheeler @ 2020-10-07 20:35 UTC (permalink / raw)
  To: Coly Li; +Cc: Kai Krakow, Nix, linux-block, linux-bcache

[+cc coly]

On Mon, 5 Oct 2020, Eric Wheeler wrote:
> On Sun, 4 Oct 2020, Kai Krakow wrote:
> 
> > Hey Nix!
> > 
> > Apparently, `git send-email` probably swallowed the patch 0/3 message for you.
> > 
> > It was about adding one additional patch which reduced boot time for
> > me with idle mode active by a factor of 2.
> > 
> > You can look at it here:
> > https://github.com/kakra/linux/pull/4
> > 
> > It's "bcache: Only skip data request in io_prio bypass mode" just if
> > you're curious.
> > 
> > Regards,
> > Kai
> > 
> > > On Sun, 4 Oct 2020 at 15:19, Nix <nix@esperi.org.uk> wrote:
> > >
> > > On 3 Oct 2020, Kai Krakow spake thusly:
> > >
> > > > Having idle IOs bypass the cache can increase performance elsewhere
> > > > since you probably don't care about their performance.  In addition,
> > > > this prevents idle IOs from promoting into (polluting) your cache and
> > > > evicting blocks that are more important elsewhere.
> > >
> > > FYI, stats from 20 days of uptime with this patch live in a stack with
> > > XFS above it and md/RAID-6 below (20 days being the time since the last
> > > reboot: I've been running this patch for years with older kernels
> > > without incident):
> > >
> > > stats_total/bypassed: 282.2G
> > > stats_total/cache_bypass_hits: 123808
> > > stats_total/cache_bypass_misses: 400813
> > > stats_total/cache_hit_ratio: 53
> > > stats_total/cache_hits: 9284282
> > > stats_total/cache_miss_collisions: 51582
> > > stats_total/cache_misses: 8183822
> > > stats_total/cache_readaheads: 0
> > > written: 168.6G
> > >
> > > ... so it's still saving a lot of seeking. This is despite having
> > > backups running every three hours (in idle mode), and the usual updatedb
> > > runs, etc, plus, well, actual work which sometimes involves huge greps
> > > etc: I also tend to do big cp -al's of transient stuff like build dirs
> > > in idle mode to suppress caching, because the build dir will be deleted
> > > long before it expires from the page cache.
> > >
> > > The SSD, which is an Intel DC S3510 and is thus read-biased rather than
> > > write-biased (not ideal for this use-case: whoops, I misread the
> > > datasheet), says
> > >
> > > EnduranceAnalyzer : 506.90 years
> > >
> > > despite also housing all the XFS journals. I am... not worried about the
> > > SSD wearing out. It'll outlast everything else at this rate. It'll
> > > probably outlast the machine's case and the floor the machine sits on.
> > > It'll certainly outlast me (or at least last long enough to be discarded
> > > by reason of being totally obsolete). Given that I really really don't
> > > want to ever have to replace it (and no doubt screw up replacing it and
> > > wreck the machine), this is excellent.
> > >
> > > (When I had to run without the ioprio patch, the expected SSD lifetime
> > > and cache hit rate both plunged. It was still years, but enough years
> > > that it could potentially have worn out before the rest of the machine
> > > did. Using ioprio for this might be a bit of an abuse of ioprio, and
> > > really some other mechanism might be better, but in the absence of such
> > > a mechanism, ioprio *is*, at least for me, fairly tightly correlated
> > > with whether I'm going to want to wait for I/O from the same block in
> > > future.)
> > 
> From Nix on 10/03 at 5:39 AM PST
> > I suppose. I'm not sure we don't want to skip even that for truly
> > idle-time I/Os, though: booting is one thing, but do you want all the
> > metadata associated with random deep directory trees you access once a
> > year to be stored in your SSD's limited space, pushing out data you
> > might actually use, because the idle-time backup traversed those trees?
> > I know I don't. The whole point of idle-time I/O is that you don't care
> > how fast it returns. If backing it up is speeding things up, I'd be
> > interested in knowing why... what this is really saying is that metadata
> > should be considered important even if the user says it isn't!
> > 
> > (I guess this is helping because of metadata that is read by idle I/Os
> > first, but then non-idle ones later, in which case for anyone who runs
> > backups this is just priming the cache with all metadata on the disk.
> > Why not just run a non-idle-time cronjob to do that in the middle of the
> > night if it's beneficial?)
> 
> (It did not look like this was being CC'd to the list so I have pasted the 
> relevant bits of conversation. Kai, please resend your patch set and CC 
> the list linux-bcache@vger.kernel.org)
> 
> I am glad that people are still making effective use of this patch!
> 
> It works great unless you are using mq-scsi (or perhaps mq-dm). For the 
> multi-queue systems out there, ioprio does not seem to pass down through 
> the stack into bcache, probably because it is passed through a worker 
> thread for the submission or some other detail that I have not researched. 
> 
> Long ago others had concerns using ioprio as the mechanism for cache 
> hinting, so what does everyone think about implementing cgroup inside of 
> bcache? From what I can tell, cgroups have a stronger binding to an IO 
> than ioprio hints. 
> 
> I think there are several per-cgroup tunables that could be useful. Here 
> are the ones that I can think of, please chime in if anyone can think of 
> others: 
>  - should_bypass_write
>  - should_bypass_read
>  - should_bypass_meta
>  - should_bypass_read_ahead
>  - should_writeback
>  - should_writeback_meta
>  - should_cache_read
>  - sequential_cutoff
> 
> Indeed, some of these could be combined into a single multi-valued cgroup 
> option such as:
>  - should_bypass = read,write,meta


Hi Coly,

Do you have any comments on the best cgroup implementation for bcache?

What other per-process cgroup parameters might be useful for tuning 
bcache behavior to various workloads?

-Eric

--
Eric Wheeler

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] bcache: introduce bcache sysfs entries for ioprio-based bypass/writeback hints
  2020-10-07 20:35         ` Eric Wheeler
@ 2020-10-08 10:45           ` Coly Li
  0 siblings, 0 replies; 8+ messages in thread
From: Coly Li @ 2020-10-08 10:45 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Kai Krakow, Nix, linux-block, linux-bcache

On 2020/10/8 04:35, Eric Wheeler wrote:
> [+cc coly]
> 
> On Mon, 5 Oct 2020, Eric Wheeler wrote:
>> On Sun, 4 Oct 2020, Kai Krakow wrote:
>>
>>> Hey Nix!
>>>
>>> Apparently, `git send-email` probably swallowed the patch 0/3 message for you.
>>>
>>> It was about adding one additional patch which reduced boot time for
>>> me with idle mode active by a factor of 2.
>>>
>>> You can look at it here:
>>> https://github.com/kakra/linux/pull/4
>>>
>>> It's "bcache: Only skip data request in io_prio bypass mode" just if
>>> you're curious.
>>>
>>> Regards,
>>> Kai
>>>
>>> On Sun, 4 Oct 2020 at 15:19, Nix <nix@esperi.org.uk> wrote:
>>>>
>>>> On 3 Oct 2020, Kai Krakow spake thusly:
>>>>
>>>>> Having idle IOs bypass the cache can increase performance elsewhere
>>>>> since you probably don't care about their performance.  In addition,
>>>>> this prevents idle IOs from promoting into (polluting) your cache and
>>>>> evicting blocks that are more important elsewhere.
>>>>
>>>> FYI, stats from 20 days of uptime with this patch live in a stack with
>>>> XFS above it and md/RAID-6 below (20 days being the time since the last
>>>> reboot: I've been running this patch for years with older kernels
>>>> without incident):
>>>>
>>>> stats_total/bypassed: 282.2G
>>>> stats_total/cache_bypass_hits: 123808
>>>> stats_total/cache_bypass_misses: 400813
>>>> stats_total/cache_hit_ratio: 53
>>>> stats_total/cache_hits: 9284282
>>>> stats_total/cache_miss_collisions: 51582
>>>> stats_total/cache_misses: 8183822
>>>> stats_total/cache_readaheads: 0
>>>> written: 168.6G
>>>>
>>>> ... so it's still saving a lot of seeking. This is despite having
>>>> backups running every three hours (in idle mode), and the usual updatedb
>>>> runs, etc, plus, well, actual work which sometimes involves huge greps
>>>> etc: I also tend to do big cp -al's of transient stuff like build dirs
>>>> in idle mode to suppress caching, because the build dir will be deleted
>>>> long before it expires from the page cache.
>>>>
>>>> The SSD, which is an Intel DC S3510 and is thus read-biased rather than
>>>> write-biased (not ideal for this use-case: whoops, I misread the
>>>> datasheet), says
>>>>
>>>> EnduranceAnalyzer : 506.90 years
>>>>
>>>> despite also housing all the XFS journals. I am... not worried about the
>>>> SSD wearing out. It'll outlast everything else at this rate. It'll
>>>> probably outlast the machine's case and the floor the machine sits on.
>>>> It'll certainly outlast me (or at least last long enough to be discarded
>>>> by reason of being totally obsolete). Given that I really really don't
>>>> want to ever have to replace it (and no doubt screw up replacing it and
>>>> wreck the machine), this is excellent.
>>>>
>>>> (When I had to run without the ioprio patch, the expected SSD lifetime
>>>> and cache hit rate both plunged. It was still years, but enough years
>>>> that it could potentially have worn out before the rest of the machine
>>>> did. Using ioprio for this might be a bit of an abuse of ioprio, and
>>>> really some other mechanism might be better, but in the absence of such
>>>> a mechanism, ioprio *is*, at least for me, fairly tightly correlated
>>>> with whether I'm going to want to wait for I/O from the same block in
>>>> future.)
>>>
>> From Nix on 10/03 at 5:39 AM PST
>>> I suppose. I'm not sure we don't want to skip even that for truly
>>> idle-time I/Os, though: booting is one thing, but do you want all the
>>> metadata associated with random deep directory trees you access once a
>>> year to be stored in your SSD's limited space, pushing out data you
>>> might actually use, because the idle-time backup traversed those trees?
>>> I know I don't. The whole point of idle-time I/O is that you don't care
>>> how fast it returns. If backing it up is speeding things up, I'd be
>>> interested in knowing why... what this is really saying is that metadata
>>> should be considered important even if the user says it isn't!
>>>
>>> (I guess this is helping because of metadata that is read by idle I/Os
>>> first, but then non-idle ones later, in which case for anyone who runs
>>> backups this is just priming the cache with all metadata on the disk.
>>> Why not just run a non-idle-time cronjob to do that in the middle of the
>>> night if it's beneficial?)
>>
>> (It did not look like this was being CC'd to the list so I have pasted the 
>> relevant bits of conversation. Kai, please resend your patch set and CC 
>> the list linux-bcache@vger.kernel.org)
>>
>> I am glad that people are still making effective use of this patch!
>>
>> It works great unless you are using mq-scsi (or perhaps mq-dm). For the 
>> multi-queue systems out there, ioprio does not seem to pass down through 
>> the stack into bcache, probably because it is passed through a worker 
>> thread for the submission or some other detail that I have not researched. 
>>
>> Long ago others had concerns using ioprio as the mechanism for cache 
>> hinting, so what does everyone think about implementing cgroup inside of 
>> bcache? From what I can tell, cgroups have a stronger binding to an IO 
>> than ioprio hints. 
>>
>> I think there are several per-cgroup tunables that could be useful. Here 
>> are the ones that I can think of, please chime in if anyone can think of 
>> others: 
>>  - should_bypass_write
>>  - should_bypass_read
>>  - should_bypass_meta
>>  - should_bypass_read_ahead
>>  - should_writeback
>>  - should_writeback_meta
>>  - should_cache_read
>>  - sequential_cutoff
>>
>> Indeed, some of these could be combined into a single multi-valued cgroup 
>> option such as:
>>  - should_bypass = read,write,meta
> 
> 
> Hi Coly,
> 
> Do you have any comments on the best cgroup implementation for bcache?
> 
> What other per-process cgroup parameters might be useful for tuning 
> bcache behavior to various workloads?

Hi Eric,

This is much better than using magic numbers to control the I/O priority.

I am not familiar with cgroup configuration and implementation; I am just
wondering whether per-process control is even possible here, because most
of the I/Os in bcache are submitted by a kworker or kthread.

Anyway, we may start from the bypass stuff in your example. If you can
help to compose the patches and maintain them in the long term, I am glad
to take them in.

Thanks.

Coly Li


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-10-08 10:46 UTC | newest]

Thread overview: 8+ messages
-- links below jump to the message on this page --
     [not found] <20201003111056.14635-1-kai@kaishome.de>
     [not found] ` <20201003111056.14635-2-kai@kaishome.de>
     [not found]   ` <87362ucen3.fsf@esperi.org.uk>
     [not found]     ` <CAC2ZOYt+ZMep=PT5FbQKiqZ0EE1f4+JJn=oTJUtQjLwGvy=KfQ@mail.gmail.com>
2020-10-05 19:41       ` [PATCH 1/3] bcache: introduce bcache sysfs entries for ioprio-based bypass/writeback hints Eric Wheeler
2020-10-06 12:28         ` Nix
2020-10-06 13:10           ` Kai Krakow
2020-10-06 16:34             ` Nix
2020-10-07  0:41               ` Eric Wheeler
2020-10-07 12:43                 ` Nix
2020-10-07 20:35         ` Eric Wheeler
2020-10-08 10:45           ` Coly Li
