* [RFC] writeback and cgroup
@ 2012-04-03 18:36 ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-03 18:36 UTC (permalink / raw)
  To: Fengguang Wu, Jan Kara, vgoyal, Jens Axboe
  Cc: linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel,
	linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups,
	ctalbott, rni, lsf

Hello, guys.

So, during LSF, Fengguang, Jan and I had a chance to sit down and talk
about how to add cgroup support to writeback.  Here's what I got from it.

Fengguang's opinion is that the throttling algorithm implemented in
writeback is good enough and that blkcg parameters can be exposed to
writeback so that those limits can be applied from writeback.  As for
reads and direct IOs, Fengguang opined that the algorithm can easily
be extended to cover those cases and, IIUC, that all IOs - buffered
writes, reads and direct IOs - can eventually go through the writeback
layer, which would be the one layer controlling all IOs.

Unfortunately, I don't agree with that at all.  I think it's a gross
layering violation and lacks any long-term design.  We have a
well-working model of applying and propagating resource pressure - we
apply the pressure where the resource exists and propagate the back
pressure through buffers to upper layers, up to the originator.  Think
about networking: the pressure exists or is applied at the in/egress
points and gets propagated through socket buffers, eventually
throttling the originator.

Writeback, without cgroup, isn't different.  It constitutes a part of
the pressure propagation chain anchored at the IO device.  IO devices
these days generate very high pressure, which gets propagated through
the IO scheduler and buffered requests, which in turn creates pressure
at writeback.  Here, the buffering happens in the page cache, and
pressure at writeback increases the amount of dirty page cache.
Propagating this IO pressure to the dirtying task is one of the
biggest responsibilities of the writeback code, and this is the
underlying design of the whole thing.

IIUC, without cgroup, the current writeback code works more or less
like this.  Throwing in cgroup doesn't really change the fundamental
design.  Instead of a single pipe going down, we just have multiple
pipes to the same device, each of which should be treated separately.
Of course, a spinning disk can't be divided that easily and the pipes'
performance characteristics will be inter-dependent, but the place to
solve that problem is where the problem is: the block layer.

We may have to look for optimizations and expose some details to
improve the overall behavior, and such optimizations may require some
deviation from the fundamental design, but such optimizations should
be justified and such deviations kept to a minimum.  So, no, I don't
think we're going to expose blkcg / block / elevator parameters
directly to writeback.  Unless someone can *really* convince me
otherwise, I'll be vetoing any change toward that direction.

Let's please keep the layering clear.  IO limitations will be applied
at the block layer, and pressure will be formed there and then
propagated upwards, eventually to the originator.  Sure, exposing all
that information might result in better behavior for certain
workloads, but down the road, say, in three or five years, devices
which can be shared without worrying too much about seeks might be
commonplace, and we could be swearing at a disgusting structural mess;
sadly, various cgroup support seems to be a prominent source of such
design failures.

IMHO, treating each cgroup - device/bdi pair as a separate device
should suffice as the underlying design.  After all, blkio cgroup
support's ultimate goal is dividing the IO resource into separate
bins.  Implementation details might change as the underlying
technology changes and we learn how to do it better, but that is the
goal we'll always try to keep close to.  Writeback should (be able to)
treat them as separate devices.  We'll surely need adjustments and
optimizations to make things work at least somewhat reasonably, but
that is the baseline.

In the discussion, the following obstacles to such an implementation
were identified.

* There are a lot of cases where IOs are issued by a task which isn't
  the originator, e.g. writeback issues IOs for pages which were
  dirtied by other tasks.  So, by the time an IO reaches the block
  layer, we don't know which cgroup the IO belongs to.

  Recently, the block layer has grown support for attaching a task to
  a bio, which causes the bio to be handled as if it were issued by
  the associated task regardless of the actual issuing task.  It
  currently only allows attaching %current to a bio -
  bio_associate_current() - but changing it to support other tasks is
  trivial.

  We'll need to update the async issuers to tag the IOs they issue,
  but the mechanism is already there.

* There's a single request pool shared by all issuers per request
  queue.  This can lead to priority inversion among cgroups.  Note
  that the problem also exists without cgroups: a lower-ioprio issuer
  may be holding a request that holds back a higher-prio issuer.

  We'll need to make request allocation cgroup- (and hopefully
  ioprio-) aware, probably in the form of separate request pools.
  This will take some work but I don't think it will be too
  challenging.  I'll work on it.

* The cfq cgroup policy throws all async IOs, which is what all
  buffered writes are, into the shared cgroup regardless of the actual
  cgroup.  This behavior is, I believe, mostly historical, and
  changing it isn't difficult - probably only a few tens of lines of
  changes.  It may cause significant changes to actual IO behavior
  with cgroups, though.  I personally think the previous behavior was
  too wrong to keep (the weight was completely ignored for buffered
  writes) but we may want to introduce a switch to toggle between the
  two behaviors.

  Note that blk-throttle doesn't have this problem.

* Unlike dirty data pages, metadata tends to have strict ordering
  requirements and thus is susceptible to priority inversion.  Two
  solutions were suggested: 1. allow overdraw for metadata writes so
  that low-prio metadata writes don't block the whole FS; 2. provide
  an interface to query and wait for bdi-cgroup congestion, which can
  be called from FS metadata paths to throttle metadata operations
  before they enter the stream of ordered operations.

  I think a combination of the above two should be enough to solve the
  problem.  I *think* the second can be implemented as part of the
  cgroup-aware request allocation update.  The first one needs a bit
  more thought, but there can be easier interim solutions (e.g. throw
  META writes to the head of the cgroup queue, or just plain ignore
  cgroup limits for META writes) for now.

* I'm sure there are a lot of design choices to be made in the
  writeback implementation, but IIUC Jan seems to agree that the
  simplest would be to treat different cgroup-bdi pairs as completely
  separate, which shouldn't add too much complexity to the already
  intricate writeback code.

So, I think we have something which sounds like a plan, which at least
I can agree with and seems doable without adding a lot of complexity.

Jan, Fengguang, I'm pretty sure I missed some stuff from writeback's
side, and IIUC Fengguang doesn't agree with this approach too much, so
please voice your opinions & comments.

Thank you.

--
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found] ` <20120403183655.GA23106-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
@ 2012-04-04 14:51   ` Vivek Goyal
  2012-04-04 17:51     ` Fengguang Wu
  1 sibling, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-04 14:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:

Hi Tejun,

Thanks for the RFC and for looking into this issue.  A few thoughts inline.

[..]
> IIUC, without cgroup, the current writeback code works more or less
> like this.  Throwing in cgroup doesn't really change the fundamental
> design.  Instead of a single pipe going down, we just have multiple
> pipes to the same device, each of which should be treated separately.
> Of course, a spinning disk can't be divided that easily and their
> performance characteristics will be inter-dependent, but the place to
> solve that problem is where the problem is, the block layer.

How do you take care of throttling IO in the NFS case in this model?
The current throttling logic is tied to the block device, and in the
case of NFS there is no block device.

[..]
> In the discussion, for such implementation, the following obstacles
> were identified.
> 
> * There are a lot of cases where IOs are issued by a task which isn't
>   the originiator.  ie. Writeback issues IOs for pages which are
>   dirtied by some other tasks.  So, by the time an IO reaches the
>   block layer, we don't know which cgroup the IO belongs to.
> 
>   Recently, block layer has grown support to attach a task to a bio
>   which causes the bio to be handled as if it were issued by the
>   associated task regardless of the actual issuing task.  It currently
>   only allows attaching %current to a bio - bio_associate_current() -
>   but changing it to support other tasks is trivial.
> 
>   We'll need to update the async issuers to tag the IOs they issue but
>   the mechanism is already there.

Most likely this tagging will take place in "struct page", and I am not
sure we will be allowed to grow the size of "struct page" for this reason.

> 
> * There's a single request pool shared by all issuers per a request
>   queue.  This can lead to priority inversion among cgroups.  Note
>   that problem also exists without cgroups.  Lower ioprio issuer may
>   be holding a request holding back highprio issuer.
> 
>   We'll need to make request allocation cgroup (and hopefully ioprio)
>   aware.  Probably in the form of separate request pools.  This will
>   take some work but I don't think this will be too challenging.  I'll
>   work on it.

This should be doable.  I had implemented it long back with a single
request pool but internal limits for each group - that is, block a
task in a group once the group has enough pending requests allocated
from the pool.  But separate request pools should work equally well.

Just that it conflicts a bit with the current definition of
q->nr_requests, which specifies the total number of outstanding
requests on the queue.  Once you make the pools per-group, I guess
this limit will have to be transformed into a per-group upper limit.

> 
> * cfq cgroup policy throws all async IOs, which all buffered writes
>   are, into the shared cgroup regardless of the actual cgroup.  This
>   behavior is, I believe, mostly historical and changing it isn't
>   difficult.  Prolly only few tens of lines of changes.  This may
>   cause significant changes to actual IO behavior with cgroups tho.  I
>   personally think the previous behavior was too wrong to keep (the
>   weight was completely ignored for buffered writes) but we may want
>   to introduce a switch to toggle between the two behaviors.

I had kept all buffered writes in the same cgroup (root cgroup) for a
few reasons.

- Because of the single request descriptor pool for writes, one writer
  gets backlogged behind another anyway.  So creating separate async
  queues per group is not going to help.

- Writeback logic was not cgroup aware, so it might not send enough IO
  from each writer to maintain parallelism.  So creating separate
  async queues did not make sense till that was fixed.

- As you said, it is historical also.  We prioritize READS at the
  expense of writes.  By putting buffered/async writes in a separate
  group, we might end up prioritizing one group's async write over
  another group's synchronous read.  How many people really want that
  behavior?  To me, keeping service differentiation among the sync IO
  matters most.  Even if all async IO is treated the same, I guess not
  many people would care.

> 
>   Note that blk-throttle doesn't have this problem.

I am not sure what you are trying to say here.  But primarily
blk-throttle will throttle reads and direct IO.  Buffered writes will
go to the root cgroup, which is typically unthrottled.

> 
> * Unlike dirty data pages, metadata tends to have strict ordering
>   requirements and thus is susceptible to priority inversion.  Two
>   solutions were suggested - 1. allow overdrawl for metadata writes so
>   that low prio metadata writes don't block the whole FS, 2. provide
>   an interface to query and wait for bdi-cgroup congestion which can
>   be called from FS metadata paths to throttle metadata operations
>   before they enter the stream of ordered operations.

So that will probably mean changing the order of operations also.
IIUC, in the case of fsync (ordered mode), we open a metadata
transaction first, then try to flush all the cached data and then
flush the metadata.  So if fsync is throttled, all the metadata
operations behind it will get serialized for ext3/ext4.

So you seem to be suggesting that we change the design so that a
metadata operation is not thrown into the ordered stream till we have
finished writing all the data back to disk?  I am not a filesystem
developer, so I don't know how feasible this change is.

This is just one of the points.  In the past, while talking to Dave
Chinner, he mentioned that in XFS, if two cgroups fall into the same
allocation group, there were cases where the IO of one cgroup could
get serialized behind the other's.

In general, the core of the issue is that filesystems are not cgroup
aware, and if you do throttling below filesystems, then invariably one
serialization issue or another will come up, and I am concerned that
we will be constantly fixing those serialization issues.  Or the
design point could be so central to filesystem design that it can't be
changed.

In general, if you do throttling deeper in the stack and build back
pressure, then all the layers sitting above should be cgroup aware to
avoid problems.  Two layers identified so far are writeback and
filesystems.  Is it really worth the complexity?  How about doing
throttling in higher layers, where IO enters the kernel, and keeping
the proportional IO logic at the lowest level, so that the current
mechanism of building pressure continues to work?

Why split them?  Proportional IO logic is work-conserving, so even if
some serialization happens, that situation should clear up pretty
soon, as IO from other cgroups will dry up and IO from the group
causing the serialization will make progress; at most we will lose
fairness for a certain duration.

With throttling, limits come from the user, and one can set really low
artificial limits.  So even if the underlying resources are free, IO
from a throttled cgroup might not make any progress, in turn choking
every other cgroup which is serialized behind it.

So in general throttling at block layer and building back pressure is
fine. I am concerned about two cases.

- How to handle NFS.
- Do filesystem developers agree with this approach and are they willing
  to address any serialization issues arising due to this design.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-03 18:36 ` Tejun Heo
@ 2012-04-04 14:51   ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-04 14:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:

Hi Tejun,

Thanks for the RFC and looking into this issue. Few thoughts inline.

[..]
> IIUC, without cgroup, the current writeback code works more or less
> like this.  Throwing in cgroup doesn't really change the fundamental
> design.  Instead of a single pipe going down, we just have multiple
> pipes to the same device, each of which should be treated separately.
> Of course, a spinning disk can't be divided that easily and their
> performance characteristics will be inter-dependent, but the place to
> solve that problem is where the problem is, the block layer.

How do you take care of throttling IO in the NFS case in this model?
The current throttling logic is tied to a block device, and in the
case of NFS there is no block device.

[..]
> In the discussion, for such implementation, the following obstacles
> were identified.
> 
> * There are a lot of cases where IOs are issued by a task which isn't
>   the originator.  ie.  Writeback issues IOs for pages which are
>   dirtied by some other tasks.  So, by the time an IO reaches the
>   block layer, we don't know which cgroup the IO belongs to.
> 
>   Recently, block layer has grown support to attach a task to a bio
>   which causes the bio to be handled as if it were issued by the
>   associated task regardless of the actual issuing task.  It currently
>   only allows attaching %current to a bio - bio_associate_current() -
>   but changing it to support other tasks is trivial.
> 
>   We'll need to update the async issuers to tag the IOs they issue but
>   the mechanism is already there.

Most likely this tagging will take place in "struct page", and I am not
sure we will be allowed to grow the size of "struct page" for this reason.

> 
> * There's a single request pool shared by all issuers per a request
>   queue.  This can lead to priority inversion among cgroups.  Note
>   that problem also exists without cgroups.  Lower ioprio issuer may
>   be holding a request holding back highprio issuer.
> 
>   We'll need to make request allocation cgroup (and hopefully ioprio)
>   aware.  Probably in the form of separate request pools.  This will
>   take some work but I don't think this will be too challenging.  I'll
>   work on it.

This should be doable. I had implemented it long back with a single
request pool but internal limits for each group; that is, block a task
in the group once the group has enough pending requests allocated from
the pool. But separate request pools should work equally well.

Just that it conflicts a bit with the current definition of
q->nr_requests, which specifies the total number of outstanding
requests on the queue. Once you make the pool per group, I guess this
limit will have to be transformed into a per-group upper limit.
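The per-group limit idea above can be sketched in plain C. This is an
illustrative userspace model under assumed semantics, not kernel code;
the names (req_queue, req_group, try_get_request) are invented for the
example:

```c
#include <stdbool.h>

/* Toy model: a queue-wide request budget (cf. q->nr_requests) with a
 * per-group upper limit, so one cgroup cannot exhaust the shared pool
 * and hold back a higher-priority issuer. */
struct req_queue {
	int nr_requests;	/* queue-wide budget */
	int allocated;		/* requests currently outstanding */
};

struct req_group {
	int grp_nr_requests;	/* per-group upper limit */
	int grp_allocated;	/* this group's outstanding requests */
};

/* Returns true if the group may allocate a request right now.  A real
 * implementation would sleep on a per-group wait queue on failure. */
static bool try_get_request(struct req_queue *q, struct req_group *g)
{
	if (q->allocated >= q->nr_requests)
		return false;		/* whole queue is congested */
	if (g->grp_allocated >= g->grp_nr_requests)
		return false;		/* only this group is congested */
	q->allocated++;
	g->grp_allocated++;
	return true;
}

static void put_request(struct req_queue *q, struct req_group *g)
{
	q->allocated--;
	g->grp_allocated--;
}
```

With this shape, q->nr_requests keeps its queue-wide meaning while each
group gets its own cap, which is roughly the transformation described
above.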

> 
> * cfq cgroup policy throws all async IOs, which all buffered writes
>   are, into the shared cgroup regardless of the actual cgroup.  This
>   behavior is, I believe, mostly historical and changing it isn't
>   difficult.  Prolly only few tens of lines of changes.  This may
>   cause significant changes to actual IO behavior with cgroups tho.  I
>   personally think the previous behavior was too wrong to keep (the
>   weight was completely ignored for buffered writes) but we may want
>   to introduce a switch to toggle between the two behaviors.

I had kept all buffered writes in the same cgroup (the root cgroup) for
a few reasons.

- Because of the single request descriptor pool for writes, one writer
  gets backlogged behind another anyway. So creating separate async
  queues per group is not going to help.

- The writeback logic was not cgroup aware, so it might not send enough
  IO from each writer to maintain parallelism. Creating separate async
  queues did not make sense until that was fixed.

- As you said, it is also historical. We prioritize READS at the expense
  of writes. By putting buffered/async writes in a separate group, we
  might end up prioritizing one group's async write over another group's
  synchronous read. How many people really want that behavior? To me,
  keeping service differentiation among the sync IO matters most. Even
  if all async IO is treated the same, I guess not many people will care.

> 
>   Note that blk-throttle doesn't have this problem.

I am not sure what you are trying to say here. But primarily,
blk-throttle will throttle reads and direct IO. Buffered writes will go
to the root cgroup, which is typically unthrottled.

> 
> * Unlike dirty data pages, metadata tends to have strict ordering
>   requirements and thus is susceptible to priority inversion.  Two
>   solutions were suggested - 1. allow overdraw for metadata writes so
>   that low prio metadata writes don't block the whole FS, 2. provide
>   an interface to query and wait for bdi-cgroup congestion which can
>   be called from FS metadata paths to throttle metadata operations
>   before they enter the stream of ordered operations.

So that will probably mean changing the order of operations too. IIUC,
in the case of fsync (ordered mode), we open a metadata transaction
first, then try to flush all the cached data, and then flush the
metadata. So if fsync is throttled, all the metadata operations behind
it will get serialized for ext3/ext4.

So you seem to be suggesting that we change the design so that a
metadata operation is not thrown into the ordered stream until we have
finished writing all the data back to disk? I am not a filesystem
developer, so I don't know how feasible this change is.

This is just one of the points. In the past, while talking to Dave
Chinner, he mentioned that in XFS, if two cgroups fall into the same
allocation group, there were cases where the IO of one cgroup could get
serialized behind the other's.

In general, the core of the issue is that filesystems are not cgroup
aware, and if you do throttling below the filesystem, then invariably
one serialization issue or another will come up, and I am concerned
that we will be constantly fixing those serialization issues. Or the
design point could be so central to the filesystem's design that it
can't be changed.

In general, if you do throttling deeper in the stack and build back
pressure, then all the layers sitting above should be cgroup aware
to avoid problems. Two layers identified so far are writeback and
filesystems. Is it really worth the complexity? How about doing
throttling in higher layers, where IO enters the kernel, while keeping
the proportional IO logic at the lowest level so that the current
mechanism of building pressure continues to work?

Why split the two? The proportional IO logic is work conserving, so
even if some serialization happens, that situation should clear up
pretty soon: IO from the other cgroups will dry up, IO from the group
causing the serialization will make progress, and at worst we lose
fairness for a certain duration.

With throttling, the limits come from the user, and one can set really
low artificial limits. So even if the underlying resources are free,
IO from the throttled cgroup might not make any progress, in turn
choking every other cgroup that is serialized behind it.
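The contrast between the two modes can be made concrete with a toy
bandwidth calculation. This is only an illustration with made-up
numbers; prop_share and throttled_bw are not real kernel interfaces:

```c
/* Work-conserving proportional split: shares are computed over the
 * *active* weights, so an idle group's bandwidth is redistributed and
 * the device never sits idle while someone has IO to do. */
static int prop_share(int device_bw, int my_weight, int active_weight_sum)
{
	return device_bw * my_weight / active_weight_sum;
}

/* Absolute throttling: the user-set cap applies even when the device
 * has free capacity, so a very low limit can stall the group - and
 * anything serialized behind it - indefinitely. */
static int throttled_bw(int device_bw, int limit)
{
	return limit < device_bw ? limit : device_bw;
}
```

With two active groups of weight 500 each on a 100 MB/s device, each
gets 50; when one goes idle, the other gets the full 100. A 1 MB/s
throttle, by contrast, stays at 1 MB/s no matter how idle the device is.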

So in general, throttling at the block layer and building back
pressure is fine. I am concerned about two cases.

- How to handle NFS?
- Do filesystem developers agree with this approach, and are they
  willing to address any serialization issues arising from this design?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]   ` <20120404145134.GC12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-04 15:36     ` Steve French
  2012-04-04 18:49     ` Tejun Heo
  2012-04-07  8:00     ` Jan Kara
  2 siblings, 0 replies; 262+ messages in thread
From: Steve French @ 2012-04-04 15:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 4, 2012 at 9:51 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
>
> Hi Tejun,
>
> Thanks for the RFC and looking into this issue. Few thoughts inline.
>
> [..]
>> IIUC, without cgroup, the current writeback code works more or less
>> like this.  Throwing in cgroup doesn't really change the fundamental
>> design.  Instead of a single pipe going down, we just have multiple
>> pipes to the same device, each of which should be treated separately.
>> Of course, a spinning disk can't be divided that easily and their
>> performance characteristics will be inter-dependent, but the place to
>> solve that problem is where the problem is, the block layer.
>
> How do you take care of throttling IO in the NFS case in this model?
> The current throttling logic is tied to a block device, and in the
> case of NFS there is no block device.

Similarly, smb2 gets congestion info (a number of "credits") returned
from the server on every response - but I am not sure why congestion
control is tied to the block device, when this creates problems for
network file systems.
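The credit scheme described above is a windowed flow-control protocol
that needs no block device: the server bounds how many requests the
client may have in flight. A hypothetical sketch (struct and field
names are invented for illustration, not taken from the cifs code):

```c
#include <stdbool.h>

/* Toy SMB2-style credit accounting: each request consumes a credit,
 * each response may grant more.  When credits run out, the sender
 * must wait - back pressure coming straight from the server. */
struct smb_conn {
	int credits;	/* sends the server has authorized */
	int in_flight;	/* requests awaiting a response */
};

static bool send_request(struct smb_conn *c)
{
	if (c->credits == 0)
		return false;	/* window closed: caller must wait */
	c->credits--;
	c->in_flight++;
	return true;
}

/* Each server response returns the slot and may grant new credits. */
static void recv_response(struct smb_conn *c, int granted)
{
	c->in_flight--;
	c->credits += granted;	/* server reopens the window */
}
```

The same pattern would let a network filesystem exert per-cgroup back
pressure without any reference to a block device.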

-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-03 18:36 ` Tejun Heo
  (?)
@ 2012-04-04 17:51     ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-04 17:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Hi Tejun,

On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
> Hello, guys.
> 
> So, during LSF, I, Fengguang and Jan had a chance to sit down and talk
> about how to add cgroup support to writeback.  Here's what I got from it.
> 
> Fengguang's opinion is that the throttling algorithm implemented in
> writeback is good enough and blkcg parameters can be exposed to
> writeback such that those limits can be applied from writeback.  As
> for reads and direct IOs, Fengguang opined that the algorithm can
> easily be extended to cover those cases and IIUC all IOs, whether
> buffered writes, reads or direct IOs can eventually all go through
> writeback layer which will be the one layer controlling all IOs.
 
Yeah, it should be trivial to apply the balance_dirty_pages()
throttling algorithm to reads and direct IOs. However, up to now I
don't see much added value in *duplicating* the current block IO
controller's functionality, assuming its current users and developers
are happy with it.

I did the buffered write IO controller mainly to fill the gap.  If I
happen to stand in your way, sorry, that was not my intention.
It's a pity and a surprise that Google, as a big user, does not buy
into this simple solution. You may prefer more comprehensive controls,
which may not be easily achievable with the simple scheme. However,
the complexity and overhead involved in throttling the flusher's IOs
really upset me.

The sweet split point would be for balance_dirty_pages() to do
cgroup-aware buffered write throttling and leave the other IOs to the
current blkcg. For this to work well as a total solution for end
users, I hope we can cooperate and figure out ways for the two
throttling entities to work well with each other.
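For readers unfamiliar with the mechanism under discussion:
balance_dirty_pages() pauses a dirtying task so that its dirtying rate
tracks the device's writeout bandwidth, scaled down as dirty pages
approach the threshold. A heavily simplified linear model (the real
control curve is far more elaborate, and every name here is
illustrative):

```c
/* Simplified dirty throttling: at zero dirty pages the task may dirty
 * at full writeback bandwidth; the allowed rate falls linearly and
 * reaches zero at the dirty threshold, where the task is stopped. */
static long task_ratelimit(long write_bw, long nr_dirty, long thresh)
{
	if (nr_dirty >= thresh)
		return 0;	/* over the limit: hard stop */
	return write_bw * (thresh - nr_dirty) / thresh;
}
```

Making this cgroup aware, in the scheme sketched here, would mostly
mean evaluating nr_dirty, thresh, and write_bw per cgroup-bdi pair
instead of globally.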

What I'm interested in is what Google's and other users' usage schemes
are in practice, what their desired interfaces are, and whether and
how the combined bdp+blkcg throttling can fulfill those goals.

> Unfortunately, I don't agree with that at all.  I think it's a gross
> layering violation and lacks any longterm design.  We have a well
> working model of applying and propagating resource pressure - we apply
> the pressure where the resource exists and propagates the back
> pressure through buffers to upper layers upto the originator.  Think
> about network, the pressure exists or is applied at the in/egress
> points which gets propagated through socket buffers and eventually
> throttles the originator.
> 
> Writeback, without cgroup, isn't different.  It constitutes a part of the
> pressure propagation chain anchored at the IO device.  IO devices
> these days generate very high pressure, which gets propagated through
> the IO sched and buffered requests, which in turn creates pressure at
> writeback.  Here, the buffering happens in page cache and pressure at
> writeback increases the amount of dirty page cache.  Propagating this
> IO pressure to the dirtying task is one of the biggest
> responsibilities of the writeback code, and this is the underlying
> design of the whole thing.
> 
> IIUC, without cgroup, the current writeback code works more or less
> like this.  Throwing in cgroup doesn't really change the fundamental
> design.  Instead of a single pipe going down, we just have multiple
> pipes to the same device, each of which should be treated separately.
> Of course, a spinning disk can't be divided that easily and their
> performance characteristics will be inter-dependent, but the place to
> solve that problem is where the problem is, the block layer.
> 
> We may have to look for optimizations and expose some details to
> improve the overall behavior and such optimizations may require some
> deviation from the fundamental design, but such optimizations should
> be justified and such deviations kept at minimum, so, no, I don't
> think we're gonna be expose blkcg / block / elevator parameters
> directly to writeback.  Unless someone can *really* convince me
> otherwise, I'll be vetoing any change toward that direction.
> 
> Let's please keep the layering clear.  IO limitations will be applied
> at the block layer and pressure will be formed there and then
> propagated upwards eventually to the originator.  Sure, exposing the
> whole information might result in better behavior for certain
> workloads, but down the road, say, in three or five years, devices
> which can be shared without worrying too much about seeks might be
> commonplace and we could be swearing at a disgusting structural mess,
> and sadly various cgroup support seems to be a prominent source of
> such design failures.

Super-fast storage is coming, which will make us regret
over-complicating the IO path.  And spinning disks are not going away
anytime soon: I doubt Google is willing to afford the disk seek costs
on its millions of disks, or has the patience to wait years until all
of the spinning disks are switched to SSDs (if that ever happens).

Sorry, I won't buy into the layering argument and the analogy to
networking. Yeah, networking is a good way to illustrate your "push
back" idea; however, writeback has its own metadata, seeking, etc.
problems.

I'd prefer we base our discussion on real things like complexity,
overhead, and performance, as well as user demands.

It's obvious that your proposal below involves a lot of complexity
and overhead, and will hurt performance. It basically involves

- running concurrent flusher threads for cgroups, which adds back the
  disk seeks and lock contention, and still has problems with sync
  and shared inodes.

- splitting the device queue among cgroups, possibly scaling up the
  pool of writeback pages (and of locked pages in the case of stable
  pages), which could stall random processes in the system

- the mess of metadata handling

- being unnecessarily coupled with memcg, in order to take advantage
  of the per-memcg dirty limits for balance_dirty_pages() to actually
  convert the "pushed back" dirty page pressure into a lowered dirty
  rate. Why the hell do users *have to* set up memcg (suffering all
  the inconvenience and overhead) in order to do IO throttling?
  Please, this is really ugly! And the "back pressure" may constantly
  push the memcg dirty pages to their limits. I'm not going to support
  *misuse* of the per-memcg dirty limits like this!

I cannot believe you would keep overlooking all these problems without
good reason. Please do tell us the reasons that matter.

Thanks,
Fengguang

> IMHO, treating cgroup - device/bdi pair as a separate device should
> suffice as the underlying design.  After all, blkio cgroup support's
> ultimate goal is dividing the IO resource into separate bins.
> Implementation details might change as underlying technology changes
> and we learn more about how to do it better but that is the goal which
> we'll always try to keep close to.  Writeback should (be able to)
> treat them as separate devices.  We surely will need adjustments and
> optimizations to make things work at least somewhat reasonably but
> that is the baseline.
> 
> In the discussion, for such implementation, the following obstacles
> were identified.
> 
> * There are a lot of cases where IOs are issued by a task which isn't
>   the originator.  ie.  Writeback issues IOs for pages which are
>   dirtied by some other tasks.  So, by the time an IO reaches the
>   block layer, we don't know which cgroup the IO belongs to.
> 
>   Recently, block layer has grown support to attach a task to a bio
>   which causes the bio to be handled as if it were issued by the
>   associated task regardless of the actual issuing task.  It currently
>   only allows attaching %current to a bio - bio_associate_current() -
>   but changing it to support other tasks is trivial.
> 
>   We'll need to update the async issuers to tag the IOs they issue but
>   the mechanism is already there.
> 
> * There's a single request pool shared by all issuers per a request
>   queue.  This can lead to priority inversion among cgroups.  Note
>   that problem also exists without cgroups.  Lower ioprio issuer may
>   be holding a request holding back highprio issuer.
> 
>   We'll need to make request allocation cgroup (and hopefully ioprio)
>   aware.  Probably in the form of separate request pools.  This will
>   take some work but I don't think this will be too challenging.  I'll
>   work on it.
> 
> * cfq cgroup policy throws all async IOs, which all buffered writes
>   are, into the shared cgroup regardless of the actual cgroup.  This
>   behavior is, I believe, mostly historical and changing it isn't
>   difficult.  Probably only a few tens of lines of changes.  This may
>   cause significant changes to actual IO behavior with cgroups,
>   though.  I personally think the previous behavior was too wrong to
>   keep (the weight was completely ignored for buffered writes) but we
>   may want to introduce a switch to toggle between the two behaviors.
> 
>   Note that blk-throttle doesn't have this problem.
> 
> * Unlike dirty data pages, metadata tends to have strict ordering
>   requirements and thus is susceptible to priority inversion.  Two
>   solutions were suggested - 1. allow overdraw for metadata writes so
>   that low-prio metadata writes don't block the whole FS, 2. provide
>   an interface to query and wait for bdi-cgroup congestion which can
>   be called from FS metadata paths to throttle metadata operations
>   before they enter the stream of ordered operations.
> 
>   I think combination of the above two should be enough for solving
>   the problem.  I *think* the second can be implemented as part of
>   cgroup aware request allocation update.  The first one needs a bit
>   more thinking but there can be easier interim solutions (e.g. throw
>   META writes to the head of the cgroup queue or just plain ignore
>   cgroup limits for META writes) for now.
> 
> * I'm sure there are a lot of design choices to be made in the
>   writeback implementation but IIUC Jan seems to agree that the
>   simplest would be to simply treat different cgroup-bdi pairs as
>   completely separate, which shouldn't add too much complexity to the
>   already intricate writeback code.
> 
> So, I think we have something which sounds like a plan, which at least
> I can agree with and seems doable without adding a lot of complexity.
> 
> Jan, Fengguang, I'm pretty sure I missed some stuff from writeback's
> side and IIUC Fengguang doesn't agree with this approach too much, so
> please voice your opinions & comments.
> 
> Thank you.
> 
> --
> tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-04 17:51     ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-04 17:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hi Tejun,

On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
> Hello, guys.
> 
> So, during LSF, I, Fengguang and Jan had a chance to sit down and talk
> about how to add cgroup support to writeback.  Here's what I got from it.
> 
> Fengguang's opinion is that the throttling algorithm implemented in
> writeback is good enough and blkcg parameters can be exposed to
> writeback such that those limits can be applied from writeback.  As
> for reads and direct IOs, Fengguang opined that the algorithm can
> easily be extended to cover those cases and IIUC all IOs, whether
> buffered writes, reads or direct IOs can eventually all go through
> writeback layer which will be the one layer controlling all IOs.
 
Yeah, it should be trivial to apply the balance_dirty_pages()
throttling algorithm to read/direct IOs. However, up to now I don't
see much added value in *duplicating* the current block IO controller
functionality, assuming the current users and developers are happy
with it.
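As a rough sketch of the scheme being referred to, here is a toy model
of balance_dirty_pages()-style throttling. This is simplified invented
arithmetic, not the actual kernel code: the idea is only that a
dirtying task gets paused in proportion to how far dirty pages sit
above a setpoint, so its dirty rate converges toward the device's
measured writeout bandwidth.

```c
#include <assert.h>

/* Toy model of balance_dirty_pages()-style throttling -- NOT the real
 * kernel code.  A task that dirties pages is paused in proportion to
 * how far dirty pages sit above a setpoint, so its dirty rate
 * converges toward the device's measured writeout bandwidth. */

/* Position ratio: 1.0 at or below the setpoint, 0.0 at the limit. */
static double pos_ratio(long dirty, long setpoint, long limit)
{
    if (dirty <= setpoint)
        return 1.0;
    if (dirty >= limit)
        return 0.0;
    return (double)(limit - dirty) / (double)(limit - setpoint);
}

/* Pause (in ms) for a task that just dirtied `pages` pages, given the
 * bdi's writeout bandwidth (pages/sec) scaled down by pos_ratio. */
static long throttle_pause_ms(long pages, long write_bw, double ratio)
{
    double task_rate = write_bw * ratio;    /* allowed dirty rate */
    if (task_rate < 1.0)
        return 200;                         /* clamp at max pause */
    long pause = (long)(pages * 1000.0 / task_rate);
    return pause > 200 ? 200 : pause;
}
```

Per-cgroup limits would plug in as a per-group setpoint/limit pair
feeding the same pause computation.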

I did the buffered write IO controller mainly to fill the gap.  If I
happen to stand in your way, sorry, that's not my intention.  It's a
pity and a surprise that Google, as a big user, does not buy into this
simple solution.  You may prefer more comprehensive controls which may
not be easily achievable with the simple scheme.  However, the
complexities and overheads involved in throttling the flusher IOs
really upset me.

The sweet split point would be for balance_dirty_pages() to do
cgroup-aware buffered write throttling and leave the other IOs to the
current blkcg.  For this to work well as a total solution for end
users, I hope we can cooperate and figure out ways for the two
throttling entities to work well with each other.

What I'm interested in is what Google's and other users' usage schemes
are in practice, what their desired interfaces are, and whether and
how the combined bdp+blkcg throttling can fulfill those goals.

> Unfortunately, I don't agree with that at all.  I think it's a gross
> layering violation and lacks any longterm design.  We have a well
> working model of applying and propagating resource pressure - we apply
> the pressure where the resource exists and propagates the back
> pressure through buffers to upper layers upto the originator.  Think
> about network, the pressure exists or is applied at the in/egress
> points which gets propagated through socket buffers and eventually
> throttles the originator.
> 
> Writeback, without cgroup, isn't different.  It constitutes a part
> of the pressure propagation chain anchored at the IO device.  IO
> devices these days generate very high pressure, which gets propagated
> through the IO sched and buffered requests, which in turn creates
> pressure at writeback.  Here, the buffering happens in page cache and
> pressure at writeback increases the amount of dirty page cache.
> Propagating this IO pressure to the dirtying task is one of the
> biggest responsibilities of the writeback code, and this is the
> underlying design of the whole thing.
> 
> IIUC, without cgroup, the current writeback code works more or less
> like this.  Throwing in cgroup doesn't really change the fundamental
> design.  Instead of a single pipe going down, we just have multiple
> pipes to the same device, each of which should be treated separately.
> Of course, a spinning disk can't be divided that easily and their
> performance characteristics will be inter-dependent, but the place to
> solve that problem is where the problem is, the block layer.
> 
> We may have to look for optimizations and expose some details to
> improve the overall behavior and such optimizations may require some
> deviation from the fundamental design, but such optimizations should
> be justified and such deviations kept at a minimum, so, no, I don't
> think we're gonna be exposing blkcg / block / elevator parameters
> directly to writeback.  Unless someone can *really* convince me
> otherwise, I'll be vetoing any change toward that direction.
> 
> Let's please keep the layering clear.  IO limitations will be applied
> at the block layer and pressure will be formed there and then
> propagated upwards eventually to the originator.  Sure, exposing the
> whole information might result in better behavior for certain
> workloads, but down the road, say, in three or five years, devices
> which can be shared without worrying too much about seeks might be
> commonplace and we could be swearing at a disgusting structural mess,
> and sadly various cgroup support seems to be a prominent source of
> such design failures.

Super-fast storage is coming, which will make us regret making the IO
path overly complex.  Meanwhile, spinning disks are not going away
anytime soon.  I doubt Google is willing to afford the disk-seek costs
on its millions of disks, or has the patience to wait until all the
spinning disks are switched to SSDs years later (if that ever happens).

Sorry, I won't buy the layering argument and the analogy to
networking.  Yeah, networking is a good way to show your "push back"
idea; however, writeback has its own metadata, seeking, etc. problems.

I'd prefer we base our discussions on real things like complexity,
overhead, and performance, as well as user demands.

It's obvious that your proposal below involves a lot of complexity and
overhead, and will hurt performance.  It basically involves

- running concurrent flusher threads for cgroups, which adds back the
  disk seeks and lock contention, and still has problems with sync
  and shared inodes

- splitting the device queue among cgroups, possibly scaling up the
  pool of writeback pages (and of locked pages in the case of stable
  pages), which could stall random processes in the system

- the mess of metadata handling

- unnecessary coupling with memcg, in order to take advantage of the
  per-memcg dirty limits for balance_dirty_pages() to actually convert
  the "pushed back" dirty-page pressure into a lowered dirty rate. Why
  the hell do users *have to* set up memcg (suffering all its
  inconvenience and overhead) in order to do IO throttling?  Please,
  this is really ugly! And the "back pressure" may constantly push the
  memcg dirty pages to their limits. I'm not going to support *misuse*
  of per-memcg dirty limits like this!

I cannot believe you would keep overlooking all these problems without
good reasons.  Please do tell us the reasons that matter.

Thanks,
Fengguang

> IMHO, treating cgroup - device/bdi pair as a separate device should
> suffice as the underlying design.  After all, blkio cgroup support's
> ultimate goal is dividing the IO resource into separate bins.
> Implementation details might change as underlying technology changes
> and we learn more about how to do it better but that is the goal which
> we'll always try to keep close to.  Writeback should (be able to)
> treat them as separate devices.  We surely will need adjustments and
> optimizations to make things work at least somewhat reasonably but
> that is the baseline.
> 
> In the discussion, for such implementation, the following obstacles
> were identified.
> 
> * There are a lot of cases where IOs are issued by a task which isn't
>   the originator.  ie. Writeback issues IOs for pages which are
>   dirtied by some other tasks.  So, by the time an IO reaches the
>   block layer, we don't know which cgroup the IO belongs to.
> 
>   Recently, block layer has grown support to attach a task to a bio
>   which causes the bio to be handled as if it were issued by the
>   associated task regardless of the actual issuing task.  It currently
>   only allows attaching %current to a bio - bio_associate_current() -
>   but changing it to support other tasks is trivial.
> 
>   We'll need to update the async issuers to tag the IOs they issue but
>   the mechanism is already there.
> 
> * There's a single request pool shared by all issuers per request
>   queue.  This can lead to priority inversion among cgroups.  Note
>   that the problem also exists without cgroups: a lower-ioprio issuer
>   may be holding a request, holding back a higher-prio issuer.
> 
>   We'll need to make request allocation cgroup (and hopefully ioprio)
>   aware.  Probably in the form of separate request pools.  This will
>   take some work but I don't think this will be too challenging.  I'll
>   work on it.
> 
> * cfq cgroup policy throws all async IOs, which all buffered writes
>   are, into the shared cgroup regardless of the actual cgroup.  This
>   behavior is, I believe, mostly historical and changing it isn't
>   difficult.  Probably only a few tens of lines of changes.  This may
>   cause significant changes to actual IO behavior with cgroups,
>   though.  I personally think the previous behavior was too wrong to
>   keep (the weight was completely ignored for buffered writes) but we
>   may want to introduce a switch to toggle between the two behaviors.
> 
>   Note that blk-throttle doesn't have this problem.
> 
> * Unlike dirty data pages, metadata tends to have strict ordering
>   requirements and thus is susceptible to priority inversion.  Two
>   solutions were suggested - 1. allow overdraw for metadata writes so
>   that low-prio metadata writes don't block the whole FS, 2. provide
>   an interface to query and wait for bdi-cgroup congestion which can
>   be called from FS metadata paths to throttle metadata operations
>   before they enter the stream of ordered operations.
> 
>   I think combination of the above two should be enough for solving
>   the problem.  I *think* the second can be implemented as part of
>   cgroup aware request allocation update.  The first one needs a bit
>   more thinking but there can be easier interim solutions (e.g. throw
>   META writes to the head of the cgroup queue or just plain ignore
>   cgroup limits for META writes) for now.
> 
> * I'm sure there are a lot of design choices to be made in the
>   writeback implementation but IIUC Jan seems to agree that the
>   simplest would be to simply treat different cgroup-bdi pairs as
>   completely separate, which shouldn't add too much complexity to the
>   already intricate writeback code.
> 
> So, I think we have something which sounds like a plan, which at least
> I can agree with and seems doable without adding a lot of complexity.
> 
> Jan, Fengguang, I'm pretty sure I missed some stuff from writeback's
> side and IIUC Fengguang doesn't agree with this approach too much, so
> please voice your opinions & comments.
> 
> Thank you.
> 
> --
> tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [RFC] writeback and cgroup
  2012-04-04 17:51     ` Fengguang Wu
                       ` (2 preceding siblings ...)
  (?)
@ 2012-04-04 18:35     ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-04 18:35 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:

[..]
> The sweet split point would be for balance_dirty_pages() to do cgroup
> aware buffered write throttling and leave other IOs to the current
> blkcg. For this to work well as a total solution for end users, I hope
> we can cooperate and figure out ways for the two throttling entities
> to work well with each other.

Throttling read + direct IO higher up has a few issues too.  Users
will not like that a task gets blocked when it tries to submit a read
from a throttled group.  The current async behavior works well: we
queue up the bio from the task in the throttled group and let the task
do other things.  The same is true for AIO, where we would not like to
block in bio submission.
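The async behavior described above can be caricatured in a few lines.
This is a toy sketch with invented names, not the actual blk-throttle
code: a bio submitted over the group's budget is parked on the group
and the submitter returns immediately instead of sleeping; a later
budget replenish (a timer, in the real code) releases the queued bios.

```c
#include <assert.h>
#include <stddef.h>

struct bio { struct bio *next; };

struct throtl_grp {
    long budget;            /* bios the group may still dispatch */
    struct bio *queued;     /* FIFO of held-back bios */
    struct bio **tail;
    int nr_queued;
};

static void tg_init(struct throtl_grp *tg, long budget)
{
    tg->budget = budget;
    tg->queued = NULL;
    tg->tail = &tg->queued;
    tg->nr_queued = 0;
}

/* Returns 1 if dispatched immediately, 0 if queued (caller moves on). */
static int tg_submit(struct throtl_grp *tg, struct bio *bio)
{
    if (tg->budget > 0) {
        tg->budget--;
        return 1;               /* dispatched to the device */
    }
    bio->next = NULL;           /* over budget: park it, don't block */
    *tg->tail = bio;
    tg->tail = &bio->next;
    tg->nr_queued++;
    return 0;
}

/* Budget replenish (timer path): release queued bios, FIFO order. */
static int tg_replenish(struct throtl_grp *tg, long tokens)
{
    int released = 0;
    tg->budget += tokens;
    while (tg->queued && tg->budget > 0) {
        tg->queued = tg->queued->next;
        tg->budget--;
        tg->nr_queued--;
        released++;
    }
    if (!tg->queued)
        tg->tail = &tg->queued;
    return released;
}
```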

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [RFC] writeback and cgroup
       [not found]   ` <20120404145134.GC12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2012-04-04 15:36     ` Steve French
@ 2012-04-04 18:49     ` Tejun Heo
  2012-04-07  8:00     ` Jan Kara
  2 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-04 18:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

Hey, Vivek.

On Wed, Apr 04, 2012 at 10:51:34AM -0400, Vivek Goyal wrote:
> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
> > IIUC, without cgroup, the current writeback code works more or less
> > like this.  Throwing in cgroup doesn't really change the fundamental
> > design.  Instead of a single pipe going down, we just have multiple
> > pipes to the same device, each of which should be treated separately.
> > Of course, a spinning disk can't be divided that easily and their
> > performance characteristics will be inter-dependent, but the place to
> > solve that problem is where the problem is, the block layer.
> 
> How do you take care of thorottling IO to NFS case in this model? Current
> throttling logic is tied to block device and in case of NFS, there is no
> block device.

On principle, I don't think it has to be any different.  A
filesystem's interface to the underlying device is through the bdi.
If a fs is block backed, block pressure should be propagated through
the bdi, which should be mostly trivial.  If a fs is network backed,
we can implement a mechanism for network-backed bdis, so that they can
relay the pressure from the server side to the local fs users.

That said, network filesystems often show different behaviors and use
different mechanisms for various reasons and it wouldn't be too
surprising if something different would fit them better here or we
might need something supplemental to the usual mechanism.
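A relay of that sort could look roughly like the following toy sketch.
All names here are invented for illustration; the point is only the
shape of the mechanism: server-side feedback flips a congestion state
on the bdi, and the local writeback path turns that into a backoff
instead of piling up more writes.

```c
#include <assert.h>

/* Toy sketch of a network-backed bdi relaying server-side pressure.
 * Hypothetical names -- not a real kernel interface. */

struct net_bdi {
    int write_congested;   /* set from server-side feedback */
    long backoff_ms;       /* how long local writers should wait */
};

/* Server ack path: derive congestion state from the server's window. */
static void net_bdi_update(struct net_bdi *bdi, long server_window)
{
    bdi->write_congested = (server_window == 0);
    bdi->backoff_ms = bdi->write_congested ? 100 : 0;
}

/* Local writeback path: how long must this writer back off? */
static long net_bdi_throttle(const struct net_bdi *bdi)
{
    return bdi->write_congested ? bdi->backoff_ms : 0;
}
```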

> [..]
> > In the discussion, for such implementation, the following obstacles
> > were identified.
> > 
> > * There are a lot of cases where IOs are issued by a task which isn't
> >   the originator.  ie. Writeback issues IOs for pages which are
> >   dirtied by some other tasks.  So, by the time an IO reaches the
> >   block layer, we don't know which cgroup the IO belongs to.
> > 
> >   Recently, block layer has grown support to attach a task to a bio
> >   which causes the bio to be handled as if it were issued by the
> >   associated task regardless of the actual issuing task.  It currently
> >   only allows attaching %current to a bio - bio_associate_current() -
> >   but changing it to support other tasks is trivial.
> > 
> >   We'll need to update the async issuers to tag the IOs they issue but
> >   the mechanism is already there.
> 
> Most likely this tagging will take place in "struct page" and I am not
> sure if we will be allowed to grow the size of "struct page" for this
> reason.

With memcg enabled, we are already doing that and IIUC Jan and
Fengguang think that using inode granularity should be good enough for
writeback blaming.
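If inode granularity is good enough, the blaming could look roughly
like this toy sketch (invented names, not the kernel's actual data
structures): the cgroup is recorded once per inode when it is first
dirtied, and every bio writeback issues for that inode is tagged with
it, so no per-page state is needed.

```c
#include <assert.h>

/* Toy sketch of inode-granularity writeback blaming -- hypothetical
 * names.  The first dirtier's cgroup id is recorded on the inode and
 * stamped onto every bio issued for that inode. */

struct toy_inode { int dirtier_css_id; };   /* 0 == not yet blamed */
struct toy_bio   { int css_id; };

/* Dirtying path: record the first dirtier's cgroup; later dirtiers
 * of the same inode are ignored (first dirtier wins). */
static void inode_attach_dirtier(struct toy_inode *inode, int css_id)
{
    if (inode->dirtier_css_id == 0)
        inode->dirtier_css_id = css_id;
}

/* Writeback path: tag the bio with the inode's recorded cgroup. */
static void writeback_tag_bio(const struct toy_inode *inode,
                              struct toy_bio *bio)
{
    bio->css_id = inode->dirtier_css_id;
}
```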

> > * There's a single request pool shared by all issuers per request
> >   queue.  This can lead to priority inversion among cgroups.  Note
> >   that the problem also exists without cgroups: a lower-ioprio issuer
> >   may be holding a request, holding back a higher-prio issuer.
> > 
> >   We'll need to make request allocation cgroup (and hopefully ioprio)
> >   aware.  Probably in the form of separate request pools.  This will
> >   take some work but I don't think this will be too challenging.  I'll
> >   work on it.
> 
> This should be doable. I had implemented it long back with single request
> pool but internal limits for each group. That is block the task in the
> group if group has enough pending requests allocated from the pool. But
> separate request pool should work equally well. 
> 
> Just that it conflicts a bit with the current definition of
> q->nr_requests, which specifies the total number of outstanding
> requests on the queue. Once you make the pool per group, I guess this
> limit will have to be transformed into a per-group upper limit.

I'm not sure about the details yet.  I *think* the suckiest part is
the actual allocation part.  We're deferring cgroup - request_queue
association until actual usage and depending on atomic allocations to
create those associations on IO path.  Doing the same for requests
might not be too pleasant.  Hmm....  allocation failure handling on
that path is already broken BTW.  Maybe we just need to get the
fallback behavior properly working.  Unsure.
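
To make the request-pool discussion concrete, here's a rough userspace
model (not kernel code) of per-cgroup request pools; the names are
hypothetical, and the blocking that the kernel would do is replaced by
a simple failure return:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of per-cgroup request pools: instead of one shared
 * pool of nr_requests per queue, each cgroup gets its own allocation
 * budget, so a low-prio cgroup filling the pool can no longer starve
 * a high-prio one.  All names here are illustrative.
 */
struct cgrp_rq_pool {
	int nr_requests;	/* per-cgroup upper limit */
	int allocated;		/* requests currently in flight */
};

/* Try to take one request from @pool; fail instead of blocking. */
static bool rq_pool_get(struct cgrp_rq_pool *pool)
{
	if (pool->allocated >= pool->nr_requests)
		return false;	/* this cgroup is over its own limit */
	pool->allocated++;
	return true;
}

/* Return a request to @pool on IO completion. */
static void rq_pool_put(struct cgrp_rq_pool *pool)
{
	assert(pool->allocated > 0);
	pool->allocated--;
}
```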

> > * cfq cgroup policy throws all async IOs, which is what all buffered
> >   writes are, into the shared cgroup regardless of the actual cgroup.
> >   This behavior is, I believe, mostly historical and changing it isn't
> >   difficult.  Probably only a few tens of lines of changes.  This may
> >   cause significant changes to actual IO behavior with cgroups tho.  I
> >   personally think the previous behavior was too wrong to keep (the
> >   weight was completely ignored for buffered writes) but we may want
> >   to introduce a switch to toggle between the two behaviors.
> 
> I had kept all buffered writes in the same cgroup (root cgroup) for a
> few reasons.
> 
> - Because of the single request descriptor pool for writes, one writer
>   gets backlogged behind another anyway. So creating separate async
>   queues per group is not going to help.
> 
> - Writeback logic was not cgroup aware. So it might not send enough IO
>   from each writer to maintain parallelism. So creating separate async
>   queues did not make sense till that was fixed.

Yeah, the above are why I find the "buffered writes need separate
controls because cfq doesn't distinguish async writes" argument very
ironic.  We introduce one quirk to compensate for shortcomings in the
other part, and then later we work around that quirk in the other
part?  I mean, that's just twisted.

> - As you said, it is historical also. We prioritize READS at the expense
>   of writes. Now by putting buffered/async writes in a separate group, we
>   might end up prioritizing a group's async write over another group's
>   synchronous read. How many people really want that behavior? To me
>   keeping service differentiation among the sync IO matters most. Even
>   if all async IO is treated same, I guess not many people might care.

While segregation of async IOs may not matter in some cases, it does
matter to many other use cases, so it seems to me that we hard coded
that optimization decision without thinking too much about it.  For a
lot of container type use cases, the current implementation is nearly
useless (I know of cases where people are explicitly patching for
separate async queues).  At the same time, switching the default
behavior *may* disturb some of the current users and that's why I'm
thinking about having a switch for the new behavior.

> >   Note that blk-throttle doesn't have this problem.
> 
> I am not sure what you are trying to say here. But primarily blk-throttle
> will throttle read and direct IO. Buffered writes will go to root cgroup
> which is typically unthrottled.

Ooh, my bad then.  Anyway, then the same applies to blk-throttle.
Our current implementation essentially collapses in the face of a
write-heavy workload.

> > * Unlike dirty data pages, metadata tends to have strict ordering
> >   requirements and thus is susceptible to priority inversion.  Two
> >   solutions were suggested - 1. allow overdrawal for metadata writes so
> >   that low prio metadata writes don't block the whole FS, 2. provide
> >   an interface to query and wait for bdi-cgroup congestion which can
> >   be called from FS metadata paths to throttle metadata operations
> >   before they enter the stream of ordered operations.
> 
> So that probably will mean changing the order of operations also. IIUC,
> in the case of fsync (ordered mode), we open a metadata transaction first,
> then try to flush all the cached data and then flush the metadata. So if
> fsync is throttled, all the metadata operations behind it will get
> serialized for ext3/ext4.
> 
> So you seem to be suggesting that we change the design so that metadata
> operations are not thrown into the ordered stream till we have finished
> writing all the data back to disk? I am not a filesystem developer, so
> I don't know how feasible this change is.

Jan explained it to me and I don't think it requires extensive changes
to the filesystem.  IIUC, filesystems would just block tasks from
creating journal entries while their matching bdi is congested, and
that's the extent of the necessary change.
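
A minimal sketch of that change as I understand it.  This is
illustrative userspace code; bdi_cgroup_congested() is a hypothetical
helper standing in for whatever congestion interface the block layer
would export, and the real kernel version would sleep on congestion
rather than just test it:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace sketch of the filesystem-side gate: before a task may
 * open a journal transaction, its own bdi-cgroup pair must be
 * uncongested, so a throttled cgroup never enters the ordered
 * journal stream.  All names here are hypothetical. */
struct bdi_cgrp {
	bool congested;		/* set by block-layer back pressure */
};

static bool bdi_cgroup_congested(const struct bdi_cgrp *bc)
{
	return bc->congested;
}

/* Returns true once the caller may start a journal transaction.
 * In the kernel this would wait on a congestion waitqueue instead
 * of polling; modeled here as a simple check. */
static bool journal_start_allowed(const struct bdi_cgrp *bc)
{
	return !bdi_cgroup_congested(bc);
}
```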

> This is just one of the points. In the past while talking to Dave Chinner,
> he mentioned that in XFS, if two cgroups fall into the same allocation
> group, then there were cases where the IO of one cgroup could get
> serialized behind the other's.
> 
> In general, the core of the issue is that filesystems are not cgroup aware
> and if you do throttling below filesystems, then invariably one or another
> serialization issue will come up, and I am concerned that we will be
> constantly fixing those serialization issues. Or the design point could be
> so central to filesystem design that it can't be changed.

So, the idea is to avoid allowing any congested cgroup to enter the
serialized journal.  As there's a time gap until journal commit, the
bdi might be congested by commit time.  In that case, META writes get
to overdraw cgroup limits to avoid causing priority inversion.  I
think we should be able to get most working with bdi congestion check
at the front and limit bypass for META for now.  We can worry about
overdrawing later.
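
The dispatch-time rule could be modeled roughly as below (a userspace
sketch with illustrative names only): ordinary writeback from an
over-limit cgroup is throttled, while META writes are allowed to
overdraw the budget so they can't hold up the ordered journal stream:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the META limit bypass: regular writes from a
 * cgroup that has used up its budget are throttled, but META writes
 * already committed to the ordered journal stream are let through,
 * overdrawing the budget, so they cannot cause priority inversion.
 * Field and function names are illustrative, not kernel APIs. */
struct cgrp_budget {
	long limit;		/* bytes the cgroup may have in flight */
	long in_flight;
};

static bool may_dispatch(struct cgrp_budget *b, long bytes, bool is_meta)
{
	if (!is_meta && b->in_flight + bytes > b->limit)
		return false;		/* throttle ordinary writeback */
	b->in_flight += bytes;		/* META may overdraw the limit */
	return true;
}
```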

> In general, if you do throttling deeper in the stack and build back
> pressure, then all the layers sitting above should be cgroup aware
> to avoid problems. Two layers identified so far are writeback and
> filesystems. Is it really worth the complexity? How about doing
> throttling in higher layers when IO is entering the kernel and
> keep proportional IO logic at the lowest level and current mechanism
> of building pressure continues to work?

First, I just don't think it's the right design.  It's a rather
abstract statement, but I want to emphasize that having the "right"
design - looking at the overall picture and putting configs, controls
and other logic where their roles say they belong - tends to make
long-term development and maintenance much easier in ways which may
not be immediately foreseeable, for both technical and social
reasons.  Logical structuring and layering keep us sane and make
newcomers' lives at least bearable.

Secondly, I don't think it'll be a lot of added complexity.  We *need*
to fix all the said shortcomings in the block layer for proper cgroup
support anyway, right?  Propagating that support upwards doesn't take
too much code.  Other than the metadata thing, it mostly just requires
updates to writeback code such that it deals with bdi-cgroup
combinations instead of individual bdis.  They'll surely require
some adjustments but we're not gonna be burdening the main paths with
cgroup awareness.  cgroup support would just make the existing
implementation work on finer grained domains.

Thirdly, I don't see how writeback can control all the IOs.  I mean,
what about reads or direct IOs?  It's not like IO devices have
separate channels for those different types of IOs.  They interact
heavily.  Let's say we have iops/bps limitation applied on top of
proportional IO distribution or a device holds two partitions and one
of them is being used for direct IO w/o filesystems.  How would that
work?  I think the question goes even deeper, what do the separate
limits even mean?  Does the IO sched have to calculate allocation of
IO resource to different types of IOs and then give a "number" to
writeback which in turn enforces that limit?  How does the elevator
know what number to give?  Is the number iops or bps or weight?  If
the iosched doesn't know how much write workload exists, how does it
distribute the surplus buffered writeback resource across different
cgroups?  If so, what makes the limit actually enforceable (due to
inaccuracies in estimation, fluctuation in workload, delay in
enforcement in different layers and whatnot) except for block layer
applying the limit *again* on the resulting stream of combined IOs?

Fourthly, having clear layering usually means much more flexibility.
The assumptions about IO characteristics that we have are still mostly
based on devices with spindles, probably because they're still causing
the most pain.  The assumptions keep changing, and if we get
the layering correct, we can mostly deal with changes at the layers
concerning them - ie. in the block layer.  Maybe we'll have a
different iosched or cfq can be evolved to cover the new cases, but
the required adaptation would be logical and while upper layers might
need some adjustments they wouldn't need any major overhaul.  They'll
be still working from back pressure from IO.

So, the above are the reasons why I don't like the idea of splitting
IO control across multiple layers - well, the ones that I can think
of right now anyway.  I'm currently feeling rather strongly about
them, in the sense of "oh no, this is about to be messed up", but
maybe I'm just not seeing what Fengguang is seeing.  I'll keep
discussing there.

> So in general throttling at block layer and building back pressure is
> fine. I am concerned about two cases.
> 
> - How to handle NFS.

As said above, maybe through network-based bdi pressure propagation,
maybe some other special-case mechanism.  I'm unsure, but I don't
think this concern should dictate the whole design.

> - Do filesystem developers agree with this approach and are they willing
>   to address any serialization issues arising due to this design.

Jan, can you please fill in?  Did I understand it correctly?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-04 18:49     ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-04 18:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, Jan Kara, Jens Axboe,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, sjayaraman-IBi9RG/b67k,
	andrea-oIIqvOZpAevzfdHfmsDf5w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	lizefan-hv44wF8Li93QT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hey, Vivek.

On Wed, Apr 04, 2012 at 10:51:34AM -0400, Vivek Goyal wrote:
> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
> > IIUC, without cgroup, the current writeback code works more or less
> > like this.  Throwing in cgroup doesn't really change the fundamental
> > design.  Instead of a single pipe going down, we just have multiple
> > pipes to the same device, each of which should be treated separately.
> > Of course, a spinning disk can't be divided that easily and their
> > performance characteristics will be inter-dependent, but the place to
> > solve that problem is where the problem is, the block layer.
> 
> How do you take care of thorottling IO to NFS case in this model? Current
> throttling logic is tied to block device and in case of NFS, there is no
> block device.

On principle, I don't think it has be any different.  Filesystems's
interface to the underlying device is through bdi.  If a fs is block
backed, block pressure should be propagated through bdi, which should
be mostly trivial.  If a fs is network backed, we can implement a
mechanism for network backed bdis, so that they can relay the pressure
from the server side to the local fs users.

That said, network filesystems often show different behaviors and use
different mechanisms for various reasons and it wouldn't be too
surprising if something different would fit them better here or we
might need something supplemental to the usual mechanism.

> [..]
> > In the discussion, for such implementation, the following obstacles
> > were identified.
> > 
> > * There are a lot of cases where IOs are issued by a task which isn't
> >   the originiator.  ie. Writeback issues IOs for pages which are
> >   dirtied by some other tasks.  So, by the time an IO reaches the
> >   block layer, we don't know which cgroup the IO belongs to.
> > 
> >   Recently, block layer has grown support to attach a task to a bio
> >   which causes the bio to be handled as if it were issued by the
> >   associated task regardless of the actual issuing task.  It currently
> >   only allows attaching %current to a bio - bio_associate_current() -
> >   but changing it to support other tasks is trivial.
> > 
> >   We'll need to update the async issuers to tag the IOs they issue but
> >   the mechanism is already there.
> 
> Most likely this tagging will take place in "struct page" and I am not
> sure if we will be allowed to grow size of "struct page" for this reason.

With memcg enabled, we are already doing that and IIUC Jan and
Fengguang think that using inode granularity should be good enough for
writeback blaming.

> > * There's a single request pool shared by all issuers per a request
> >   queue.  This can lead to priority inversion among cgroups.  Note
> >   that problem also exists without cgroups.  Lower ioprio issuer may
> >   be holding a request holding back highprio issuer.
> > 
> >   We'll need to make request allocation cgroup (and hopefully ioprio)
> >   aware.  Probably in the form of separate request pools.  This will
> >   take some work but I don't think this will be too challenging.  I'll
> >   work on it.
> 
> This should be doable. I had implemented it long back with single request
> pool but internal limits for each group. That is block the task in the
> group if group has enough pending requests allocated from the pool. But
> separate request pool should work equally well. 
> 
> Just that it conflits a bit with current definition of q->nr_requests.
> Which specifies number of total outstanding requests on the queue. Once
> you make the pool per queue, I guess this limit will have to be
> transformed into per group upper limit.

I'm not sure about the details yet.  I *think* the suckiest part is
the actual allocation part.  We're deferring cgroup - request_queue
association until actual usage and depending on atomic allocations to
create those associations on IO path.  Doing the same for requests
might not be too pleasant.  Hmm....  allocation failure handling on
that path is already broken BTW.  Maybe we just need to get the
fallback behavior properly working.  Unsure.

> > * cfq cgroup policy throws all async IOs, which all buffered writes
> >   are, into the shared cgroup regardless of the actual cgroup.  This
> >   behavior is, I believe, mostly historical and changing it isn't
> >   difficult.  Prolly only few tens of lines of changes.  This may
> >   cause significant changes to actual IO behavior with cgroups tho.  I
> >   personally think the previous behavior was too wrong to keep (the
> >   weight was completely ignored for buffered writes) but we may want
> >   to introduce a switch to toggle between the two behaviors.
> 
> I had kept all buffered writes in in same cgroup (root cgroup) for few
> reasons.
> 
> - Because of single request descriptor pool for writes, anyway one writer
>   gets backlogged behind other. So creating separate async queues per
>   group is not going to help.
> 
> - Writeback logic was not cgroup aware. So it might not send enough IO
>   from each writer to maintain parallelism. So creating separate async
>   queues did not make sense till that was fixed.

Yeah, the above are why I find "buffered writes need separate controls
because cfq doesn't distinguish async writes" argument very ironic.
We introduce one quirk to compensate for shortages in the other part
and then later we work that around in that other part for that quirk?
I mean, that's just twisted.

> - As you said, it is historical also. We prioritize READS at the expense
>   of writes. Now by putting buffered/async writes in a separate group, we
>   will might end up prioritizing a group's async write over other group's
>   synchronous read. How many people really want that behavior? To me
>   keeping service differentiation among the sync IO matters most. Even
>   if all async IO is treated same, I guess not many people might care.

While segregation of async IOs may not matter in some cases, it does
matter to many other use cases, so it seems to me that we hard coded
that optimization decision without thinking too much about it.  For a
lot of container type use cases, the current implementation is nearly
useless (I know of cases where people are explicitly patching for
separate async queues).  At the same time, switching the default
behavior *may* disturb some of the current users and that's why I'm
thinking abut having a switch for the new behavior.

> >   Note that blk-throttle doesn't have this problem.
> 
> I am not sure what are you trying to say here. But primarily blk-throttle
> will throttle read and direct IO. Buffered writes will go to root cgroup
> which is typically unthrottled.

Ooh, my bad then.  Anyways, then the same applies to blk-throttle.
Our current implementation essentially collapses in the face of a
write-heavy workload.

> > * Unlike dirty data pages, metadata tends to have strict ordering
> >   requirements and thus is susceptible to priority inversion.  Two
> >   solutions were suggested - 1. allow overdraw for metadata writes so
> >   that low prio metadata writes don't block the whole FS, 2. provide
> >   an interface to query and wait for bdi-cgroup congestion which can
> >   be called from FS metadata paths to throttle metadata operations
> >   before they enter the stream of ordered operations.
> 
> So that probably will mean changing the order of operations also. IIUC, 
> in case of fsync (ordered mode), we opened a meta data transaction first,
> then tried to flush all the cached data and then flush metadata. So if
> fsync is throttled, all the metadata operations behind it will get 
> serialized for ext3/ext4.
> 
> So you seem to be suggesting that we change the design so that metadata
> operations are not thrown into the ordered stream till we have finished
> writing all the data back to disk? I am not a filesystem developer, so
> I don't know how feasible this change is.

Jan explained it to me and I don't think it requires extensive changes
to the filesystem.  IIUC, filesystems would just block tasks creating
journal entries while their matching bdi is congested, and that's the
extent of the necessary change.

> This is just one of the points. In the past while talking to Dave Chinner,
> he mentioned that in XFS, if two cgroups fall into same allocation group
> then there were cases where IO of one cgroup can get serialized behind
> other.
> 
> In general, the core of the issue is that filesystems are not cgroup aware
> and if you do throttling below filesystems, then invariably one or other
> serialization issue will come up and I am concerned that we will be constantly
> fixing those serialization issues. Or the design point could be so central
> to filesystem design that it can't be changed.

So, the idea is to avoid allowing any congested cgroup to enter
serialized journal.  As there's time gap until journal commit, the bdi
might be congested by the commit time.  In that case, META writes get
to overdraw cgroup limits to avoid causing priority inversion.  I
think we should be able to get most of this working with a bdi
congestion check at the front and a limit bypass for META for now.  We
can worry about overdrawing later.

> In general, if you do throttling deeper in the stack and build back
> pressure, then all the layers sitting above should be cgroup aware
> to avoid problems. Two layers identified so far are writeback and
> filesystems. Is it really worth the complexity? How about doing
> throttling in higher layers when IO is entering the kernel and
> keep proportional IO logic at the lowest level and current mechanism
> of building pressure continues to work?

First, I just don't think it's the right design.  That's a rather
abstract statement, but I want to emphasize that having the "right"
design - looking at the overall picture and putting configs, controls
and other logic where their roles say they belong - tends to make
long-term development and maintenance much easier in ways which may
not be immediately foreseeable, for both technical and social reasons.
Logical structuring and layering keep us sane and make newcomers'
lives at least bearable.

Secondly, I don't think it'll be a lot of added complexity.  We *need*
to fix all the said shortcomings in the block layer for proper cgroup
support anyway, right?  Propagating that support upwards doesn't take
too much code.  Other than the metadata thing, it mostly just requires
updates to the writeback code so that it deals with bdi-cgroup
combinations instead of whole bdis.  They'll surely require
some adjustments but we're not gonna be burdening the main paths with
cgroup awareness.  cgroup support would just make the existing
implementation work on finer grained domains.

Thirdly, I don't see how writeback can control all the IOs.  I mean,
what about reads or direct IOs?  It's not like IO devices have
separate channels for those different types of IOs.  They interact
heavily.  Let's say we have iops/bps limitation applied on top of
proportional IO distribution or a device holds two partitions and one
of them is being used for direct IO w/o filesystems.  How would that
work?  I think the question goes even deeper, what do the separate
limits even mean?  Does the IO sched have to calculate allocation of
IO resource to different types of IOs and then give a "number" to
writeback which in turn enforces that limit?  How does the elevator
know what number to give?  Is the number iops or bps or weight?  If
the iosched doesn't know how much write workload exists, how does it
distribute the surplus buffered writeback resource across different
cgroups?  If so, what makes the limit actually enforceable (due to
inaccuracies in estimation, fluctuation in workload, delay in
enforcement in different layers and whatnot) except for block layer
applying the limit *again* on the resulting stream of combined IOs?

Fourthly, having clear layering usually means much more flexibility.
The assumptions about IO characteristics that we have are still mostly
based on devices with spindles, probably because they're still causing
the most amount of pain.  The assumptions keep changing and if we get
the layering correct, we can mostly deal with changes at the layers
concerning them - ie. in the block layer.  Maybe we'll have a
different iosched or cfq can be evolved to cover the new cases, but
the required adaptation would be logical and while upper layers might
need some adjustments they wouldn't need any major overhaul.  They'll
be still working from back pressure from IO.

So, the above are the reasons why I don't like the idea of splitting
IO control across multiple layers, well the ones that I can think of
right now anyway.  I'm currently feeling rather strong about them in
the sense of "oh no, this is about to be messed up" but maybe I'm just
not seeing what Fengguang is seeing.  I'll keep discussing there.

> So in general throttling at block layer and building back pressure is
> fine. I am concerned about two cases.
> 
> - How to handle NFS.

As said above, maybe through network-based bdi pressure propagation,
maybe through some other special-case mechanism.  I'm unsure, but I
don't think this concern should dictate the whole design.

> - Do filesystem developers agree with this approach and are they willing
>   to address any serialization issues arising due to this design.

Jan, can you please fill in?  Did I understand it correctly?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]     ` <CAH2r5mtwQa0Uu=_Yd2JywVJXA=OMGV43X_OUfziC-yeVy9BGtQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-04-04 18:56       ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-04 18:56 UTC (permalink / raw)
  To: Steve French
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

On Wed, Apr 04, 2012 at 10:36:04AM -0500, Steve French wrote:
> > > How do you take care of throttling IO to NFS case in this model? Current
> > throttling logic is tied to block device and in case of NFS, there is no
> > block device.
> 
> Similarly smb2 gets congestion info (number of "credits") returned from
> the server on every response - but I'm not sure why congestion
> control is tied to the block device when this would create
> problems for network file systems

I hope the previous replies answered this.  It's about writeback
getting pressure from bdi and isn't restricted to block devices.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]       ` <20120404185605.GC29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
@ 2012-04-04 19:19         ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-04 19:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Steve French,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 04, 2012 at 11:56:05AM -0700, Tejun Heo wrote:
> On Wed, Apr 04, 2012 at 10:36:04AM -0500, Steve French wrote:
> > > How do you take care of throttling IO to NFS case in this model? Current
> > > throttling logic is tied to block device and in case of NFS, there is no
> > > block device.
> > 
> > Similarly smb2 gets congestion info (number of "credits") returned from
> > the server on every response - but not sure why congestion
> > control is tied to the block device when this would create
> > problems for network file systems
> 
> I hope the previous replies answered this.  It's about writeback
> getting pressure from bdi and isn't restricted to block devices.

So the controlling knobs for network filesystems will be very
different, as the current throttling knobs are per device (and not per
bdi). So presumably there will be some throttling logic in the network
layer (network tc), and that should communicate the back pressure.

I have tried limiting network traffic on NFS using the network
controller and tc, but that did not help, for a variety of reasons.

- We again have the problem of losing the submitter's context down the stack.

- We have interesting TCP/IP sequencing issues. I don't have the details,
  but throttling traffic from one group led to multiple re-transmissions
  from the server due to some sequence number issues. Sorry, I am short
  on details as it was long back, and the nfs guys told me that pNFS
  might help here.

  The basic problem seemed to be that if you multiplex traffic from
  all cgroups over a single tcp/ip session and then suddenly choke IO
  from one of them, it leads to sequence number issues and really sucky
  performance.

So that is something to keep in mind while coming up with ways to
implement throttling for network file systems.

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]     ` <20120404184909.GB29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
@ 2012-04-04 19:23       ` Steve French
  2012-04-04 20:32       ` Vivek Goyal
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 262+ messages in thread
From: Steve French @ 2012-04-04 19:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

On Wed, Apr 4, 2012 at 1:49 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hey, Vivek.
>
> On Wed, Apr 04, 2012 at 10:51:34AM -0400, Vivek Goyal wrote:
>> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
>> > IIUC, without cgroup, the current writeback code works more or less
>> > like this.  Throwing in cgroup doesn't really change the fundamental
>> > design.  Instead of a single pipe going down, we just have multiple
>> > pipes to the same device, each of which should be treated separately.
>> > Of course, a spinning disk can't be divided that easily and their
>> > performance characteristics will be inter-dependent, but the place to
>> > solve that problem is where the problem is, the block layer.
>>
>> How do you take care of throttling IO in the NFS case in this model? Current
>> throttling logic is tied to block device and in case of NFS, there is no
>> block device.
>
> On principle, I don't think it has to be any different.  A filesystem's
> interface to the underlying device is through bdi.  If a fs is block
> backed, block pressure should be propagated through bdi, which should
> be mostly trivial.  If a fs is network backed, we can implement a
> mechanism for network backed bdis, so that they can relay the pressure
> from the server side to the local fs users.
>
> That said, network filesystems often show different behaviors and use
> different mechanisms for various reasons and it wouldn't be too
> surprising if something different would fit them better here or we
> might need something supplemental to the usual mechanism.

For the network file system clients, we may be close already,
but I don't know how to allow servers like Samba or Apache
to query btrfs, xfs etc. for this information.

superblock -> struct backing_dev_info is probably fine as long
as we aren't making that structure more block device specific.
Current use of bdi is a little hard to understand since
there are 25+ fields in the structure.  Is their use/purpose written
up anywhere?  I have a feeling we are under-utilizing what
is already there.  In any case bdi is "backing" info not "block"
specific info.  Since bdi can be assigned to a superblock
and an inode, it seems reasonable for either network or local.

Note that it isn't just traditional network file systems (nfs and cifs and smb2)
but also virtualization (virtfs) and some special purpose file systems
for which block device specific interfaces to higher layers (above the fs)
are an awkward way to think about congestion.   What
about a case of a file system like btrfs that could back a
volume to a pool of devices and distribute hot/cold data
across multiple physical or logical devices?

By the way, there may be less of a problem with current
network file system clients due to small limits on simultaneous i/o.
Until recently the NFS client had a low default slot count of 16 IIRC, and
it was not much better for cifs.   The typical cifs server defaulted
to allowing a client to send only 50 simultaneous requests at one
time ...
The cifs protocol allows more (up to 64K) and in 3.4 the client now
can send more requests (up to 32K) if the server is so configured.

With SMB2 since "credits" are returned on every response, fast
servers (e.g. Samba running on a good clustered file system,
or a good NAS box) may end up allowing thousands of simultaneous
requests if they have the resources to handle this.   Unfortunately,
the Samba server developers do not know how to query superblock->bdi
congestion information from user space.  I vaguely remember bdi
debugging info being available in sysfs, but how would an application
find out how congested the underlying volume it is exporting is?
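The credit mechanism described above can be sketched in a few lines.
This is illustrative Python rather than actual SMB2 wire protocol or
cifs.ko logic; the class and method names are invented for the example:

```python
# Credit-based flow control: the client may only have as many requests in
# flight as it holds credits, and each server response grants new credits.
# A loaded server can grant fewer credits to slow the client down; an idle
# one can grow the window toward thousands of simultaneous requests.

class CreditClient:
    def __init__(self, initial_credits):
        self.credits = initial_credits
        self.in_flight = 0

    def try_send(self):
        """Send one request if we hold a credit; returns True on success."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.in_flight += 1
        return True

    def on_response(self, granted):
        """A response completes one request and grants `granted` credits."""
        self.in_flight -= 1
        self.credits += granted

client = CreditClient(initial_credits=2)
sent = sum(client.try_send() for _ in range(5))  # only 2 requests go out
client.on_response(granted=0)   # congested server: no new credit granted
client.on_response(granted=4)   # relaxed server: window grows
print(sent, client.credits)
```

The interesting property for this thread is that the back pressure comes
from the server side per response, with no block device anywhere in the
loop.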

-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-04 17:51     ` Fengguang Wu
  (?)
@ 2012-04-04 19:33       ` Tejun Heo
  -1 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-04 19:33 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Hey, Fengguang.

On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> Yeah it should be trivial to apply the balance_dirty_pages()
> throttling algorithm to the read/direct IOs. However up to now I don't
> see much added value to *duplicate* the current block IO controller
> functionalities, assuming the current users and developers are happy
> with it.

Heh, trust me.  It's half broken and people ain't happy.  I get that
your algorithm can be updated to consider all IOs and I believe that
but what I don't get is how would such information get to writeback
and in turn how writeback would enforce the result on reads and direct
IOs.  Through what path?  Will all reads and direct IOs travel through
balance_dirty_pages() even direct IOs on raw block devices?  Or would
the writeback algorithm take the configuration from cfq, apply the
algorithm and give back the limits to enforce to cfq?  If the latter,
isn't that at least somewhat messed up?

> I did the buffered write IO controller mainly to fill the gap.  If I
> happen to stand in your way, sorry that's not my initial intention.

No, no, it's not about standing in my way.  As Vivek said in the other
reply, it's that the "gap" that you filled was created *because*
writeback wasn't cgroup aware and now you're in turn filling that gap
by making writeback work around that "gap".  I mean, my mind boggles.
Doesn't yours?  I strongly believe everyone's should.

> It's a pity and surprise that Google as a big user does not buy in
> this simple solution. You may prefer more comprehensive controls which
> may not be easily achievable with the simple scheme. However the
> complexities and overheads involved in throttling the flusher IOs
> really upsets me. 

Heh, believe it or not, I'm not really wearing google hat on this
subject and google's writeback people may have completely different
opinions on the subject than mine.  In fact, I'm not even sure how
much "work" time I'll be able to assign to this.  :(

> The sweet split point would be for balance_dirty_pages() to do cgroup
> aware buffered write throttling and leave other IOs to the current
> blkcg. For this to work well as a total solution for end users, I hope
> we can cooperate and figure out ways for the two throttling entities
> to work well with each other.

Here's where I'm confused.  How is the said split supposed to work?
They aren't independent.  I mean, who gets to decide what and where
are those decisions enforced?

> What I'm interested is, what's Google and other users' use schemes in
> practice. What's their desired interfaces. Whether and how the
> combined bdp+blkcg throttling can fulfill the goals.

I'm not too privy to mm and writeback at Google and even if I were, I
probably shouldn't talk too much about it.  Confidentiality and all.
That said, I have the general feeling that goog has already figured out
how to at least work around the existing implementation and would be
able to continue no matter how upstream development pans out.

That said, wearing the cgroup maintainer and general kernel
contributor hat, I'd really like to avoid another design mess up.

> > Let's please keep the layering clear.  IO limitations will be applied
> > at the block layer and pressure will be formed there and then
> > propagated upwards eventually to the originator.  Sure, exposing the
> > whole information might result in better behavior for certain
> > workloads, but down the road, say, in three or five years, devices
> > which can be shared without worrying too much about seeks might be
> > commonplace and we could be swearing at a disgusting structural mess,
> > and sadly various cgroup support seems to be a prominent source of
> > such design failures.
> 
> Super fast storage is coming, which will make us regret over-complicating
> the IO path.  Spinning disks are not going away anytime soon.
> I doubt Google is willing to afford the disk seek costs on its
> millions of disks and has the patience to wait until switching all of
> the spin disks to SSD years later (if it will ever happen).

This is new.  Let's keep the damn employer out of the discussion.
While the area I work on is affected by my employment (writeback isn't
even my area BTW), I'm not gonna do something adverse to upstream even
if it's beneficial to google and I'm much more likely to do something
which may hurt google a bit if it's gonna benefit upstream.

As for the faster / newer storage argument, that is *exactly* why we
want to keep the layering proper.  Writeback works from the pressure
from the IO stack.  If IO technology changes, we update the IO stack
and writeback still works from the pressure.  It may need to be
adjusted but the principles don't change.
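The pressure-based layering argued for here can be caricatured in a few
lines. This is a gross simplification of balance_dirty_pages() with
made-up numbers, not the real algorithm; it only shows that the dirtier's
pacing can come entirely from accumulated pressure, with the device's
speed appearing nowhere in the throttling logic itself:

```python
# The dirtier is paced purely by the pile of dirty pages above the device,
# so swapping in a faster "device" changes the achieved writeback rate
# without touching the throttling code.

def run(device_rate, ticks, dirty_limit=100):
    dirty = 0     # pages buffered between dirtier and device
    dirtied = 0   # total pages the dirtier managed to dirty
    for _ in range(ticks):
        # Dirtier side: throttle harder as we approach the dirty limit,
        # akin to balance_dirty_pages() inserting pauses.
        if dirty < dirty_limit:
            batch = max(1, (dirty_limit - dirty) // 4)
            dirty += batch
            dirtied += batch
        # Device side: drains at its own rate; this is the only place
        # the hardware's speed appears.
        dirty -= min(dirty, device_rate)
    return dirtied, dirty

slow = run(device_rate=5, ticks=1000)
fast = run(device_rate=50, ticks=1000)

# Faster device => far more pages written, yet dirty pages stay bounded
# in both runs, and the throttling logic never asked what the device was.
print(slow[1] <= 100 and fast[1] <= 100, fast[0] > slow[0])
```

If the IO stack underneath changes, only `device_rate` (the stack's
behavior) changes; the pressure loop above it keeps working unmodified,
which is the layering point being made.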

> It's obvious that your proposal below involves a lot of complexities,
> overheads, and will hurt performance. It basically involves

Hmmm... that's not the impression I got from the discussion.
According to Jan, applying the current writeback logic to cgroup'fied
bdi shouldn't be too complex, no?

> - running concurrent flusher threads for cgroups, which adds back the
>   disk seeks and lock contentions. And still has problems with sync
>   and shared inodes.

I agree this is an actual concern but if the user wants to split one
spindle to multiple resource domains, there's gonna be considerable
amount of overhead no matter what.  If you want to improve how block
layer handles the split, you're welcome to dive into the block layer,
where the split is made, and improve it.

> - splitting device queue for cgroups, possibly scaling up the pool of
>   writeback pages (and locked pages in the case of stable pages) which
>   could stall random processes in the system

Sure, it'll take up more buffering and memory but that's the overhead
of the cgroup business.  I want it to be less intrusive at the cost of
somewhat more resource consumption.  ie. I don't want writeback logic
itself deeply involved in block IO cgroup enforcement even if that
means somewhat less efficient resource usage.

> - the mess of metadata handling

Does throttling from writeback actually solve this problem?  What
about fsync()?  Does that already go through balance_dirty_pages()?

> - unnecessarily coupled with memcg, in order to take advantage of the
>   per-memcg dirty limits for balance_dirty_pages() to actually convert
>   the "pushed back" dirty pages pressure into lowered dirty rate. Why
>   the hell the users *have to* setup memcg (suffering from all the
>   inconvenience and overheads) in order to do IO throttling?  Please,
>   this is really ugly! And the "back pressure" may constantly push the
>   memcg dirty pages to the limits. I'm not going to support *misuse*
>   of per-memcg dirty limits like this!

Writeback sits between blkcg and memcg and it indeed can be hairy to
consider both sides especially given the current sorry complex state
of cgroup and I can see why it would seem tempting to add a separate
controller or at least knobs to support that.  That said, I *think*
given that memcg controls all other memory parameters it probably
would make most sense giving that parameter to memcg too.  I don't
think this is really relevant to this discussion tho.  Who owns
dirty_limits is a separate issue.

> I cannot believe you would keep overlooking all the problems without
> good reasons. Please do tell us the reasons that matter.

Well, I tried and I hope some of it got through.  I also wrote a lot
of questions, mainly regarding how what you have in mind is supposed
to work through what path.  Maybe I'm just not seeing what you're
seeing but I just can't see where all the IOs would go through and
come together.  Can you please elaborate more on that?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

> 
> Super fast storages are coming which will make us regret to make the
> IO path over complex.  Spinning disks are not going away anytime soon.
> I doubt Google is willing to afford the disk seek costs on its
> millions of disks and has the patience to wait until switching all of
> the spin disks to SSD years later (if it will ever happen).

This is new.  Let's keep the damn employer out of the discussion.
While the area I work on is affected by my employment (writeback isn't
even my area BTW), I'm not gonna do something adverse to upstream even
if it's beneficial to google and I'm much more likely to do something
which may hurt google a bit if it's gonna benefit upstream.

As for the faster / newer storage argument, that is *exactly* why we
want to keep the layering proper.  Writeback works from the pressure
from the IO stack.  If IO technology changes, we update the IO stack
and writeback still works from the pressure.  It may need to be
adjusted but the principles don't change.
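The back-pressure principle above can be sketched as a toy queue (illustrative C, not kernel code; all names are made up): a fixed-depth device-side queue refuses new requests when full, which is exactly how pressure formed at the device propagates upward to the originator, and completions release that pressure again.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sketch of back pressure: the device-side queue has a fixed
 * depth; when it is full the submitter is refused and must back off,
 * so pressure at the device reaches the originator. */
#define QUEUE_DEPTH 4

struct io_queue {
    int in_flight;          /* requests currently queued at the device */
};

/* Returns true if the request was accepted, false if the caller
 * must wait (the propagated back pressure). */
static bool submit_io(struct io_queue *q)
{
    if (q->in_flight >= QUEUE_DEPTH)
        return false;       /* pressure propagates upward */
    q->in_flight++;
    return true;
}

/* Device completes one request, releasing pressure. */
static void complete_io(struct io_queue *q)
{
    if (q->in_flight > 0)
        q->in_flight--;
}
```

In a real stack the refusal is a sleep in the submission path rather than an error return, but the shape of the chain is the same.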

> It's obvious that your below proposal involves a lot of complexities,
> overheads, and will hurt performance. It basically involves

Hmmm... that's not the impression I got from the discussion.
According to Jan, applying the current writeback logic to cgroup'fied
bdi shouldn't be too complex, no?

> - running concurrent flusher threads for cgroups, which adds back the
>   disk seeks and lock contentions. And still has problems with sync
>   and shared inodes.

I agree this is an actual concern but if the user wants to split one
spindle to multiple resource domains, there's gonna be considerable
amount of overhead no matter what.  If you want to improve how block
layer handles the split, you're welcome to dive into the block layer,
where the split is made, and improve it.

> - splitting device queue for cgroups, possibly scaling up the pool of
>   writeback pages (and locked pages in the case of stable pages) which
>   could stall random processes in the system

Sure, it'll take up more buffering and memory but that's the overhead
of the cgroup business.  I want it to be less intrusive at the cost of
somewhat more resource consumption.  ie. I don't want writeback logic
itself deeply involved in block IO cgroup enforcement even if that
means somewhat less efficient resource usage.

> - the mess of metadata handling

Does throttling from writeback actually solve this problem?  What
about fsync()?  Does that already go through balance_dirty_pages()?

> - unnecessarily coupled with memcg, in order to take advantage of the
>   per-memcg dirty limits for balance_dirty_pages() to actually convert
>   the "pushed back" dirty pages pressure into lowered dirty rate. Why
>   the hell the users *have to* setup memcg (suffering from all the
>   inconvenience and overheads) in order to do IO throttling?  Please,
>   this is really ugly! And the "back pressure" may constantly push the
>   memcg dirty pages to the limits. I'm not going to support *misuse*
>   of per-memcg dirty limits like this!

Writeback sits between blkcg and memcg and it indeed can be hairy to
consider both sides especially given the current sorry complex state
of cgroup and I can see why it would seem tempting to add a separate
controller or at least knobs to support that.  That said, I *think*
given that memcg controls all other memory parameters it probably
would make most sense giving that parameter to memcg too.  I don't
think this is really relevant to this discussion tho.  Who owns
dirty_limits is a separate issue.

> I cannot believe you would keep overlooking all the problems without
> good reasons. Please do tell us the reasons that matter.

Well, I tried and I hope some of it got through.  I also wrote a lot
of questions, mainly regarding how what you have in mind is supposed
to work through what path.  Maybe I'm just not seeing what you're
seeing but I just can't see where all the IOs would go through and
come together.  Can you please elaborate more on that?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-04 19:33       ` Tejun Heo
@ 2012-04-04 20:18           ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-04 20:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 04, 2012 at 12:33:55PM -0700, Tejun Heo wrote:
> Hey, Fengguang.
> 
> On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > Yeah it should be trivial to apply the balance_dirty_pages()
> > throttling algorithm to the read/direct IOs. However up to now I don't
> > see much added value to *duplicate* the current block IO controller
> > functionalities, assuming the current users and developers are happy
> > with it.
> 
> Heh, trust me.  It's half broken and people ain't happy.  I get that
> your algorithm can be updated to consider all IOs and I believe that
> but what I don't get is how would such information get to writeback
> and in turn how writeback would enforce the result on reads and direct
> IOs.  Through what path?  Will all reads and direct IOs travel through
> balance_dirty_pages() even direct IOs on raw block devices?  Or would
> the writeback algorithm take the configuration from cfq, apply the
> algorithm and give back the limits to enforce to cfq?  If the latter,
> isn't that at least somewhat messed up?

I think he wanted to get the configuration with the help of the blkcg
interface and just implement those policies up there, without any
further interaction with CFQ or lower layers.

[..]
> > The sweet split point would be for balance_dirty_pages() to do cgroup
> > aware buffered write throttling and leave other IOs to the current
> > blkcg. For this to work well as a total solution for end users, I hope
> > we can cooperate and figure out ways for the two throttling entities
> > to work well with each other.
> 
> There's where I'm confused.  How is the said split supposed to work?
> They aren't independent.  I mean, who gets to decide what and where
> are those decisions enforced?

As you said, the split is just temporary gap filling in the absence of a
good solution for throttling buffered writes (which is often a source
of problems for sync IO latencies). So with this solution one could
independently control the buffered write rate of a cgroup. Lower layers
will not throttle that traffic again, as it would show up in the root
cgroup. Hence blkcg and writeback need not communicate much,
except for configuration knobs and possibly some stats.

[..]
> > - running concurrent flusher threads for cgroups, which adds back the
> >   disk seeks and lock contentions. And still has problems with sync
> >   and shared inodes.
> 

Or, export the notion of per-group, per-bdi congestion, so the flusher
does not try to submit IO from an inode if the device is congested. That
way the flusher will not get blocked, we don't have to create one flusher
thread per cgroup, and we can be happy with one flusher per bdi.

And with the compromise of one inode belonging to one cgroup, we will
still dispatch a bunch of IO from one inode and then move to the next.
Depending on the size of the chunk we can reduce the seeks a bit. The size
of the quantum will decide the tradeoff between seek cost and fairness of
writes across inodes.
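The quantum tradeoff can be illustrated with a toy round-robin flusher (illustrative C, not kernel code; inode switches stand in for seek cost): a larger quantum means fewer switches between inodes but coarser, less fair interleaving.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: round-robin writeback over inodes, up to `quantum`
 * pages per inode per turn. */
struct inode_wb { int dirty_pages; };

/* Flush all dirty pages and return the number of inode switches,
 * a stand-in for disk seek cost. */
static int flush_round_robin(struct inode_wb *inodes, int n, int quantum)
{
    int switches = 0;
    bool any = true;
    while (any) {
        any = false;
        for (int i = 0; i < n; i++) {
            if (inodes[i].dirty_pages <= 0)
                continue;
            int chunk = inodes[i].dirty_pages < quantum ?
                        inodes[i].dirty_pages : quantum;
            inodes[i].dirty_pages -= chunk;
            switches++;     /* moved the head to this inode */
            any = true;
        }
    }
    return switches;
}
```

With two inodes of 8 dirty pages each, a quantum of 4 costs four switches while a quantum of 8 costs only two, at the price of one inode waiting out the other's full chunk.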

[..]
> > - the mess of metadata handling
> 
> Does throttling from writeback actually solve this problem?  What
> about fsync()?  Does that already go through balance_dirty_pages()?

By throttling the process at the time of dirtying memory, you admit only
as much IO from the process as the limits allow. Now fsync() has to send
only those pages to the disk and does not have to be throttled again.

So throttling the process while it is admitting IO avoids these issues
with filesystem metadata.

But at the same time it does not feel right to throttle reads and AIO
synchronously. The kernel's current behavior of queuing up the bio and
throttling it asynchronously is desirable. Only buffered writes are a
special case, as we anyway throttle them actively based on the amount of
dirty memory.
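Throttling at dirtying time can be sketched as a toy token bucket in the spirit of balance_dirty_pages() (all names here are illustrative, not the kernel's): each page is charged against the cgroup's budget when it is dirtied, which is why writing it back later, e.g. from fsync(), needs no second charge.

```c
#include <assert.h>

/* Toy per-cgroup dirty-rate budget. */
struct dirty_throttle {
    long tokens;        /* pages the cgroup may still dirty */
    long rate;          /* pages replenished per tick */
};

/* Periodic replenishment of the budget. */
static void throttle_tick(struct dirty_throttle *dt)
{
    dt->tokens += dt->rate;
}

/* Charge `pages` at dirtying time; returns the pages actually
 * admitted. A real implementation would sleep the dirtier until
 * tokens arrive instead of refusing. */
static long charge_dirty(struct dirty_throttle *dt, long pages)
{
    long admit = pages < dt->tokens ? pages : dt->tokens;
    if (admit < 0)
        admit = 0;
    dt->tokens -= admit;
    return admit;
}
```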

[..]
> 
> > - unnecessarily coupled with memcg, in order to take advantage of the
> >   per-memcg dirty limits for balance_dirty_pages() to actually convert
> >   the "pushed back" dirty pages pressure into lowered dirty rate. Why
> >   the hell the users *have to* setup memcg (suffering from all the
> >   inconvenience and overheads) in order to do IO throttling?  Please,
> >   this is really ugly! And the "back pressure" may constantly push the
> >   memcg dirty pages to the limits. I'm not going to support *miss use*
> >   of per-memcg dirty limits like this!
> 
> Writeback sits between blkcg and memcg and it indeed can be hairy to
> consider both sides especially given the current sorry complex state
> of cgroup and I can see why it would seem tempting to add a separate
> controller or at least knobs to support that.  That said, I *think*
> given that memcg controls all other memory parameters it probably
> would make most sense giving that parameter to memcg too.  I don't
> think this is really relevant to this discussion tho.  Who owns
> dirty_limits is a separate issue.

I agree that the dirty_limit control resembles memcg more closely than
blkcg, as it is all about writing to memory and that's the resource
controlled by memcg.

I think Fengguang wanted to keep those knobs in blkcg because he thinks
that in the writeback logic he can actively throttle readers and direct
IO too. But that sounds a little messy to me too.

Hey, how about reconsidering my other proposal, for which I had posted
patches? That is: keep throttling at the device level. Reads and direct
IO get throttled asynchronously, but buffered writes get throttled
synchronously.

Advantages of this scheme.

- There are no separate knobs.

- All the IO (read, direct IO and buffered write) is controlled using
  same set of knobs and goes in queue of same cgroup.

- Writeback logic has no knowledge of throttling. It just invokes a 
  hook into throttling logic of device queue.

I guess this is a hybrid of active writeback throttling and back pressure
mechanism.
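A toy sketch of this hybrid (illustrative only; `throttle_hook` and its accounting are invented for the example, not the posted patches): reads and direct IO keep being charged and drained at the device, while a buffered write asks that same per-cgroup budget how long the dirtier should sleep, so writeback stays unaware of throttling beyond this one hook.

```c
#include <assert.h>

/* One per-cgroup, device-level budget shared by all IO types. */
struct blk_throttle {
    long bps;           /* allowed bytes per second (> 0) */
    long pending;       /* bytes already charged this second */
};

/* Hook called synchronously from the dirtying path: charge `bytes`
 * and return the whole seconds the caller should sleep before
 * proceeding. Delay grows with the accumulated backlog. */
static long throttle_hook(struct blk_throttle *bt, long bytes)
{
    bt->pending += bytes;
    return bt->pending / bt->bps;
}
```

The point of the shape is that the same structure would also back the asynchronous read/direct-IO path, so there is a single set of knobs and a single queue per cgroup.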

But it still does not solve the NFS issue, and for direct IO
filesystems can still get serialized, so the metadata issue still needs
to be resolved. So one can argue: why not go for the full "back
pressure" method, despite it being more complex?

Here is the link, just to refresh the memory. Something to keep in mind
while assessing alternatives.

https://lkml.org/lkml/2011/6/28/243

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]     ` <20120404184909.GB29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
  2012-04-04 19:23       ` Steve French
@ 2012-04-04 20:32       ` Vivek Goyal
  2012-04-05 16:38       ` Tejun Heo
  2012-04-14 11:53       ` [Lsf] " Peter Zijlstra
  3 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-04 20:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 04, 2012 at 11:49:09AM -0700, Tejun Heo wrote:

[..]

> Thirdly, I don't see how writeback can control all the IOs.  I mean,
> what about reads or direct IOs?  It's not like IO devices have
> separate channels for those different types of IOs.  They interact
> heavily.

> Let's say we have iops/bps limitation applied on top of proportional IO
> distribution

We already do that. First, IO is subjected to the throttling limit, and
only then is it passed to the elevator for proportional IO. So throttling
is already stacked on top of proportional IO. The only question is
whether it should be pushed to even higher layers or not.
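The stacking can be sketched as two toy stages (illustrative C; all names and numbers are invented): an absolute per-cgroup throttle cap first, then a weight-based split of device capacity over the capped demands.

```c
#include <assert.h>

static long min_long(long a, long b) { return a < b ? a : b; }

/* Stage 1 (throttling): clamp a cgroup's demand to its limit. */
static long throttle_stage(long demand, long limit)
{
    return min_long(demand, limit);
}

/* Stage 2 (proportional IO): split `capacity` between two already
 * capped demands by weight; surplus from an under-demanding cgroup
 * flows to the other. */
static void proportional_stage(long capacity,
                               long demand_a, long weight_a,
                               long demand_b, long weight_b,
                               long *out_a, long *out_b)
{
    long share_a = capacity * weight_a / (weight_a + weight_b);
    *out_a = min_long(demand_a, share_a);
    *out_b = min_long(demand_b, capacity - *out_a);
}
```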


> or a device holds two partitions and one
> of them is being used for direct IO w/o filesystems.  How would that
> work?  I think the question goes even deeper, what do the separate
> limits even mean?

Separate limits for buffered writes are just filling the gap. Agreed it
is not a very neat solution.

>  Does the IO sched have to calculate allocation of
> IO resource to different types of IOs and then give a "number" to
> writeback which in turn enforces that limit?  How does the elevator
> know what number to give?  Is the number iops or bps or weight?

If we push all the throttling up into some higher layer, say some kind
of per-bdi throttling interface, then the elevator just has to worry
about doing proportional IO. No interaction with higher layers
regarding iops/bps etc. (not that the elevator worries about it today).

> If
> the iosched doesn't know how much write workload exists, how does it
> distribute the surplus buffered writeback resource across different
> cgroups?  If so, what makes the limit actually enforceable (due to
> inaccuracies in estimation, fluctuation in workload, delay in
> enforcement in different layers and whatnot) except for block layer
> applying the limit *again* on the resulting stream of combined IOs?

So the split model is definitely confusing. Anyway, the block layer will
not apply the limits again, as flusher IO goes in the root cgroup, which
is generally unthrottled. Or the flusher could mark the bios with a flag
saying "do not throttle these bios again", as they have already been
throttled. So double throttling is probably not an issue.
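The "do not throttle" marking could look roughly like this (purely hypothetical sketch; `BIO_ALREADY_THROTTLED` and `struct fake_bio` are invented names, not real kernel flags or types): writeback IO charged at dirtying time carries the flag and bypasses the block-layer throttle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag set by the flusher on bios whose pages were
 * already throttled when they were dirtied. */
#define BIO_ALREADY_THROTTLED 0x1u

struct fake_bio { unsigned flags; };

/* Block-layer check: throttle only bios not charged upstream. */
static bool should_throttle(const struct fake_bio *bio)
{
    return !(bio->flags & BIO_ALREADY_THROTTLED);
}
```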

In summary: agreed that the split is confusing, and that it fills a gap
existing today.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-04 18:35       ` Vivek Goyal
  (?)
@ 2012-04-04 21:42           ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-04 21:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote:
> On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> 
> [..]
> > The sweet split point would be for balance_dirty_pages() to do cgroup
> > aware buffered write throttling and leave other IOs to the current
> > blkcg. For this to work well as a total solution for end users, I hope
> > we can cooperate and figure out ways for the two throttling entities
> > to work well with each other.
> 
> Throttling read + direct IO, higher up has few issues too. Users will

Yeah, I have a bit of a worry about high-layer throttling, too.
Anyway, here are the ideas.

> not like that a task got blocked as it tried to submit a read from a
> throttled group.

That's not the same issue I worried about :) Throttling is about
inserting small sleeps/waits at selected points. For reads, the ideal
sleep point is immediately after the readahead IO is submitted, at the
end of __do_page_cache_readahead(). The same should be applicable to
direct IO.
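
As a toy sketch of the sleep arithmetic (user-space Python; the function
name and the numbers are invented for illustration, not taken from the
kernel):

```python
# Toy sketch: after each readahead window is submitted, the reader
# sleeps long enough that its long-run rate matches task_ratelimit.
# The real hook would sit at the end of __do_page_cache_readahead().

def readahead_throttle_delay(nr_pages, task_ratelimit):
    """Seconds to sleep after submitting nr_pages of readahead IO,
    given a ratelimit in pages per second."""
    return nr_pages / float(task_ratelimit)

# A task limited to 256 pages/s submitting 32-page windows sleeps
# 125 ms per window: 8 windows/s * 32 pages = 256 pages/s overall.
print(readahead_throttle_delay(32, 256))  # 0.125
```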

> Current async behavior works well where we queue up the
> bio from the task in throttled group and let task do other things. Same
> is true for AIO where we would not like to block in bio submission.

For AIO, we'll need to delay the IO completion notification or status
update, which may involve computing a delay time and deferring the
calls to io_complete() with the help of some delayed work queue. There
may be more issues to deal with, as I didn't look into aio.c carefully.

The thing that worries me is that in the proportional throttling case,
the high-level throttling works on the *estimated* task_ratelimit =
disk_bandwidth / N, where N is the number of read IO tasks. When N
suddenly changes from 2 to 1, it may take 1 second for the estimated
task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth,
during which time the disk won't be 100% utilized because of the
temporary over-throttling of the remaining IO task.
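
To make that adaptation lag concrete, here is a toy model of the
estimation (plain user-space Python, nothing kernel-specific; the 200 ms
update period and the 50% smoothing step are invented for illustration,
not the actual balance_dirty_pages() parameters):

```python
# Toy model of the estimation lag: the ratelimit estimate moves only
# a fraction of the way toward the instantaneous target per update.

def adapt(estimate, target, steps, alpha=0.5):
    """Move `estimate` toward `target`, `alpha` fraction per step
    (one step standing in for one ~200 ms estimation period)."""
    for _ in range(steps):
        estimate += alpha * (target - estimate)
    return estimate

disk_bw = 100.0      # MB/s, hypothetical device
est = disk_bw / 2    # two readers: each throttled to 50 MB/s

# One reader exits; the target jumps to the full bandwidth, but the
# estimate needs several periods to catch up -- the disk idles meanwhile.
for step in range(1, 6):
    est = adapt(est, disk_bw, 1)
    print(f"{step * 200} ms: estimated ratelimit = {est:.1f} MB/s")
```

After five steps (about one second in this toy model) the estimate has
climbed from 50 to roughly 98 MB/s, matching the ~1 second adaptation
described above.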

This is not a problem when throttling at the block/cfq layer, since it
has the full information of pending requests and should not depend on
such estimations.

The workaround I can think of is to put the throttled task into a wait
queue, and let the block layer wake up the waiters when the IO queue runs
empty. This should be able to avoid most disk idle time.
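
A minimal single-threaded sketch of that workaround, with all names
invented for illustration (a real implementation would use a
wait_queue_head_t and wake_up() in the kernel):

```python
from collections import deque

class ToyBlockLayer:
    """Toy block layer: throttled submitters park on a wait queue and
    are woken the moment the request queue runs empty, so the disk
    never idles while work is parked."""

    def __init__(self):
        self.requests = deque()
        self.waiters = deque()   # throttled tasks parked here
        self.dispatched = []

    def submit(self, req):
        self.requests.append(req)

    def park(self, task):
        """A throttled task waits here instead of sleeping a fixed delay."""
        self.waiters.append(task)

    def dispatch_one(self):
        self.dispatched.append(self.requests.popleft())
        if not self.requests and self.waiters:
            # Queue ran empty: wake a parked task so it submits at once.
            task = self.waiters.popleft()
            task(self)

bl = ToyBlockLayer()
bl.submit("A1")
bl.park(lambda b: b.submit("B1"))  # task B throttled after its first IO

bl.dispatch_one()  # drains A1, which immediately wakes B
print(bl.dispatched, list(bl.requests))
```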

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]       ` <20120404203239.GM12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-04 23:02         ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-04 23:02 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hello, Vivek.

On Wed, Apr 04, 2012 at 04:32:39PM -0400, Vivek Goyal wrote:
> > Let's say we have iops/bps limitation applied on top of proportional IO
> > distribution
> 
> We already do that. First IO is subjected to throttling limit and only 
> then it is passed to the elevator to do the proportional IO. So throttling
> is already stacked on top of proportional IO. The only question is 
> should it be pushed to even higher layers or not.

Yeah, I know we already can do that.  I was trying to give an example
of non-trivial IO limit configuration.

> So split model is definitely confusing. Anyway, block layer will not
> apply the limits again as flusher IO will go in root cgroup which 
> generally goes to root which is unthrottled generally. Or flusher
> could mark the bios with a flag saying "do not throttle" bios again as
> these have been throttled already. So throttling again is probably not
> an issue. 
> 
> In summary, agreed that split is confusing and it fills a gap existing
> today.

It's not only confusing.  It's broken.  So, what you're saying is that
there's no provision to orchestrate between buffered writes and other
types of IOs.  So, it would essentially work as if there are two
separate controls controlling each of two heavily interacting parts
with no designed provision between them.  What the....

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-04 21:42           ` Fengguang Wu
                             ` (2 preceding siblings ...)
  (?)
@ 2012-04-05 15:10           ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-05 15:10 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 04, 2012 at 02:42:28PM -0700, Fengguang Wu wrote:
> On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote:
> > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > 
> > [..]
> > > The sweet split point would be for balance_dirty_pages() to do cgroup
> > > aware buffered write throttling and leave other IOs to the current
> > > blkcg. For this to work well as a total solution for end users, I hope
> > > we can cooperate and figure out ways for the two throttling entities
> > > to work well with each other.
> > 
> > Throttling read + direct IO, higher up has few issues too. Users will
> 
> Yeah I have a bit worry about high layer throttling, too.
> Anyway here are the ideas.
> 
> > not like that a task got blocked as it tried to submit a read from a
> > throttled group.
> 
> That's not the same issue I worried about :) Throttling is about
> inserting small sleep/waits into selected points. For reads, the ideal
> sleep point is immediately after readahead IO is summited, at the end
> of __do_page_cache_readahead(). The same should be applicable to
> direct IO.

But after a read, the process might want to process the read data and
do something else altogether, so throttling the process after the read
completes is not the best thing.

> 
> > Current async behavior works well where we queue up the
> > bio from the task in throttled group and let task do other things. Same
> > is true for AIO where we would not like to block in bio submission.
> 
> For AIO, we'll need to delay the IO completion notification or status
> update, which may involve computing some delay time and delay the
> calls to io_complete() with the help of some delayed work queue. There
> may be more issues to deal with as I didn't look into aio.c carefully.

I don't know, but delaying completion notifications sounds odd to me. So
you don't throttle while submitting requests? That does not help with
pressure on the request queue, as a process can dump a whole bunch of IO
without waiting for completions.

What I like better is that AIO is allowed to submit a bunch of IO until
it hits the nr_requests limit on the request queue, and is then blocked
because the request queue is too busy and not enough request descriptors
are free.

> 
> The thing worried me is that in the proportional throttling case, the
> high level throttling works on the *estimated* task_ratelimit =
> disk_bandwidth / N, where N is the number of read IO tasks. When N
> suddenly changes from 2 to 1, it may take 1 second for the estimated
> task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth,
> during which time the disk won't get 100% utilized because of the
> temporally over-throttling of the remaining IO task.

I thought we were only considering the case of absolute throttling in
the higher layers. Proportional IO will continue to be in CFQ; I don't
think we need to push proportional IO into higher layers.

> 
> This is not a problem when throttling at the block/cfq layer, since it
> has the full information of pending requests and should not depend on
> such estimations.

CFQ does not even look at pending requests. It just maintains a bunch
of IO queues and selects one queue to dispatch IO from based on its
weight. So proportional IO comes very naturally to CFQ.
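
For illustration, weight-based queue selection can be sketched as a tiny
virtual-time scheduler (a toy model of the principle, not CFQ's actual
vdisktime code):

```python
# Toy model: each queue carries a virtual time that advances by
# 1/weight per dispatch, and the scheduler always serves the queue
# with the smallest virtual time. Over many dispatches, service
# converges to the weight ratio without inspecting pending requests.

def dispatch(weights, rounds):
    vtime = {q: 0.0 for q in weights}
    served = {q: 0 for q in weights}
    for _ in range(rounds):
        q = min(vtime, key=lambda k: (vtime[k], k))  # tie-break by name
        served[q] += 1
        vtime[q] += 1.0 / weights[q]
    return served

# A weight-2 queue gets exactly twice the dispatches of a weight-1 queue.
print(dispatch({"fast": 2, "slow": 1}, 300))  # {'fast': 200, 'slow': 100}
```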

> 
> The workaround I can think of, is to put the throttled task into a wait
> queue, and let block layer wake up the waiters when the IO queue runs
> empty. This should be able to avoid most disk idle time.

Again, I am not convinced that proportional IO should go in higher layers.

For fast devices we are already suffering from queue locking overhead,
and Jens seems to have patches for multiqueue. By trying to implement
something at a higher layer, that locking overhead will show up there
too, and we will end up doing something similar to multiqueue there,
which is not desirable.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-04 21:42           ` Fengguang Wu
@ 2012-04-05 15:10             ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-05 15:10 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 04, 2012 at 02:42:28PM -0700, Fengguang Wu wrote:
> On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote:
> > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > 
> > [..]
> > > The sweet split point would be for balance_dirty_pages() to do cgroup
> > > aware buffered write throttling and leave other IOs to the current
> > > blkcg. For this to work well as a total solution for end users, I hope
> > > we can cooperate and figure out ways for the two throttling entities
> > > to work well with each other.
> > 
> > Throttling read + direct IO, higher up has few issues too. Users will
> 
> Yeah I have a bit worry about high layer throttling, too.
> Anyway here are the ideas.
> 
> > not like that a task got blocked as it tried to submit a read from a
> > throttled group.
> 
> That's not the same issue I worried about :) Throttling is about
> inserting small sleep/waits into selected points. For reads, the ideal
> sleep point is immediately after readahead IO is summited, at the end
> of __do_page_cache_readahead(). The same should be applicable to
> direct IO.

But after a read the process might want to process the read data and
do something else altogether. So throttling the process after completing
the read is not the best thing.

> 
> > Current async behavior works well where we queue up the
> > bio from the task in throttled group and let task do other things. Same
> > is true for AIO where we would not like to block in bio submission.
> 
> For AIO, we'll need to delay the IO completion notification or status
> update, which may involve computing some delay time and delay the
> calls to io_complete() with the help of some delayed work queue. There
> may be more issues to deal with as I didn't look into aio.c carefully.

I don't know but delaying compltion notifications sounds odd to me. So
you don't throttle while submitting requests. That does not help with
pressure on request queue as process can dump whole bunch of IO without
waiting for completion?

What I like better that AIO is allowed to submit bunch of IO till it
hits the nr_requests limit on request queue and then it is blocked as
request queue is too busy and not enough request descriptors are free.

> 
> The thing worried me is that in the proportional throttling case, the
> high level throttling works on the *estimated* task_ratelimit =
> disk_bandwidth / N, where N is the number of read IO tasks. When N
> suddenly changes from 2 to 1, it may take 1 second for the estimated
> task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth,
> during which time the disk won't get 100% utilized because of the
> temporally over-throttling of the remaining IO task.

I thought we were only considering the case of absolute throttling in
higher layers. Proportional IO will continue to be in CFQ. I don't think
we need to push proportional IO in higher layers.

> 
> This is not a problem when throttling at the block/cfq layer, since it
> has the full information of pending requests and should not depend on
> such estimations.

CFQ does not even look at pending requests. It just maintains bunch
of IO queues and selects one queue to dispatch IO from based on its
weight. So proportional IO comes very naturally to CFQ.

> 
> The workaround I can think of, is to put the throttled task into a wait
> queue, and let block layer wake up the waiters when the IO queue runs
> empty. This should be able to avoid most disk idle time.

Again, I am not convinced that proportional IO should go in higher layers.

For fast devices we are already suffering from queue locking overhead and
Jens seems to have patches for multi queue. Now by trying to implement
something at higher layer, that locking overhead will show up there too
and we will end up doing something similar to multi queue there and it
is not desirable.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [RFC] writeback and cgroup
       [not found]           ` <20120404201816.GL12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-05 16:31             ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-05 16:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

Hey, Vivek.

On Wed, Apr 04, 2012 at 04:18:16PM -0400, Vivek Goyal wrote:
> Hey how about reconsidering my other proposal for which I had posted
> the patches. And that is keep throttling still at device level. Reads
> and direct IO get throttled asynchronously but buffered writes get
> throttled synchronously.
> 
> Advantages of this scheme.
> 
> - There are no separate knobs.
> 
> - All the IO (read, direct IO and buffered write) is controlled using
>   same set of knobs and goes in queue of same cgroup.
> 
> - Writeback logic has no knowledge of throttling. It just invokes a 
>   hook into throttling logic of device queue.
> 
> I guess this is a hybrid of active writeback throttling and back pressure
> mechanism.
> 
> But it still does not solve the NFS issue as well as for direct IO,
> filesystems still can get serialized, so metadata issue still needs to 
> be resolved. So one can argue that why not go for full "back pressure"
> method, despite it being more complex.
> 
> Here is the link, just to refresh the memory. Something to keep in mind
> while assessing alternatives.
> 
> https://lkml.org/lkml/2011/6/28/243

Hmmm... so, this only works for blk-throttle and not with proportional
weights.  How do you manage the interaction between buffered writes and
direct writes for the same cgroup?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]     ` <20120404184909.GB29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
  2012-04-04 19:23       ` Steve French
  2012-04-04 20:32       ` Vivek Goyal
@ 2012-04-05 16:38       ` Tejun Heo
  2012-04-14 11:53       ` [Lsf] " Peter Zijlstra
  3 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-05 16:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

Hey, Vivek.

On Wed, Apr 04, 2012 at 11:49:09AM -0700, Tejun Heo wrote:
> > I am not sure what are you trying to say here. But primarily blk-throttle
> > will throttle read and direct IO. Buffered writes will go to root cgroup
> > which is typically unthrottled.
> 
> Ooh, my bad then.  Anyways, then the same applies to blk-throttle.
> Our current implementation essentially collapses at the face of
> write-heavy workload.

I went through the code and couldn't find where blk-throttle
discriminates against async IOs.  Were you saying that blk-throttle
currently doesn't throttle them because those IOs aren't associated
with the dirtying task?  If so, note that it's different from cfq,
which explicitly assigns all async IOs when choosing the cfqq even if
we fix tagging.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]             ` <20120405163113.GD12854-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-05 17:09               ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-05 17:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Thu, Apr 05, 2012 at 09:31:13AM -0700, Tejun Heo wrote:
> Hey, Vivek.
> 
> On Wed, Apr 04, 2012 at 04:18:16PM -0400, Vivek Goyal wrote:
> > Hey how about reconsidering my other proposal for which I had posted
> > the patches. And that is keep throttling still at device level. Reads
> > and direct IO get throttled asynchronously but buffered writes get
> > throttled synchronously.
> > 
> > Advantages of this scheme.
> > 
> > - There are no separate knobs.
> > 
> > - All the IO (read, direct IO and buffered write) is controlled using
> >   same set of knobs and goes in queue of same cgroup.
> > 
> > - Writeback logic has no knowledge of throttling. It just invokes a 
> >   hook into throttling logic of device queue.
> > 
> > I guess this is a hybrid of active writeback throttling and back pressure
> > mechanism.
> > 
> > But it still does not solve the NFS issue as well as for direct IO,
> > filesystems still can get serialized, so metadata issue still needs to 
> > be resolved. So one can argue that why not go for full "back pressure"
> > method, despite it being more complex.
> > 
> > Here is the link, just to refresh the memory. Something to keep in mind
> > while assessing alternatives.
> > 
> > https://lkml.org/lkml/2011/6/28/243
> 
> Hmmm... so, this only works for blk-throttle and not with the weight.
> How do you manage interaction between buffered writes and direct
> writes for the same cgroup?
> 

Yes, it is only for blk-throttle. We just account for buffered writes
in balance_dirty_pages() instead of when they are actually submitted to
the device by the flusher thread.

IIRC, I just had two queues. One queue held bios and the other held
tasks along with information about how much memory they were dirtying.
I did round-robin dispatch between the two queues based on the
throttling rate: dispatch a bio from the direct IO queue, then look at
the other queue, see how much IO the waiting task wanted to do, and,
once sufficient time had passed at the throttling rate, remove that
task from the wait queue and wake it up.

That way it becomes equivalent to two IO paths (direct IO + buffered
write) feeding a single pipe with one throttling limit. Both kinds of
IO are subjected to the same common limit (with no split); we just
round-robin between the two types of IO and try to divide the available
bandwidth equally (this, of course, could be made tunable).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-05 16:38       ` Tejun Heo
  (?)
@ 2012-04-05 17:13           ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-05 17:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Thu, Apr 05, 2012 at 09:38:54AM -0700, Tejun Heo wrote:
> Hey, Vivek.
> 
> On Wed, Apr 04, 2012 at 11:49:09AM -0700, Tejun Heo wrote:
> > > I am not sure what are you trying to say here. But primarily blk-throttle
> > > will throttle read and direct IO. Buffered writes will go to root cgroup
> > > which is typically unthrottled.
> > 
> > Ooh, my bad then.  Anyways, then the same applies to blk-throttle.
> > Our current implementation essentially collapses at the face of
> > write-heavy workload.
> 
> I went through the code and couldn't find where blk-throttle is
> discriminating async IOs.  Were you saying that blk-throttle currently
> doesn't throttle because those IOs aren't associated with the dirtying
> task?

Yes, that's what I meant. Currently most of the async IO will come
from the flusher thread, which is in the root cgroup. So all the async
IO will be in the root group, and we typically keep the root group
unthrottled. Sorry for the confusion here.

> If so, note that it's different from cfq which explicitly
> assigns all async IOs when choosing cfqq even if we fix tagging.

Yes. So if we can properly account for the submitter, then for
blk-throttle the async IO will go into the right cgroup. Unlike CFQ,
there is no hard-coded logic keeping async IO in a particular group;
it is just a matter of getting the right cgroup information.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]             ` <20120405151026.GB23999-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-06  0:32               ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-06  0:32 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Vivek,

I totally agree that direct IOs are best handled in the block/cfq layers.

On Thu, Apr 05, 2012 at 11:10:26AM -0400, Vivek Goyal wrote:
> On Wed, Apr 04, 2012 at 02:42:28PM -0700, Fengguang Wu wrote:
> > On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote:
> > > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > > 
> > > [..]
> > > > The sweet split point would be for balance_dirty_pages() to do cgroup
> > > > aware buffered write throttling and leave other IOs to the current
> > > > blkcg. For this to work well as a total solution for end users, I hope
> > > > we can cooperate and figure out ways for the two throttling entities
> > > > to work well with each other.
> > > 
> > > Throttling read + direct IO, higher up has few issues too. Users will
> > 
> > Yeah I have a bit worry about high layer throttling, too.
> > Anyway here are the ideas.
> > 
> > > not like that a task got blocked as it tried to submit a read from a
> > > throttled group.
> > 
> > That's not the same issue I worried about :) Throttling is about
> > inserting small sleeps/waits at selected points. For reads, the
> > ideal sleep point is immediately after the readahead IO is submitted,
> > at the end of __do_page_cache_readahead(). The same should be
> > applicable to direct IO.
> 
> But after a read the process might want to process the read data and
> do something else altogether. So throttling the process after completing
> the read is not the best thing.

__do_page_cache_readahead() returns immediately after queuing the read
IOs. It may block occasionally on metadata IO but not data IO.

> > > Current async behavior works well where we queue up the
> > > bio from the task in throttled group and let task do other things. Same
> > > is true for AIO where we would not like to block in bio submission.
> > 
> > For AIO, we'll need to delay the IO completion notification or status
> > update, which may involve computing some delay time and delay the
> > calls to io_complete() with the help of some delayed work queue. There
> > may be more issues to deal with as I didn't look into aio.c carefully.
> 
> I don't know but delaying completion notifications sounds odd to me. So
> you don't throttle while submitting requests. That does not help with
> pressure on the request queue as the process can dump a whole bunch of
> IO without waiting for completion?
> 
> What I like better is that AIO is allowed to submit a bunch of IO till
> it hits the nr_requests limit on the request queue and is then blocked
> as the request queue is too busy and not enough request descriptors are free.

You are right. Throttling direct IO and AIO in a higher layer has the
problems of added delays and reduced queue fullness. I suspect it may also
lead to extra cfq anticipatory idling and disk idling. And it won't be
able to deal with ioprio. All in all there are lots of problems actually.

> > The thing that worries me is that in the proportional throttling case,
> > the high-level throttling works on the *estimated* task_ratelimit =
> > disk_bandwidth / N, where N is the number of read IO tasks. When N
> > suddenly changes from 2 to 1, it may take 1 second for the estimated
> > task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth,
> > during which time the disk won't be 100% utilized because of the
> > temporary over-throttling of the remaining IO task.
> 
> I thought we were only considering the case of absolute throttling in
> higher layers. Proportional IO will continue to be in CFQ. I don't think
> we need to push proportional IO in higher layers.

Agreed for direct IO.

As for buffered writes, I'm seriously considering the possibility of
doing proportional IO control in balance_dirty_pages().

I'd take this as the central problem of this thread. If the CFQ
proportional IO controller can do its work well for direct IOs and
leave the buffered writes to the balance_dirty_pages() proportional IO
controller, it would result in a simple and efficient "feedback" system
(compared to the "push back" idea).

I don't really know about any real use cases. However it seems to me
(and perhaps to Jan Kara) that the most user-friendly and manageable IO
controller interface would allow the user to divide disk time (no matter
whether it's used for reads or writes, direct or buffered IOs) among
the cgroups, and then allow each cgroup to further split up disk time
(or bps/iops) among different types of IO.

For simplicity, let's assume only direct/buffered writes are happening
and the user configures 3 blkio cgroups A, B, C with equal split of
disk time and equal direct:buffered splits inside each cgroup.

In the case of

        A:      1 direct write dd + 1 buffered write dd
        B:      1 direct write dd
        C:      1 buffered write dd

The dd tasks should ideally be throttled to

        A.direct:       1/6 disk time
        A.buffered:     1/6 disk time
        B.direct:       1/3 disk time
        C.buffered:     1/3 disk time

So is it possible for the proportional block IO controller to throttle
direct IOs to

        A.direct:       1/6 disk time
        B.direct:       1/3 disk time

and leave the remaining 1/2 disk time to buffered writes from the
flusher thread?

Then I promise that balance_dirty_pages() will be able to throttle the
buffered writes to:

        A.buffered:     1/6 disk time
        C.buffered:     1/3 disk time

thanks to the fact that the balance_dirty_pages() throttling algorithm
is pretty adaptive. It will be able to work well with the blkio
throttling to achieve the throttling goals.

In the above case,

        equal split of disk time == equal split of write bandwidth

since all cgroups run the same type of workload.
balance_dirty_pages() will be able to work in that
cooperative way after adding some direct IO rate accounting.

In order to deal with mixed random/sequential workloads,
balance_dirty_pages() will also need some disk time stats feedback.
It will then throttle the dirtiers so that the disk time goals are
matched in the long run.

> > This is not a problem when throttling at the block/cfq layer, since it
> > has the full information of pending requests and should not depend on
> > such estimations.
> 
> CFQ does not even look at pending requests. It just maintains a bunch
> of IO queues and selects one queue to dispatch IO from based on its
> weight. So proportional IO comes very naturally to CFQ.

Sure. Nice work!

> > 
> > The workaround I can think of, is to put the throttled task into a wait
> > queue, and let block layer wake up the waiters when the IO queue runs
> > empty. This should be able to avoid most disk idle time.
> 
> Again, I am not convinced that proportional IO should go in higher layers.
> 
> For fast devices we are already suffering from queue locking overhead and
> Jens seems to have patches for multi-queue. Now by trying to implement
> something at a higher layer, that locking overhead will show up there too
> and we will end up doing something similar to multi-queue there, which
> is not desirable.

Sure, yeah it's a hack. I was not really happy with it.

Thanks,
Fengguang



* Re: [RFC] writeback and cgroup
       [not found]       ` <20120404193355.GD29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
  2012-04-04 20:18           ` Vivek Goyal
@ 2012-04-06  9:59         ` Fengguang Wu
  1 sibling, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-06  9:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers,
	linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel,
	cgroups, vgoyal

Hi Tejun,

On Wed, Apr 04, 2012 at 12:33:55PM -0700, Tejun Heo wrote:
> Hey, Fengguang.
> 
> On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > Yeah it should be trivial to apply the balance_dirty_pages()
> > throttling algorithm to the read/direct IOs. However up to now I don't
> > see much added value to *duplicate* the current block IO controller
> > functionalities, assuming the current users and developers are happy
> > with it.
> 
> Heh, trust me.  It's half broken and people ain't happy.  I get that

Yeah, although the balance_dirty_pages() IO controller for buffered
writes looks perfect in itself, it's not enough to meet user demands.

The user expectation should be: hey, please throttle *all* IOs from
this cgroup to this amount, either in absolute bps/iops limits or in
some proportional weight value (or both, with whichever is lower taking
effect).  And if necessary, he may request further limits/weights for
each type of IO inside the cgroup.

Now the blkio cgroup supports direct IO and the balance_dirty_pages()
IO controller supports buffered writes. They provide limits/weights
for either direct IO or buffered writes, which is fine for pure direct
IO or pure buffered write workloads. For the common mixed IO
workloads, it's obviously not enough.

Fortunately, the above gap can be easily filled judging from the
block/cfq IO controller code. By adding some direct IO accounting
and changing several lines of my patches to make use of the collected
stats, the semantics of the blkio.throttle.write_bps interfaces can be
changed from "limit for direct IO" to "limit for direct+buffered IOs".
Ditto for blkio.weight and blkio.write_iops, as long as some
iops/device time stats are made available to balance_dirty_pages().

It would be a fairly *easy* change. :-) It's merely adding some
accounting code and there is no need to change the block IO
controlling algorithm at all. I'll do the work of accounting (which
is basically independent of the IO controlling) and use the new stats
in balance_dirty_pages().

The only problem I can see now is that balance_dirty_pages() works
per-bdi and blkcg works per-device. So the two ends may not match
nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc, where
sdb is shared by lv0 and lv1. However such situations should be rare
and much more acceptable than the problems arising from the "push
back" approach, which impacts everyone.

> your algorithm can be updated to consider all IOs and I believe that
> but what I don't get is how would such information get to writeback
> and in turn how writeback would enforce the result on reads and direct
> IOs.  Through what path?  Will all reads and direct IOs travel through
> balance_dirty_pages() even direct IOs on raw block devices?  Or would
> the writeback algorithm take the configuration from cfq, apply the
> algorithm and give back the limits to enforce to cfq?  If the latter,
> isn't that at least somewhat messed up?

cfq is working well and don't need any modifications. Let's just make
balance_dirty_pages() cgroup aware and fill the gap of the current
block IO controller.

If the balance_dirty_pages() throttling algorithms will ever be
applied to read and direct IOs, it would be for NFS, CIFS etc. Even
for them, there may be better throttling choices. For example, Trond
mentioned the RPC layer to me during the summit.

> > I did the buffered write IO controller mainly to fill the gap.  If I
> > happen to stand in your way, sorry that's not my initial intention.
> 
> No, no, it's not about standing in my way.  As Vivek said in the other
> reply, it's that the "gap" that you filled was created *because*
> writeback wasn't cgroup aware and now you're in turn filling that gap
> by making writeback work around that "gap".  I mean, my mind boggles.
> Doesn't yours?  I strongly believe everyone's should.

Heh. It's a hard problem indeed. I felt great pains in the IO-less
dirty throttling work. I did a lot of reasoning about it, and have in
fact kept the cgroup IO controller in mind since its early days. Now
I'd say it's hands down the right place to adapt to the gap between
the total IO limit and what's carried out by the block IO controller.

> > It's a pity and surprise that Google as a big user does not buy into
> > this simple solution. You may prefer more comprehensive controls which
> > may not be easily achievable with the simple scheme. However the
> > complexities and overheads involved in throttling the flusher IOs
> > really upset me.
> 
> Heh, believe it or not, I'm not really wearing google hat on this
> subject and google's writeback people may have completely different
> opinions on the subject than mine.  In fact, I'm not even sure how
> much "work" time I'll be able to assign to this.  :(

OK, understand.

> > The sweet split point would be for balance_dirty_pages() to do cgroup
> > aware buffered write throttling and leave other IOs to the current
> > blkcg. For this to work well as a total solution for end users, I hope
> > we can cooperate and figure out ways for the two throttling entities
> > to work well with each other.
> 
> There's where I'm confused.  How is the said split supposed to work?
> They aren't independent.  I mean, who gets to decide what and where
> are those decisions enforced?

Yeah it's not independent. It's about

- keep block IO cgroup untouched (in its current algorithm, for
  throttling direct IO)

- let balance_dirty_pages() adapt to the throttling target
  
        buffered_write_limit = total_limit - direct_IOs

> > What I'm interested is, what's Google and other users' use schemes in
> > practice. What's their desired interfaces. Whether and how the
> > combined bdp+blkcg throttling can fulfill the goals.
> 
> I'm not too privy of mm and writeback in google and even if so I
> probably shouldn't talk too much about it.  Confidentiality and all.
> That said, I have the general feeling that goog already figured out
> how to at least work around the existing implementation and would be
> able to continue no matter how upstream development fans out.
> 
> That said, wearing the cgroup maintainer and general kernel
> contributor hat, I'd really like to avoid another design mess up.

To me it looks like a pretty clean split and I find it to be an easy
solution (after sorting it out the hard way). I'll show the code and
test results after some time.

> > > Let's please keep the layering clear.  IO limitations will be applied
> > > at the block layer and pressure will be formed there and then
> > > propagated upwards eventually to the originator.  Sure, exposing the
> > > whole information might result in better behavior for certain
> > > workloads, but down the road, say, in three or five years, devices
> > > which can be shared without worrying too much about seeks might be
> > > commonplace and we could be swearing at a disgusting structural mess,
> > > and sadly various cgroup support seems to be a prominent source of
> > > such design failures.
> > 
> > Super-fast storage is coming, which will make us regret making the
> > IO path overly complex.  Spinning disks are not going away anytime soon.
> > I doubt Google is willing to afford the disk seek costs on its
> > millions of disks and has the patience to wait until switching all of
> > the spinning disks to SSDs years later (if that will ever happen).
> 
> This is new.  Let's keep the damn employer out of the discussion.
> While the area I work on is affected by my employment (writeback isn't
> even my area BTW), I'm not gonna do something adverse to upstream even
> if it's beneficial to google and I'm much more likely to do something
> which may hurt google a bit if it's gonna benefit upstream.
> 
> As for the faster / newer storage argument, that is *exactly* why we
> want to keep the layering proper.  Writeback works from the pressure
> from the IO stack.  If IO technology changes, we update the IO stack
> and writeback still works from the pressure.  It may need to be
> adjusted but the principles don't change.

To me, balance_dirty_pages() is *the* proper layer for buffered writes.
It's always there doing 1:1 proportional throttling. Then you try to
kick in and add *double* throttling in the block/cfq layer. Now the
lower layer may enforce 10:1 throttling and push balance_dirty_pages()
away from its balanced state, leading to large fluctuations and
program stalls.  This can be avoided by telling balance_dirty_pages():
"your balance goal is no longer 1:1, but 10:1". With this information
balance_dirty_pages() will behave right. Then there is the question:
if balance_dirty_pages() will work just as well once given the
information, why bother doing the throttling at the lower layer and
"pushing back" the pressure all the way up?

> > It's obvious that your proposal below involves a lot of complexity
> > and overhead, and will hurt performance. It basically involves
> 
> Hmmm... that's not the impression I got from the discussion.
> According to Jan, applying the current writeback logic to cgroup'fied
> bdi shouldn't be too complex, no?

In the sense of "avoidable" complexity :-)

> > - running concurrent flusher threads for cgroups, which adds back the
> >   disk seeks and lock contentions. And still has problems with sync
> >   and shared inodes.
> 
> I agree this is an actual concern but if the user wants to split one
> spindle to multiple resource domains, there's gonna be considerable
> amount of overhead no matter what.  If you want to improve how block
> layer handles the split, you're welcome to dive into the block layer,
> where the split is made, and improve it.
> 
> > - splitting device queue for cgroups, possibly scaling up the pool of
> >   writeback pages (and locked pages in the case of stable pages) which
> >   could stall random processes in the system
> 
> Sure, it'll take up more buffering and memory but that's the overhead
> of the cgroup business.  I want it to be less intrusive at the cost of
> somewhat more resource consumption.  ie. I don't want writeback logic
> itself deeply involved in block IO cgroup enforcement even if that
> means somewhat less efficient resource usage.

balance_dirty_pages() is already deeply involved in dirty throttling.
As you can see from this patchset, the same algorithms can be extended
trivially to work with cgroup IO limits.

buffered write IO controller in balance_dirty_pages()
https://lkml.org/lkml/2012/3/28/275

It does not require forking off the flusher threads and splitting up
the IO queue at all.

> > - the mess of metadata handling
> 
> Does throttling from writeback actually solve this problem?  What
> about fsync()?  Does that already go through balance_dirty_pages()?

balance_dirty_pages() does throttling at safe points outside of fs
transactions/locks.

fsync() only submits IO for already-dirtied pages and won't be
throttled by balance_dirty_pages(). Throttling happens earlier, at
the time the task dirties the pages.

> > - unnecessarily coupled with memcg, in order to take advantage of the
> >   per-memcg dirty limits for balance_dirty_pages() to actually convert
> >   the "pushed back" dirty pages pressure into lowered dirty rate. Why
> >   the hell the users *have to* setup memcg (suffering from all the
> >   inconvenience and overheads) in order to do IO throttling?  Please,
> >   this is really ugly! And the "back pressure" may constantly push the
> >   memcg dirty pages to the limits. I'm not going to support *misuse*
> >   of per-memcg dirty limits like this!
> 
> Writeback sits between blkcg and memcg and it indeed can be hairy to
> consider both sides especially given the current sorry complex state
> of cgroup and I can see why it would seem tempting to add a separate
> controller or at least knobs to support that.  That said, I *think*
> given that memcg controls all other memory parameters it probably
> would make most sense giving that parameter to memcg too.  I don't
> think this is really relevant to this discussion tho.  Who owns
> dirty_limits is a separate issue.

In the "back pressure" scheme, memcg is a must because only it has all
the infrastructure to track dirty pages upon which you can apply some
dirty_limits. Don't tell me you want to account dirty pages in blkcg...

> > I cannot believe you would keep overlooking all the problems without
> > good reasons. Please do tell us the reasons that matter.
> 
> Well, I tried and I hope some of it got through.  I also wrote a lot
> of questions, mainly regarding how what you have in mind is supposed
> to work through what path.  Maybe I'm just not seeing what you're
> seeing but I just can't see where all the IOs would go through and
> come together.  Can you please elaborate more on that?

What I can see is, it looks pretty simple and natural to let
balance_dirty_pages() fill the gap towards a total solution :-)

- add direct IO accounting at some convenient point in the IO path,
  either at submission or at completion; both work

- change several lines of the buffered write IO controller to
  integrate the direct IO rate into the formula to fit the "total
  IO" limit

- in the future, add more accounting as well as feedback control to
  make balance_dirty_pages() work with IOPS and disk time

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-06  9:59         ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-06  9:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hi Tejun,

On Wed, Apr 04, 2012 at 12:33:55PM -0700, Tejun Heo wrote:
> Hey, Fengguang.
> 
> On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > Yeah it should be trivial to apply the balance_dirty_pages()
> > throttling algorithm to the read/direct IOs. However up to now I don't
> > see much added value to *duplicate* the current block IO controller
> > functionalities, assuming the current users and developers are happy
> > with it.
> 
> Heh, trust me.  It's half broken and people ain't happy.  I get that

Yeah, although the balance_dirty_pages() IO controller for buffered
writes looks perfect in itself, it's not enough to meet user demands.

The user expectation should be: hey, please throttle *all* IOs from
this cgroup to this amount, either in absolute bps/iops limits or in
some proportional weight value (or both, with whichever is lower taking
effect).  And if necessary, he may request further limits/weights for
each type of IO inside the cgroup.

Now the blkio cgroup supports direct IO and the balance_dirty_pages()
IO controller supports buffered writes. They are providing
limits/weights for either direct IO or buffered writes, which is fine
if it's pure direct IO or pure buffered write. For the common mixed
IO workloads, it's obviously not enough.

Fortunately, the above gap can be easily filled judging from the
block/cfq IO controller code. By adding some direct IO accounting
and changing several lines of my patches to make use of the collected
stats, the semantics of the blkio.throttle.write_bps interfaces can be
changed from "limit for direct IO" to "limit for direct+buffered IOs".
Ditto for blkio.weight and blkio.write_iops, as long as some
iops/device time stats are made available to balance_dirty_pages().

It would be a fairly *easy* change. :-) It's merely adding some
accounting code and there is no need to change the block IO
controlling algorithm at all. I'll do the work of accounting (which
is basically independent of the IO controlling) and use the new stats
in balance_dirty_pages().

The only problem I can see now, is that balance_dirty_pages() works
per-bdi and blkcg works per-device. So the two ends may not match
nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where
sdb is shared by lv0 and lv1. However such situations should be rare,
and much more acceptable than the problems arising from the "push back"
approach, which impacts everyone.

> your algorithm can be updated to consider all IOs and I believe that
> but what I don't get is how would such information get to writeback
> and in turn how writeback would enforce the result on reads and direct
> IOs.  Through what path?  Will all reads and direct IOs travel through
> balance_dirty_pages() even direct IOs on raw block devices?  Or would
> the writeback algorithm take the configuration from cfq, apply the
> algorithm and give back the limits to enforce to cfq?  If the latter,
> isn't that at least somewhat messed up?

cfq is working well and doesn't need any modifications. Let's just make
balance_dirty_pages() cgroup aware and fill the gap of the current
block IO controller.

If the balance_dirty_pages() throttling algorithms are ever applied
to read and direct IOs, it would be for NFS, CIFS etc. Even
for them, there may be better throttling choices. For example, Trond
mentioned the RPC layer to me during the summit.

> > I did the buffered write IO controller mainly to fill the gap.  If I
> > happen to stand in your way, sorry that's not my initial intention.
> 
> No, no, it's not about standing in my way.  As Vivek said in the other
> reply, it's that the "gap" that you filled was created *because*
> writeback wasn't cgroup aware and now you're in turn filling that gap
> by making writeback work around that "gap".  I mean, my mind boggles.
> Doesn't yours?  I strongly believe everyone's should.

Heh. It's a hard problem indeed. I felt great pains in the IO-less
dirty throttling work. I did a lot of reasoning about it, and have in
fact kept the cgroup IO controller in mind since its early days. Now I'd
say it's hands down the right place to adapt to the gap between the
total IO limit and what's carried out by the block IO controller.

> > It's a pity and surprise that Google as a big user does not buy in
> > this simple solution. You may prefer more comprehensive controls which
> > may not be easily achievable with the simple scheme. However the
> > complexities and overheads involved in throttling the flusher IOs
> > really upsets me. 
> 
> Heh, believe it or not, I'm not really wearing google hat on this
> subject and google's writeback people may have completely different
> opinions on the subject than mine.  In fact, I'm not even sure how
> much "work" time I'll be able to assign to this.  :(

OK, understand.

> > The sweet split point would be for balance_dirty_pages() to do cgroup
> > aware buffered write throttling and leave other IOs to the current
> > blkcg. For this to work well as a total solution for end users, I hope
> > we can cooperate and figure out ways for the two throttling entities
> > to work well with each other.
> 
> There's where I'm confused.  How is the said split supposed to work?
> They aren't independent.  I mean, who gets to decide what and where
> are those decisions enforced?

Yeah it's not independent. It's about

- keep block IO cgroup untouched (in its current algorithm, for
  throttling direct IO)

- let balance_dirty_pages() adapt to the throttling target
  
        buffered_write_limit = total_limit - direct_IOs

> > What I'm interested is, what's Google and other users' use schemes in
> > practice. What's their desired interfaces. Whether and how the
> > combined bdp+blkcg throttling can fulfill the goals.
> 
> I'm not too privy of mm and writeback in google and even if so I
> probably shouldn't talk too much about it.  Confidentiality and all.
> That said, I have the general feeling that goog already figured out
> how to at least work around the existing implementation and would be
> able to continue no matter how upstream development fans out.
> 
> That said, wearing the cgroup maintainer and general kernel
> contributor hat, I'd really like to avoid another design mess up.

To me it looks like a pretty clean split, and I find it to be an easy
solution (after sorting it out the hard way). I'll show the code and
test results after some time.

> > > Let's please keep the layering clear.  IO limitations will be applied
> > > at the block layer and pressure will be formed there and then
> > > propagated upwards eventually to the originator.  Sure, exposing the
> > > whole information might result in better behavior for certain
> > > workloads, but down the road, say, in three or five years, devices
> > > which can be shared without worrying too much about seeks might be
> > > commonplace and we could be swearing at a disgusting structural mess,
> > > and sadly various cgroup support seems to be a prominent source of
> > > such design failures.
> > 
> > Super fast storages are coming which will make us regret to make the
> > IO path over complex.  Spinning disks are not going away anytime soon.
> > I doubt Google is willing to afford the disk seek costs on its
> > millions of disks and has the patience to wait until switching all of
> > the spin disks to SSD years later (if it will ever happen).
> 
> This is new.  Let's keep the damn employer out of the discussion.
> While the area I work on is affected by my employment (writeback isn't
> even my area BTW), I'm not gonna do something adverse to upstream even
> if it's beneficial to google and I'm much more likely to do something
> which may hurt google a bit if it's gonna benefit upstream.
> 
> As for the faster / newer storage argument, that is *exactly* why we
> want to keep the layering proper.  Writeback works from the pressure
> from the IO stack.  If IO technology changes, we update the IO stack
> and writeback still works from the pressure.  It may need to be
> adjusted but the principles don't change.

To me, balance_dirty_pages() is *the* proper layer for buffered writes.
It's always there doing 1:1 proportional throttling. Then you try to
kick in and add *double* throttling in the block/cfq layer. Now the low
layer may enforce 10:1 throttling and push balance_dirty_pages() away
from its balanced state, leading to large fluctuations and program
stalls.  This can be avoided by telling balance_dirty_pages(): "your
balance goal is no longer 1:1, but 10:1". With this information
balance_dirty_pages() will behave right. Then there is the question:
if balance_dirty_pages() works just as well when given that information,
why bother doing the throttling at the low layer and "pushing back" the
pressure all the way up?

> > It's obvious that your below proposal involves a lot of complexities,
> > overheads, and will hurt performance. It basically involves
> 
> Hmmm... that's not the impression I got from the discussion.
> According to Jan, applying the current writeback logic to cgroup'fied
> bdi shouldn't be too complex, no?

In the sense of "avoidable" complexity :-)

> > - running concurrent flusher threads for cgroups, which adds back the
> >   disk seeks and lock contentions. And still has problems with sync
> >   and shared inodes.
> 
> I agree this is an actual concern but if the user wants to split one
> spindle to multiple resource domains, there's gonna be considerable
> amount of overhead no matter what.  If you want to improve how block
> layer handles the split, you're welcome to dive into the block layer,
> where the split is made, and improve it.
> 
> > - splitting device queue for cgroups, possibly scaling up the pool of
> >   writeback pages (and locked pages in the case of stable pages) which
> >   could stall random processes in the system
> 
> Sure, it'll take up more buffering and memory but that's the overhead
> of the cgroup business.  I want it to be less intrusive at the cost of
> somewhat more resource consumption.  ie. I don't want writeback logic
> itself deeply involved in block IO cgroup enforcement even if that
> means somewhat less efficient resource usage.

balance_dirty_pages() is already deeply involved in dirty throttling.
As you can see from this patchset, the same algorithms can be extended
trivially to work with cgroup IO limits.

buffered write IO controller in balance_dirty_pages()
https://lkml.org/lkml/2012/3/28/275

It does not require forking off the flusher threads and splitting up
the IO queue at all.

> > - the mess of metadata handling
> 
> Does throttling from writeback actually solve this problem?  What
> about fsync()?  Does that already go through balance_dirty_pages()?

balance_dirty_pages() does throttling at safe points outside of fs
transactions/locks.

fsync() only submits IO for already dirtied pages and won't be
throttled by balance_dirty_pages(). Throttling happens at earlier
times when the task is dirtying the pages.

> > - unnecessarily coupled with memcg, in order to take advantage of the
> >   per-memcg dirty limits for balance_dirty_pages() to actually convert
> >   the "pushed back" dirty pages pressure into lowered dirty rate. Why
> >   the hell the users *have to* setup memcg (suffering from all the
> >   inconvenience and overheads) in order to do IO throttling?  Please,
> >   this is really ugly! And the "back pressure" may constantly push the
> >   memcg dirty pages to the limits. I'm not going to support *miss use*
> >   of per-memcg dirty limits like this!
> 
> Writeback sits between blkcg and memcg and it indeed can be hairy to
> consider both sides especially given the current sorry complex state
> of cgroup and I can see why it would seem tempting to add a separate
> controller or at least knobs to support that.  That said, I *think*
> given that memcg controls all other memory parameters it probably
> would make most sense giving that parameter to memcg too.  I don't
> think this is really relevant to this discussion tho.  Who owns
> dirty_limits is a separate issue.

In the "back pressure" scheme, memcg is a must because only it has all
the infrastructure to track dirty pages upon which you can apply some
dirty_limits. Don't tell me you want to account dirty pages in blkcg...

> > I cannot believe you would keep overlooking all the problems without
> > good reasons. Please do tell us the reasons that matter.
> 
> Well, I tried and I hope some of it got through.  I also wrote a lot
> of questions, mainly regarding how what you have in mind is supposed
> to work through what path.  Maybe I'm just not seeing what you're
> seeing but I just can't see where all the IOs would go through and
> come together.  Can you please elaborate more on that?

What I can see is, it looks pretty simple and natural to let
balance_dirty_pages() fill the gap towards a total solution :-)

- add direct IO accounting at some convenient point in the IO path
  (the IO submission or completion point, either is fine)

- change several lines of the buffered write IO controller to
  integrate the direct IO rate into the formula to fit the "total
  IO" limit

- in future, add more accounting as well as feedback control to make
  balance_dirty_pages() work with IOPS and disk time

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]   ` <20120404145134.GC12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2012-04-04 15:36     ` Steve French
  2012-04-04 18:49     ` Tejun Heo
@ 2012-04-07  8:00     ` Jan Kara
  2 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-07  8:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

  Hi Vivek,

On Wed 04-04-12 10:51:34, Vivek Goyal wrote:
> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
> [..]
> > IIUC, without cgroup, the current writeback code works more or less
> > like this.  Throwing in cgroup doesn't really change the fundamental
> > design.  Instead of a single pipe going down, we just have multiple
> > pipes to the same device, each of which should be treated separately.
> > Of course, a spinning disk can't be divided that easily and their
> > performance characteristics will be inter-dependent, but the place to
> > solve that problem is where the problem is, the block layer.
> 
> How do you take care of throttling IO to NFS case in this model? Current
> throttling logic is tied to block device and in case of NFS, there is no
> block device.
  Yeah, for throttling NFS or other network filesystems we'd have to come
up with some throttling mechanism at some other level. The problem with
throttling at higher levels is that you have to somehow extract information
from lower levels about the amount of work, so I'm not completely certain
where the right place would be. Possibly it also depends on the intended
usecase - so far I don't know about any real user for this functionality...

> [..]
> > In the discussion, for such implementation, the following obstacles
> > were identified.
> > 
> > * There are a lot of cases where IOs are issued by a task which isn't
> >   the originator.  ie. Writeback issues IOs for pages which are
> >   dirtied by some other tasks.  So, by the time an IO reaches the
> >   block layer, we don't know which cgroup the IO belongs to.
> > 
> >   Recently, block layer has grown support to attach a task to a bio
> >   which causes the bio to be handled as if it were issued by the
> >   associated task regardless of the actual issuing task.  It currently
> >   only allows attaching %current to a bio - bio_associate_current() -
> >   but changing it to support other tasks is trivial.
> > 
> >   We'll need to update the async issuers to tag the IOs they issue but
> >   the mechanism is already there.
> 
> Most likely this tagging will take place in "struct page" and I am not
> sure if we will be allowed to grow size of "struct page" for this reason.
  We can tag inodes and then bios so this should be fine.

> > * Unlike dirty data pages, metadata tends to have strict ordering
> >   requirements and thus is susceptible to priority inversion.  Two
> >   solutions were suggested - 1. allow overdraw for metadata writes so
> >   that low prio metadata writes don't block the whole FS, 2. provide
> >   an interface to query and wait for bdi-cgroup congestion which can
> >   be called from FS metadata paths to throttle metadata operations
> >   before they enter the stream of ordered operations.
> 
> So that probably will mean changing the order of operations also. IIUC,
> in case of fsync (ordered mode), we opened a metadata transaction first,
> then tried to flush all the cached data and then flush metadata. So if
> fsync is throttled, all the metadata operations behind it will get
> serialized for ext3/ext4.
> 
> So you seem to be suggesting that we change the design so that a metadata
> operation is not thrown into the ordered stream till we have finished
> writing all the data back to disk? I am not a filesystem developer, so
> I don't know how feasible this change is.
> 
> This is just one of the points. In the past while talking to Dave Chinner,
> he mentioned that in XFS, if two cgroups fall into same allocation group
> then there were cases where IO of one cgroup can get serialized behind
> other.
> 
> In general, the core of the issue is that filesystems are not cgroup aware
> and if you do throttling below filesystems, then invariably one or another
> serialization issue will come up, and I am concerned that we will be constantly
> fixing those serialization issues. Or the design point could be so central
> to filesystem design that it can't be changed.
  We talked about this at LSF and Dave Chinner had the idea that we could
make processes wait at the time when a transaction is started. At that time
we don't hold any global locks so process can be throttled without
serializing other processes. This effectively builds some cgroup awareness
into filesystems, but a pretty simple one, so it should be doable.

> In general, if you do throttling deeper in the stack and build back
> pressure, then all the layers sitting above should be cgroup aware
> to avoid problems. Two layers identified so far are writeback and
> filesystems. Is it really worth the complexity? How about doing
> throttling in higher layers when IO is entering the kernel, and
> keeping the proportional IO logic at the lowest level so the current
> mechanism of building pressure continues to work?
  I would like to keep a single throttling mechanism for different limiting
methods - i.e. handle proportional IO the same way as IO hard limits. So we
cannot really rely on the fact that throttling is work preserving.

The advantage of throttling at IO layer is that we can keep all the details
inside it and only export pretty minimal information (like is bdi congested
for given cgroup) to upper layers. If we wanted to do throttling at upper
layers (such as Fengguang's buffered write throttling), we need to export
the internal details to allow effective throttling...

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]     ` <20120407080027.GA2584-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
@ 2012-04-10 16:23       ` Steve French
  2012-04-10 18:06       ` Vivek Goyal
  1 sibling, 0 replies; 262+ messages in thread
From: Steve French @ 2012-04-10 16:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

On Sat, Apr 7, 2012 at 3:00 AM, Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> wrote:
>  Hi Vivek,
>
> On Wed 04-04-12 10:51:34, Vivek Goyal wrote:
>> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
>> [..]
>> > IIUC, without cgroup, the current writeback code works more or less
>> > like this.  Throwing in cgroup doesn't really change the fundamental
>> > design.  Instead of a single pipe going down, we just have multiple
>> > pipes to the same device, each of which should be treated separately.
>> > Of course, a spinning disk can't be divided that easily and their
>> > performance characteristics will be inter-dependent, but the place to
>> > solve that problem is where the problem is, the block layer.
>>
>> How do you take care of throttling IO in the NFS case in this model? The
>> current throttling logic is tied to a block device, and in the case of NFS,
>> there is no block device.
>  Yeah, for throttling NFS or other network filesystems we'd have to come
> up with some throttling mechanism at some other level. The problem with
> throttling at higher levels is that you have to somehow extract information
> from lower levels about the amount of work, so I'm not completely certain
> where the right place would be. Possibly it also depends on the intended
> usecase - so far I don't know about any real user for this functionality...

Remember to distinguish between the two ends of the network file system;
they have slightly different problems. The client has to be able to
expose the number of requests (and size of writes, or equivalently the
number of pages it can write at one time) so that writeback is not done
too aggressively. File servers have to be able to
discover the i/o limits of the underlying volume dynamically (not the
block device, but potentially a pool of devices) so the server can tell
the client how much i/o it can send. For an SMB2 server (Samba) and
eventually for NFS, knowing how many simultaneous requests the volume
can support will allow them to sanely set the number of "credits"
on each response - i.e. tell the client how many requests
are allowed in flight to a particular export.
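
The credit mechanism can be sketched as a toy model (names are hypothetical;
the real SMB2 credit accounting lives in the protocol code, this only shows
the flow-control shape):

```python
class CreditWindow:
    """Toy model of credit-based flow control: the server grants credits
    sized from the I/O limits of the underlying volume, and the client may
    only keep that many requests in flight. All names are hypothetical,
    not the actual SMB2/Samba structures."""

    def __init__(self, initial_credits):
        self.credits = initial_credits

    def try_send(self):
        """Consume one credit; False means the client must back off,
        which is exactly where writeback would stop pushing pages."""
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def on_response(self, granted):
        # Each server response carries a fresh grant, letting the server
        # tune the client's in-flight window to the volume's capacity.
        self.credits += granted
```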

In the case of block device throttling - other than the file system
internally using such APIs, who would use block-device-specific
throttling? Only the file system knows where it wants to put hot data,
and in the case of btrfs, doesn't the file system manage the
storage pool itself? The block device should be transparent to the
user in the long run, with only the volume visible.


-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]     ` <20120407080027.GA2584-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
  2012-04-10 16:23       ` [Lsf] " Steve French
@ 2012-04-10 18:06       ` Vivek Goyal
  1 sibling, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-10 18:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Sat, Apr 07, 2012 at 10:00:27AM +0200, Jan Kara wrote:
Hi Jan,

[..]
> > In general, the core of the issue is that filesystems are not cgroup aware
> > and if you do throttling below filesystems, then invariably one or other
> > serialization issue will come up and I am concerned that we will be constantly
> > fixing those serialization issues. Or the design point could be so central
> > to filesystem design that it can't be changed.
>   We talked about this at LSF and Dave Chinner had the idea that we could
> make processes wait at the time when a transaction is started. At that time
> we don't hold any global locks so process can be throttled without
> serializing other processes. This effectively builds some cgroup awareness
> into filesystems but pretty simple one so it should be doable.

Ok. So what is the meaning of "make process wait" here? What will it
depend on? I am thinking of a case where a process has 100MB of dirty
data, has a 10MB/s write limit, and issues fsync. So before that process
is able to open a transaction, one needs to wait at least 10 seconds
(assuming other processes are not doing IO in the same cgroup).

If this wait is based on making sure all dirty data has been written back
before opening the transaction, then it will work without any interaction
with the block layer and sounds more feasible.

> 
> > In general, if you do throttling deeper in the stack and build back
> > pressure, then all the layers sitting above should be cgroup aware
> > to avoid problems. Two layers identified so far are writeback and
> > filesystems. Is it really worth the complexity? How about doing
> > throttling in higher layers when IO is entering the kernel and
> > keep proportional IO logic at the lowest level and current mechanism
> > of building pressure continues to work?
>   I would like to keep a single throttling mechanism for different limiting
> methods - i.e. handle proportional IO the same way as IO hard limits. So we
> cannot really rely on the fact that throttling is work preserving.
> 
> The advantage of throttling at IO layer is that we can keep all the details
> inside it and only export pretty minimal information (like is bdi congested
> for given cgroup) to upper layers. If we wanted to do throttling at upper
> layers (such as Fengguang's buffered write throttling), we need to export
> the internal details to allow effective throttling...

For absolute throttling we really don't have to expose any details. In
fact, in my implementation of throttling buffered writes, I exported just
a single function to be called from the bdi dirty rate limiting code. The
caller will simply sleep long enough depending on the size of IO it is
doing and how many other processes are doing IO in the same cgroup.

So the implementation was still in the block layer and only a single
function was exposed to higher layers.
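
A minimal sketch of what such a single exported function might compute (the
name and the equal-share-per-writer policy are assumptions for illustration,
not the actual implementation):

```python
def throttle_dirty_sleep(cgroup_limit_bps, nbytes, nr_writers):
    """Toy model of the single exported function described above (name
    hypothetical): called from the bdi dirty rate limiting path, it
    returns how long the dirtying task should sleep so the cgroup as a
    whole stays under its absolute write limit."""
    # assume each of nr_writers gets an equal share of the cgroup's limit
    share = cgroup_limit_bps / max(nr_writers, 1)
    return nbytes / share
```

For example, four writers dirtying 1 MB each under a 10 MB/s cgroup limit
each get a 2.5 MB/s share, so dirtying 1 MB costs a 0.4 s sleep.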

One more factor makes absolute throttling interesting, and that is global
throttling as opposed to per-device throttling. For example, in the case of
btrfs, there is no single stacked device on which to put total throttling
limits.

So if filesystems can handle the serialization issue, then the back pressure
method looks cleaner (though more complex).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]       ` <CAH2r5mvLVnM3Se5vBBsYzwaz5Ckp3i6SVnGp2T0XaGe9_u8YYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-04-10 18:16         ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-10 18:16 UTC (permalink / raw)
  To: Steve French
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, Apr 10, 2012 at 11:23:16AM -0500, Steve French wrote:

[..]
> In the case of block device throttling - other than the file system
> internally using such APIs who would use block device specific
> throttling - only the file system knows where it wants to put hot data,
> and in the case of btrfs, doesn't the file system manage the
> storage pool.   The block device should be transparent to the
> user in the long run, and only the volume visible.

This is a good point. I guess this goes back to Jan's question of what the
intended use case of absolute throttling is. Having a dependency on
per-device limits has the drawback that the user must know the exact details
of the storage stack, and it assumes that there is one single aggregation
point of block devices (which is not true in the case of btrfs).

If the user simply wants something like "I don't want a backup
process to be writing at more than 50MB/s" (so that other processes doing
IO to the same filesystem are affected less), then it is a case of global
throttling, and per-device throttling really does not gel well.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-10 18:06       ` Vivek Goyal
  (?)
@ 2012-04-10 21:05           ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-10 21:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

  Hi Vivek,

On Tue 10-04-12 14:06:53, Vivek Goyal wrote:
> On Sat, Apr 07, 2012 at 10:00:27AM +0200, Jan Kara wrote:
> > > In general, the core of the issue is that filesystems are not cgroup aware
> > > and if you do throttling below filesystems, then invariably one or other
> > > serialization issue will come up and I am concerned that we will be constantly
> > > fixing those serialization issues. Or the design point could be so central
> > > to filesystem design that it can't be changed.
> >   We talked about this at LSF and Dave Chinner had the idea that we could
> > make processes wait at the time when a transaction is started. At that time
> > we don't hold any global locks, so a process can be throttled without
> > serializing other processes. This effectively builds some cgroup awareness
> > into filesystems, but a pretty simple one, so it should be doable.
> 
> Ok. So what is the meaning of "make process wait" here? What will it
> depend on? I am thinking of a case where a process has 100MB of dirty
> data, has a 10MB/s write limit, and issues fsync. So before that process
> is able to open a transaction, one needs to wait at least 10 seconds
> (assuming other processes are not doing IO in the same cgroup).
  The original idea was that we'd have a "bdi-congested-for-cgroup" flag
and the process starting a transaction would wait for this flag to get
cleared before starting a new transaction. This would be easy to implement
in filesystems and won't have serialization issues. But my knowledge of
blk-throttle is lacking, so there might be some problems with this approach.
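
A userspace sketch of that wait (names hypothetical; a real implementation
would sleep on a waitqueue rather than poll):

```python
import time

def wait_for_cgroup_uncongested(bdi, cg, is_congested, poll=0.01):
    """Toy model of the flag-based proposal above (names hypothetical):
    before starting a transaction, wait until the block layer clears the
    per-cgroup congestion flag for this bdi. No global filesystem lock is
    held while waiting, so other cgroups' transactions are not serialized
    behind the throttled one."""
    checks = 0
    while is_congested(bdi, cg):
        checks += 1
        time.sleep(poll)    # a real kernel would sleep on a waitqueue
    return checks           # how many times the bdi was found congested
```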

> If this wait is based on making sure all dirty data has been written back
> before opening transaction, then it will work without any interaction with
> block layer and sounds more feasible.
> 
> > 
> > > In general, if you do throttling deeper in the stack and build back
> > > pressure, then all the layers sitting above should be cgroup aware
> > > to avoid problems. Two layers identified so far are writeback and
> > > filesystems. Is it really worth the complexity? How about doing
> > > throttling in higher layers when IO is entering the kernel and
> > > keep proportional IO logic at the lowest level and current mechanism
> > > of building pressure continues to work?
> >   I would like to keep a single throttling mechanism for different limiting
> > methods - i.e. handle proportional IO the same way as IO hard limits. So we
> > cannot really rely on the fact that throttling is work preserving.
> > 
> > The advantage of throttling at the IO layer is that we can keep all the
> > details inside it and only export pretty minimal information (like whether
> > the bdi is congested for a given cgroup) to upper layers. If we wanted to do
> > throttling at upper layers (such as Fengguang's buffered write throttling),
> > we would need to export the internal details to allow effective throttling...
> 
> For absolute throttling we really don't have to expose any details. In
> fact, in my implementation of throttling buffered writes, I just exported
> a single function to be called in the bdi dirty rate-limit path. The caller
> will simply sleep long enough depending on the size of the IO it is doing
> and how many other processes are doing IO in the same cgroup.
>
> So implementation was still in block layer and only a single function
> was exposed to higher layers.
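The single exported function Vivek describes amounts to simple rate arithmetic. The sketch below is illustrative only (the function name and the equal-share model are assumptions, not the actual patch interface): the sleep scales with the IO size and with how many tasks share the cgroup's absolute limit.

```python
def buffered_write_sleep_time(io_bytes, cgroup_limit_bps, nr_tasks_in_cgroup):
    # Assume each task gets an equal share of the cgroup's absolute limit;
    # the sleep is then the time the IO takes at that per-task rate.
    per_task_bps = cgroup_limit_bps / nr_tasks_in_cgroup
    return io_bytes / per_task_bps

# Vivek's fsync example: 100MB of dirty data against a 10MB/s limit with a
# single task in the cgroup -> the task must wait about 10 seconds.
print(buffered_write_sleep_time(100 << 20, 10 << 20, 1))  # 10.0
```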
  OK, I see.
 
> One more factor makes absolute throttling interesting, and that is global
> throttling rather than per-device throttling. For example, in the case of
> btrfs, there is no single stacked device on which to put total throttling
> limits.
  Yes. My intended interface for the throttling is the bdi. But you are right
that it does not exactly match the fact that the throttling happens per device,
so it might get tricky. Which brings up a question - shouldn't the
throttling that blk-throttle does rather happen at the bdi layer? The
uses of the functionality I have in mind would match that better.

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [RFC] writeback and cgroup
       [not found]           ` <20120410210505.GE4936-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
@ 2012-04-10 21:20             ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-10 21:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Tue, Apr 10, 2012 at 11:05:05PM +0200, Jan Kara wrote:

[..]
> > Ok. So what is the meaning of "make process wait" here? What will it be
> > dependent on? I am thinking of a case where a process has 100MB of dirty
> > data, has a 10MB/s write limit, and it issues fsync. So before that process
> > is able to open a transaction, one needs to wait at least 10 seconds
> > (assuming other processes are not doing IO in the same cgroup). 
>   The original idea was that we'd have "bdi-congested-for-cgroup" flag
> and the process starting a transaction will wait for this flag to get
> cleared before starting a new transaction. This will be easy to implement
> in filesystems and won't have serialization issues. But my knowledge of
> blk-throttle is lacking so there might be some problems with this approach.

I have implemented and posted patches for per bdi per cgroup congestion
flag. The only problem I see with that is that a group might be congested
for a long time because of lots of other IO happening (say direct IO) and
if you keep on backing off and never submit the metadata IO (transaction),
you get starved. And if you go ahead and submit IO in a congested group,
we are back to serialization issue.
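The dilemma Vivek points out can be made concrete with a toy model: a waiter that only ever backs off while the group stays congested (say, under steady direct IO) never submits at all. One illustrative escape hatch, assumed here rather than taken from any posted patch, is a deadline after which the metadata IO is submitted anyway, accepting the serialization risk.

```python
def wait_or_submit(congested_at_tick, deadline_ticks):
    """Return the tick at which the metadata IO is finally submitted."""
    tick = 0
    # Back off while congested, but never beyond the deadline -
    # without the deadline this loop starves forever under constant load.
    while congested_at_tick(tick) and tick < deadline_ticks:
        tick += 1
    return tick  # submit here, congested or not

always_congested = lambda t: True   # e.g. sustained direct IO in the group
print(wait_or_submit(always_congested, deadline_ticks=100))  # 100: deadline fires
clears_at_3 = lambda t: t < 3       # congestion clears after 3 ticks
print(wait_or_submit(clears_at_3, deadline_ticks=100))       # 3
```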

[..]
> > One more factor makes absolute throttling interesting and that is global
> > throttling and not per device throttling. For example in case of btrfs,
> > there is no single stacked device on which to put total throttling
> > limits.
>   Yes. My intended interface for the throttling is bdi. But you are right
> it does not exactly match the fact that the throttling happens per device
> so it might get tricky. Which brings up a question - shouldn't the
> throttling blk-throttle does rather happen at bdi layer? Because the
> uses of the functionality I have in mind would match that better.

I guess throttling at bdi layer will take care of network filesystem
case too?  But isn't the notion of a "bdi" internal to the kernel? The user
does not really program things in terms of bdis.

Also, a per-bdi limit mechanism will not solve the issue of global throttling,
where in the case of btrfs an IO might go to multiple bdis. So throttling
limits are not total but per bdi.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [RFC] writeback and cgroup
       [not found]             ` <20120410212041.GP21801-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-10 22:24               ` Jan Kara
  0 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-10 22:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Tue 10-04-12 17:20:41, Vivek Goyal wrote:
> On Tue, Apr 10, 2012 at 11:05:05PM +0200, Jan Kara wrote:
> 
> [..]
> > > Ok. So what is the meaning of "make process wait" here? What will it be
> > > dependent on? I am thinking of a case where a process has 100MB of dirty
> > > data, has a 10MB/s write limit, and it issues fsync. So before that process
> > > is able to open a transaction, one needs to wait at least 10 seconds
> > > (assuming other processes are not doing IO in the same cgroup). 
> >   The original idea was that we'd have "bdi-congested-for-cgroup" flag
> > and the process starting a transaction will wait for this flag to get
> > cleared before starting a new transaction. This will be easy to implement
> > in filesystems and won't have serialization issues. But my knowledge of
> > blk-throttle is lacking so there might be some problems with this approach.
> 
> I have implemented and posted patches for per bdi per cgroup congestion
> flag. The only problem I see with that is that a group might be congested
> for a long time because of lots of other IO happening (say direct IO) and
> if you keep on backing off and never submit the metadata IO (transaction),
> you get starved. And if you go ahead and submit IO in a congested group,
> we are back to serialization issue.
  Clearly, we mustn't throttle metadata IO once it gets to the block layer.
That's why we discuss throttling of processes at transaction start after
all. But I agree starvation is an issue - I originally thought blk-throttle
throttles synchronously, which wouldn't have starvation issues. But when
that's not the case, things are a bit more tricky. We could treat
transaction start as an IO of some size (since we already have some
estimate of how large a transaction will be when we are starting it) and let
the transaction start only when our "virtual" IO would be submitted, but
I feel that maybe gets too complicated... Maybe we could just delay the
transaction start by the amount reported from the blk-throttle layer?
Something along the lines of the callback for throttling you implemented?
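Jan's "virtual IO" idea reduces to charging the estimated transaction size against the cgroup's budget. The sketch below uses simple token-bucket arithmetic as a stand-in; the function name and the accounting model are assumptions for illustration, not how blk-throttle actually computes dispatch times.

```python
def transaction_start_delay(estimated_tx_bytes, limit_bps, tokens_bytes):
    # Treat transaction start as a "virtual" IO of the estimated size:
    # credit already accumulated for this cgroup covers part of the charge,
    # and the remainder must be waited out at the configured rate.
    uncovered = max(0, estimated_tx_bytes - tokens_bytes)
    return uncovered / limit_bps

# A 4MB transaction against a 1MB/s limit with 1MB of accumulated credit
# -> delay the transaction start by 3 seconds.
print(transaction_start_delay(4 << 20, 1 << 20, 1 << 20))  # 3.0
```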

> [..]
> > > One more factor makes absolute throttling interesting and that is global
> > > throttling and not per device throttling. For example in case of btrfs,
> > > there is no single stacked device on which to put total throttling
> > > limits.
> >   Yes. My intended interface for the throttling is bdi. But you are right
> > it does not exactly match the fact that the throttling happens per device
> > so it might get tricky. Which brings up a question - shouldn't the
> > throttling blk-throttle does rather happen at bdi layer? Because the
> > uses of the functionality I have in mind would match that better.
> 
> I guess throttling at bdi layer will take care of network filesystem
> case too?
  Yes. At least for the client side. On the server side, Steve wants the
server to have insight into how much IO we could push in the future so that
it can limit the number of outstanding requests, if I understand him right.
I'm not sure we really want / are able to provide this amount of knowledge
to filesystems, much less userspace...

> But isn't the notion of "bdi" internal to kernel and user does
> not really program thing in terms of bdi.
  Well, it is. But we already have per-bdi tunables (e.g. readahead) that
are exported in /sys/block/<device>/queue/, so we have some precedent.
 
> Also per bdi limit mechanism will not solve the issue of global throttling
> where in case of btrfs an IO might go to multiple bdi's. So throttling limits
> are not total but per bdi.
  Well, btrfs plays tricks with bdi's but there is a special bdi called
"btrfs" which backs the whole filesystem and that is what's put in
sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a
global bdi to work with.

									Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-10 21:20             ` Vivek Goyal
@ 2012-04-10 22:24               ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-10 22:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm,
	sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni,
	lsf

On Tue 10-04-12 17:20:41, Vivek Goyal wrote:
> On Tue, Apr 10, 2012 at 11:05:05PM +0200, Jan Kara wrote:
> 
> [..]
> > > Ok. So what is the meaning of "make process wait" here? What it will be
> > > dependent on? I am thinking of a case where a process has 100MB of dirty
> > > data, has 10MB/s write limit and it issues fsync. So before that process
> > > is able to open a transaction, one needs to wait atleast 10seconds
> > > (assuming other processes are not doing IO in same cgroup). 
> >   The original idea was that we'd have "bdi-congested-for-cgroup" flag
> > and the process starting a transaction will wait for this flag to get
> > cleared before starting a new transaction. This will be easy to implement
> > in filesystems and won't have serialization issues. But my knowledge of
> > blk-throttle is lacking so there might be some problems with this approach.
> 
> I have implemented and posted patches for per bdi per cgroup congestion
> flag. The only problem I see with that is that a group might be congested
> for a long time because of lots of other IO happening (say direct IO) and
> if you keep on backing off and never submit the metadata IO (transaction),
> you get starved. And if you go ahead and submit IO in a congested group,
> we are back to serialization issue.
  Clearly, we mustn't throttle metadata IO once it gets to the block layer.
That's why we discuss throttling of processes at transaction start after
all. But I agree starvation is an issue - I originally thought blk-throttle
throttles synchronously which wouldn't have starvation issues. But when
that's not the case things are a bit more tricky. We could treat
transaction start as an IO of some size (since we already have some
estimation how large a transaction will be when we are starting it) and let
the transaction start only when our "virtual" IO would be submitted but
I feel that gets maybe too complicated... Maybe we could just delay the
transaction start by the amount reported from blk-throttle layer? Something
along your callback for throttling you implemented?

> [..]
> > > One more factor makes absolute throttling interesting and that is global
> > > throttling and not per device throttling. For example in case of btrfs,
> > > there is no single stacked device on which to put total throttling
> > > limits.
> >   Yes. My intended interface for the throttling is bdi. But you are right
> > it does not exactly match the fact that the throttling happens per device
> > so it might get tricky. Which brings up a question - shouldn't the
> > throttling blk-throttle does rather happen at bdi layer? Because the
> > uses of the functionality I have in mind would match that better.
> 
> I guess throttling at bdi layer will take care of network filesystem
> case too?
  Yes. At least for client side. On sever side Steve wants server to have
insight into how much IO we could push in future so that it can limit
number of outstanding requests if I understand him right. I'm not sure we
really want / are able to provide this amount of knowledge to filesystems
even less userspace...

> But isn't the notion of "bdi" internal to kernel and user does
> not really program thing in terms of bdi.
  Well, it is. But we already have per-bdi tunables (e.g.  readahead) that
are exported in /sys/block/<device>/queue/ so we have some precedens.
 
> Also per bdi limit mechanism will not solve the issue of global throttling
> where in case of btrfs an IO might go to multiple bdi's. So throttling limits
> are not total but per bdi.
  Well, btrfs plays tricks with bdi's but there is a special bdi called
"btrfs" which backs the whole filesystem and that is what's put in
sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a
global bdi to work with.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-10 22:24               ` Jan Kara
  0 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-10 22:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm,
	sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni,
	lsf

On Tue 10-04-12 17:20:41, Vivek Goyal wrote:
> On Tue, Apr 10, 2012 at 11:05:05PM +0200, Jan Kara wrote:
> 
> [..]
> > > Ok. So what is the meaning of "make process wait" here? What it will be
> > > dependent on? I am thinking of a case where a process has 100MB of dirty
> > > data, has 10MB/s write limit and it issues fsync. So before that process
> > > is able to open a transaction, one needs to wait atleast 10seconds
> > > (assuming other processes are not doing IO in same cgroup). 
> >   The original idea was that we'd have "bdi-congested-for-cgroup" flag
> > and the process starting a transaction will wait for this flag to get
> > cleared before starting a new transaction. This will be easy to implement
> > in filesystems and won't have serialization issues. But my knowledge of
> > blk-throttle is lacking so there might be some problems with this approach.
> 
> I have implemented and posted patches for per bdi per cgroup congestion
> flag. The only problem I see with that is that a group might be congested
> for a long time because of lots of other IO happening (say direct IO) and
> if you keep on backing off and never submit the metadata IO (transaction),
> you get starved. And if you go ahead and submit IO in a congested group,
> we are back to serialization issue.
  Clearly, we mustn't throttle metadata IO once it gets to the block layer.
That's why we discuss throttling of processes at transaction start after
all. But I agree starvation is an issue - I originally thought blk-throttle
throttles synchronously which wouldn't have starvation issues. But when
that's not the case things are a bit more tricky. We could treat
transaction start as an IO of some size (since we already have some
estimation how large a transaction will be when we are starting it) and let
the transaction start only when our "virtual" IO would be submitted but
I feel that gets maybe too complicated... Maybe we could just delay the
transaction start by the amount reported from blk-throttle layer? Something
along your callback for throttling you implemented?

> [..]
> > > One more factor makes absolute throttling interesting and that is global
> > > throttling and not per device throttling. For example in case of btrfs,
> > > there is no single stacked device on which to put total throttling
> > > limits.
> >   Yes. My intended interface for the throttling is bdi. But you are right
> > it does not exactly match the fact that the throttling happens per device
> > so it might get tricky. Which brings up a question - shouldn't the
> > throttling blk-throttle does rather happen at bdi layer? Because the
> > uses of the functionality I have in mind would match that better.
> 
> I guess throttling at bdi layer will take care of network filesystem
> case too?
  Yes. At least for the client side. On the server side, Steve wants the
server to have insight into how much IO we could push in the future so
that it can limit the number of outstanding requests, if I understand him
right. I'm not sure we really want / are able to provide this amount of
knowledge to filesystems, let alone userspace...

> But isn't the notion of "bdi" internal to kernel and user does
> not really program thing in terms of bdi.
  Well, it is. But we already have per-bdi tunables (e.g. readahead) that
are exported in /sys/block/<device>/queue/ so we have some precedent.
 
> Also per bdi limit mechanism will not solve the issue of global throttling
> where in case of btrfs an IO might go to multiple bdi's. So throttling limits
> are not total but per bdi.
  Well, btrfs plays tricks with bdi's but there is a special bdi called
"btrfs" which backs the whole filesystem and that is what's put in
sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a
global bdi to work with.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-10 22:24               ` Jan Kara
@ 2012-04-11 15:40                   ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-11 15:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:

[..]
> > I have implemented and posted patches for per bdi per cgroup congestion
> > flag. The only problem I see with that is that a group might be congested
> > for a long time because of lots of other IO happening (say direct IO) and
> > if you keep on backing off and never submit the metadata IO (transaction),
> > you get starved. And if you go ahead and submit IO in a congested group,
> > we are back to serialization issue.
>   Clearly, we mustn't throttle metadata IO once it gets to the block layer.
> That's why we discuss throttling of processes at transaction start after
> all. But I agree starvation is an issue - I originally thought blk-throttle
> throttles synchronously which wouldn't have starvation issues. But when
> that's not the case things are a bit more tricky. We could treat
> transaction start as an IO of some size (since we already have some
> estimation how large a transaction will be when we are starting it) and let
> the transaction start only when our "virtual" IO would be submitted but
> I feel that gets maybe too complicated... Maybe we could just delay the
> transaction start by the amount reported from blk-throttle layer? Something
> along your callback for throttling you implemented?

I think I have lost you now. It probably stems from the fact that I don't
know much about transactions and filesystems.
 
So all the metadata IO will happen through the journaling thread, which
will be in the root group and should remain unthrottled. So any journal
IO going to disk should remain unthrottled.

Now, IIRC, the fsync problem with throttling was that we had opened a
transaction but could not write it back to disk because we had to wait
for all the cached data to go to disk (which is throttled). So my
question is: can't we first wait for all the data to be flushed to disk
and then open a transaction for the metadata? The metadata will be
unthrottled, so the filesystem will not have to do any tricks like
checking whether the bdi is congested or not.

IOW, can't we first wait for the dependent operations to finish before we
throw anything into the metadata stream?

[..]
> > I guess throttling at bdi layer will take care of network filesystem
> > case too?
>   Yes. At least for client side. On server side Steve wants server to have
> insight into how much IO we could push in future so that it can limit
> number of outstanding requests if I understand him right. I'm not sure we
> really want / are able to provide this amount of knowledge to filesystems
> even less userspace...

I am not sure what that means, but couldn't the server simply query the
bdi and read the configured rate? Then it knows at what rate IO will go to
disk and can make predictions about the future.

> 
> > But isn't the notion of "bdi" internal to kernel and user does
> > not really program thing in terms of bdi.
>   Well, it is. But we already have per-bdi tunables (e.g.  readahead) that
> are exported in /sys/block/<device>/queue/ so we have some precedent.

Ok, so they are exposed as if they were queue/device tunables but are
internally stored in the bdi and work accordingly.

>  
> > Also per bdi limit mechanism will not solve the issue of global throttling
> > where in case of btrfs an IO might go to multiple bdi's. So throttling limits
> > are not total but per bdi.
>   Well, btrfs plays tricks with bdi's but there is a special bdi called
> "btrfs" which backs the whole filesystem and that is what's put in
> sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a
> global bdi to work with.

Ok, that's good to know. How would we configure this special bdi? I am
assuming there is no backing device visible in /sys/block/<device>/queue/?
The same is true for network filesystems.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                   ` <20120411154005.GD16692-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-11 15:45                     ` Vivek Goyal
  2012-04-11 19:22                     ` Jan Kara
  2012-04-14 12:25                     ` [Lsf] " Peter Zijlstra
  2 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-11 15:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote:
> On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
> 
> [..]
> > > I have implemented and posted patches for per bdi per cgroup congestion
> > > flag. The only problem I see with that is that a group might be congested
> > > for a long time because of lots of other IO happening (say direct IO) and
> > > if you keep on backing off and never submit the metadata IO (transaction),
> > > you get starved. And if you go ahead and submit IO in a congested group,
> > > we are back to serialization issue.
> >   Clearly, we mustn't throttle metadata IO once it gets to the block layer.
> > That's why we discuss throttling of processes at transaction start after
> > all. But I agree starvation is an issue - I originally thought blk-throttle
> > throttles synchronously which wouldn't have starvation issues.

Current bio throttling is asynchronous. A process can submit the bio
and go back and wait for the bio to finish. That bio will be queued at the
device queue in a per-cgroup queue and will be dispatched to the device
according to the configured IO rate for the cgroup.

The additional feature for buffered-write throttling (which never went
upstream) was synchronous in nature. That is, we were actively putting the
writer to sleep on a per-cgroup wait queue in the request queue and waking
it up when it could do further IO based on cgroup limits.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                     ` <20120411154531.GE16692-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-11 17:05                       ` Jan Kara
  0 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-11 17:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Wed 11-04-12 11:45:31, Vivek Goyal wrote:
> On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote:
> > On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
> > 
> > [..]
> > > > I have implemented and posted patches for per bdi per cgroup congestion
> > > > flag. The only problem I see with that is that a group might be congested
> > > > for a long time because of lots of other IO happening (say direct IO) and
> > > > if you keep on backing off and never submit the metadata IO (transaction),
> > > > you get starved. And if you go ahead and submit IO in a congested group,
> > > > we are back to serialization issue.
> > >   Clearly, we mustn't throttle metadata IO once it gets to the block layer.
> > > That's why we discuss throttling of processes at transaction start after
> > > all. But I agree starvation is an issue - I originally thought blk-throttle
> > > throttles synchronously which wouldn't have starvation issues.
> 
> Current bio throttling is asynchronous. Process can submit the bio
> and go back and wait for bio to finish. That bio will be queued at device
> queue in a per cgroup queue and will be dispatched to device according
> to configured IO rate for cgroup.
> 
> The additional feature for buffered throttle (which never went upstream),
> was synchronous in nature. That is we were actively putting writer to
> sleep on a per cgroup wait queue in the request queue and wake it up when
> it can do further IO based on cgroup limits.
  Hmm, but then there would be starvation issues similar to those with my
simple scheme, because async IO could always use the whole available
bandwidth. Mixing sync & async throttling is really problematic... I'm
wondering how useful the async throttling is, because we will block on
request allocation once there are more than nr_requests pending requests,
so at that point throttling becomes sync anyway.

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                       ` <20120411170542.GB16008-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
@ 2012-04-11 17:23                         ` Vivek Goyal
  2012-04-17 21:48                         ` Tejun Heo
  1 sibling, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-11 17:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote:
> On Wed 11-04-12 11:45:31, Vivek Goyal wrote:
> > On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote:
> > > On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
> > > 
> > > [..]
> > > > > I have implemented and posted patches for per bdi per cgroup congestion
> > > > > flag. The only problem I see with that is that a group might be congested
> > > > > for a long time because of lots of other IO happening (say direct IO) and
> > > > > if you keep on backing off and never submit the metadata IO (transaction),
> > > > > you get starved. And if you go ahead and submit IO in a congested group,
> > > > > we are back to serialization issue.
> > > >   Clearly, we mustn't throttle metadata IO once it gets to the block layer.
> > > > That's why we discuss throttling of processes at transaction start after
> > > > all. But I agree starvation is an issue - I originally thought blk-throttle
> > > > throttles synchronously which wouldn't have starvation issues.
> > 
> > Current bio throttling is asynchronous. Process can submit the bio
> > and go back and wait for bio to finish. That bio will be queued at device
> > queue in a per cgroup queue and will be dispatched to device according
> > to configured IO rate for cgroup.
> > 
> > The additional feature for buffered throttle (which never went upstream),
> > was synchronous in nature. That is we were actively putting writer to
> > sleep on a per cgroup wait queue in the request queue and wake it up when
> > it can do further IO based on cgroup limits.
>   Hmm, but then there would be similar starvation issues as with my simple
> scheme because async IO could always use the whole available bandwidth.

It depends on how the throttling logic decides to divide bandwidth between
sync and async. I had chosen a round-robin policy of dispatching some
bios and then allowing some async IO, etc. So async IO was not consuming
the whole available bandwidth. We could easily tilt it in favor of sync IO
with a tunable knob.

> Mixing of sync & async throttling is really problematic... I'm wondering
> how useful the async throttling is.

If sync throttling is useful, then async throttling has to be useful too?
Especially given the fact that async IO often consumes all bandwidth,
impacting sync latencies.

> Because we will block on request
> allocation once there are more than nr_requests pending requests so at that
> point throttling becomes sync anyway.

First of all, the flushers will block on nr_requests, not the actual writers.
And secondly, we thought of having per-group request descriptors so that
writes of one group don't impact others. So once the writes of a group
are backlogged, the flusher can query the congestion status of the group
and not submit any more writes to that group. As some writes are already
queued in that group, writes will not be starved. Well, in the case of
deadline, even direct writes go into the write queue, so theoretically we
can hit the starvation issue (the flusher not being able to submit writes
without risking blocking) there too.

To avoid this starvation, ideally we need a per-bdi per-cgroup flusher, so
that the flusher can simply block if there are not enough request
descriptors left in the cgroup.
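
The per-group descriptor accounting and the congestion query the flusher would
use can be illustrated with a minimal sketch (the class and its fields are
hypothetical, modeled on but not taken from the block layer's nr_requests
accounting):

```python
class GroupQueue:
    """Per-cgroup request descriptor accounting, illustrative only."""
    def __init__(self, nr_requests=8):
        self.nr_requests = nr_requests
        # Declare the group congested once e.g. 7/8 descriptors are used.
        self.congested_at = nr_requests * 7 // 8
        self.in_flight = 0

    def congested(self):
        return self.in_flight >= self.congested_at

    def try_queue_write(self):
        """Flusher-side protocol: back off when this group is congested,
        so a backlogged group delays only its own writes, not others'."""
        if self.congested():
            return False  # flusher moves on to another group's inodes
        self.in_flight += 1
        return True

g = GroupQueue(nr_requests=8)                      # congested at 7 in flight
accepted = sum(g.try_queue_write() for _ in range(10))
```

Writes already queued in the group keep the device busy, which is why backing
off at the congestion threshold does not starve the group.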

So trying to throttle buffered writes synchronously in balance_dirty_pages()
at least simplifies the implementation.  I like my implementation better
than Fengguang's approach to throttling for the simple reason that buffered
writes and direct writes can be subjected to the same throttling limits
instead of separate limits for buffered writes.
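
The argument for a single shared limit can be pictured as one virtual-time
bucket per cgroup that both write paths charge (purely illustrative; the real
blk-throttle and balance_dirty_pages() logic is far more involved, and the
class below is a hypothetical name):

```python
class CgroupThrottle:
    """One rate limit per cgroup, in bytes per second. Buffered writers
    (throttled synchronously, as in balance_dirty_pages()) and direct IO
    writers charge the same limit, so one number governs both paths."""
    def __init__(self, rate_bps):
        self.rate_bps = rate_bps
        self.next_free = 0.0  # virtual time when the next write may start

    def throttle(self, now, nbytes):
        """Charge nbytes against the cgroup's rate and return how long
        the caller must sleep before proceeding."""
        start = max(now, self.next_free)
        self.next_free = start + nbytes / self.rate_bps
        return start - now

t = CgroupThrottle(rate_bps=1024 * 1024)           # 1 MiB/s for the cgroup
d1 = t.throttle(0.0, 512 * 1024)   # buffered write: proceeds immediately
d2 = t.throttle(0.0, 512 * 1024)   # direct write: waits behind the first
```

With separate limits for buffered writes, the two 512 KiB writes above would
each be charged against a different budget and the combined rate could exceed
the intended 1 MiB/s.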

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [RFC] writeback and cgroup
       [not found]                   ` <20120411154005.GD16692-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2012-04-11 15:45                     ` Vivek Goyal
@ 2012-04-11 19:22                     ` Jan Kara
  2012-04-14 12:25                     ` [Lsf] " Peter Zijlstra
  2 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-11 19:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Wed 11-04-12 11:40:05, Vivek Goyal wrote:
> On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
> 
> [..]
> > > I have implemented and posted patches for per bdi per cgroup congestion
> > > flag. The only problem I see with that is that a group might be congested
> > > for a long time because of lots of other IO happening (say direct IO) and
> > > if you keep on backing off and never submit the metadata IO (transaction),
> > > you get starved. And if you go ahead and submit IO in a congested group,
> > > we are back to serialization issue.
> >   Clearly, we mustn't throttle metadata IO once it gets to the block layer.
> > That's why we discuss throttling of processes at transaction start after
> > all. But I agree starvation is an issue - I originally thought blk-throttle
> > throttles synchronously which wouldn't have starvation issues. But when
> > that's not the case things are a bit more tricky. We could treat
> > transaction start as an IO of some size (since we already have some
> > estimation how large a transaction will be when we are starting it) and let
> > the transaction start only when our "virtual" IO would be submitted but
> > I feel that gets maybe too complicated... Maybe we could just delay the
> > transaction start by the amount reported from blk-throttle layer? Something
> > along your callback for throttling you implemented?
> 
> I think now I have lost you. It probably stems from the fact that I don't
> know much about transactions and filesystems.
>  
> So all the metadata IO will happen through the journaling thread, and that
> will be in the root group which should remain unthrottled. So any journal
> IO going to disk should remain unthrottled.
  Yes, that is true at least for ext3/ext4 or btrfs. In principle we don't
have to have the journal thread (as is the case with reiserfs, where a
random writer may end up doing the commit) but let's not complicate things
unnecessarily.

> Now, IIRC, the fsync problem with throttling was that we had opened a
> transaction but could not write it back to disk because we had to
> wait for all the cached data to go to disk (which is throttled). So
> my question is, can't we first wait for all the data to be flushed
> to disk and then open a transaction for the metadata? Metadata will be
> unthrottled, so the filesystem will not have to do any tricks like
> checking whether the bdi is congested or not.
  Actually, that's what's happening. We first do filemap_write_and_wait(),
which syncs all the data, and then we go and force a transaction commit to
make sure all metadata got to stable storage. The problem is that writeout
of data may need to allocate new blocks and that starts a transaction, and
while the transaction is started we may need to do some reads (e.g. of
bitmaps etc.) which may be throttled, and at that moment the whole
filesystem is blocked. I don't remember the stack traces you showed me so
I'm not sure if this is what you observed, but it's certainly one possible
scenario. The reason why fsync triggers problems is simply that it's the
only place where a process normally does a significant amount of writing. In
most cases the flusher thread / journal thread does it, so this effect is
not visible. And to preempt your question: it would be rather hard to avoid
IO while the transaction is started, due to locking.
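
The inversion described above can be put in rough numbers with a toy model
(all names and figures are hypothetical, not kernel APIs): once a throttled
read is issued while a transaction is open, the journal serializes behind it,
so the stall hits every task, not just the throttled cgroup.

```python
def transaction_stall(read_bytes, cgroup_rate_bps, other_tasks):
    """A read issued inside an open transaction completes only at the
    issuing cgroup's throttled rate; every task waiting on the journal
    stalls for that same time, regardless of its own cgroup's limits."""
    stall_secs = read_bytes / cgroup_rate_bps
    return {task: stall_secs for task in other_tasks}

# A 64 KiB bitmap read in a cgroup limited to 64 KiB/s stalls the whole
# filesystem for a full second:
stalls = transaction_stall(64 * 1024, 64 * 1024,
                           ["flusher", "journal", "reader"])
```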

> [..]
> > > I guess throttling at bdi layer will take care of network filesystem
> > > case too?
> >   Yes. At least for the client side. On the server side, Steve wants the
> > server to have insight into how much IO we could push in the future so
> > that it can limit the number of outstanding requests, if I understand him
> > right. I'm not sure we really want / are able to provide this amount of
> > knowledge to filesystems, even less userspace...
> 
> I am not sure what that means, but the server could simply query the bdi
> and read the configured rate; then it knows at what rate IO will go to
> disk and can make predictions about the future?
  Yeah, that would work if we had the current bandwidth for the current
cgroup exposed in the bdi.
 
> > > Also per bdi limit mechanism will not solve the issue of global throttling
> > > where in case of btrfs an IO might go to multiple bdi's. So throttling limits
> > > are not total but per bdi.
> >   Well, btrfs plays tricks with bdi's but there is a special bdi called
> > "btrfs" which backs the whole filesystem and that is what's put in
> > sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a
> > global bdi to work with.
> 
> Ok, that's good to know. How would we configure this special bdi? I am
> assuming there is no backing device visible in /sys/block/<device>/queue/?
> The same is true for network filesystems.
  Where should the backing device be visible? Now it's me who is lost :)

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [RFC] writeback and cgroup
  2012-04-11 17:23                         ` Vivek Goyal
  (?)
@ 2012-04-11 19:44                             ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-11 19:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Wed 11-04-12 13:23:11, Vivek Goyal wrote:
> On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote:
> > On Wed 11-04-12 11:45:31, Vivek Goyal wrote:
> > > On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote:
> > > > On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
> > > > 
> > > > [..]
> > > > > > I have implemented and posted patches for per bdi per cgroup congestion
> > > > > > flag. The only problem I see with that is that a group might be congested
> > > > > > for a long time because of lots of other IO happening (say direct IO) and
> > > > > > if you keep on backing off and never submit the metadata IO (transaction),
> > > > > > you get starved. And if you go ahead and submit IO in a congested group,
> > > > > > we are back to serialization issue.
> > > > >   Clearly, we mustn't throttle metadata IO once it gets to the block layer.
> > > > > That's why we discuss throttling of processes at transaction start after
> > > > > all. But I agree starvation is an issue - I originally thought blk-throttle
> > > > > throttles synchronously which wouldn't have starvation issues.
> > > 
> > > Current bio throttling is asynchronous. A process can submit the bio
> > > and go back and wait for the bio to finish. That bio will be queued at the
> > > device queue in a per-cgroup queue and will be dispatched to the device
> > > according to the configured IO rate for the cgroup.
> > > 
> > > The additional feature for buffered throttling (which never went upstream)
> > > was synchronous in nature. That is, we were actively putting the writer to
> > > sleep on a per-cgroup wait queue in the request queue and waking it up when
> > > it could do further IO based on cgroup limits.
> >   Hmm, but then there would be similar starvation issues as with my simple
> > scheme because async IO could always use the whole available bandwidth.
> 
> It depends on how the throttling logic decides to divide bandwidth between
> sync and async. I had chosen a round-robin policy of dispatching some sync
> bios and then allowing some async IO, etc. So async IO was not consuming
> the whole available bandwidth. We could easily tilt it in favor of sync IO
> with a tunable knob.
  Ah, OK.

> > Mixing of sync & async throttling is really problematic... I'm wondering
> > how useful the async throttling is.
> 
> If sync throttling is useful, then async throttling has to be useful too?
> Especially given the fact that async IO often consumes all bandwidth,
> impacting sync latencies.
  I wasn't clear enough, I guess. I meant to ask whether async throttling
brings some serious advantage over the sync one. And I think your answer is
that we want to have at least some IO prepared for submission to maintain
reasonable device utilization.

> > Because we will block on request
> > allocation once there are more than nr_requests pending requests so at that
> > point throttling becomes sync anyway.
> 
> First of all, the flushers will block on nr_requests, not the actual writers.
  Well, but as soon as you are going to do real IO (not just use the
cache), you can block - i.e. direct IO writers, or fsync, or readers can
block.

> And secondly, we thought of having per-group request descriptors so that
> writes of one group don't impact others. So once the writes of a group
> are backlogged, the flusher can query the congestion status of the group
> and not submit any more writes to that group. As some writes are already
> queued in that group, writes will not be starved. Well, in the case of
> deadline, even direct writes go into the write queue, so theoretically we
> can hit the starvation issue (the flusher not being able to submit writes
> without risking blocking) there too.
> 
> To avoid this starvation, ideally we need a per-bdi per-cgroup flusher, so
> that the flusher can simply block if there are not enough request
> descriptors left in the cgroup.
  Yeah, on one hand this would simplify some things, but on the other hand
you would possibly create a performance issue with interleaving IO from
different flusher threads (although that shouldn't be a big problem, because
they would work on disjoint sets of inodes and should submit large enough
chunks), and also fs-wide operations such as sync(2) would need some
thinking.

Actually, handling of sync(2) is interesting in its own right because if it
should obey the throttling limits of each cgroup whose inodes are written,
it may take a *really* long time to complete...
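
That cost can be estimated with a back-of-the-envelope model (hypothetical
numbers; this assumes the flusher works through one cgroup's dirty inodes at
a time, each written at that cgroup's own throttled rate):

```python
def sync_duration(cgroups):
    """Estimate sync(2) completion time, in seconds, as the sum of
    per-cgroup writeback times: dirty data (MiB) divided by the
    cgroup's throttle rate (MiB/s). One slow-throttled group with
    little dirty data can dominate the whole call."""
    return sum(dirty_mib / rate_mibps for dirty_mib, rate_mibps in cgroups)

# 100 MiB dirty at 100 MiB/s, plus a mere 10 MiB dirty in a cgroup
# limited to 1 MiB/s: the small group accounts for 10 of the 11 seconds.
secs = sync_duration([(100, 100), (10, 1)])
```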
 
> So trying to throttle buffered writes synchronously in balance_dirty_pages()
> at least simplifies the implementation.  I like my implementation better
> than Fengguang's approach to throttling for the simple reason that buffered
> writes and direct writes can be subjected to the same throttling limits
> instead of separate limits for buffered writes.
  I guess we all agree (including Fengguang) that this is desirable.
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-11 19:22                     ` Jan Kara
  (?)
@ 2012-04-12 20:37                         ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-12 20:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Wed, Apr 11, 2012 at 09:22:31PM +0200, Jan Kara wrote:

[..]
> > >   Well, btrfs plays tricks with bdi's but there is a special bdi called
> > > "btrfs" which backs the whole filesystem and that is what's put in
> > > sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a
> > > global bdi to work with.
> > 
> > Ok, that's good to know. How would we configure this special bdi? I am
> > assuming there is no backing device visible in /sys/block/<device>/queue/?
> > Same is true for network file systems.
>   Where should be the backing device visible? Now it's me who is lost :)

I mean, how are we supposed to set cgroup throttling rules through the cgroup
interface for network filesystems and for the btrfs global bdi? Using the
"dev_t" associated with the bdi? I see that all the bdis show up in
/sys/class/bdi, but how do I know which one I am interested in, or which
one belongs to the filesystem I want to put a throttling rule on?

For block devices, we simply use the "major:minor limit" format written to
a cgroup file, and this configuration will sit in one of the per-queue
per-cgroup data structures.

I am assuming that when you say throttling should happen at the bdi, you
are thinking of maintaining per-cgroup per-bdi data structures, and the user
is somehow supposed to pass "bdi_maj:bdi_min limit" through cgroup files?
If yes, how does one identify the filesystem's bdi we want to put rules on?

Also, at the request queue level we have bios, and we throttle bios. At the
bdi level, I think there are no bios yet, so somehow we have to deal with
pages. I'm not sure how exactly throttling would happen.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                         ` <20120412203719.GL2207-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-12 20:51                           ` Tejun Heo
  2012-04-15 11:37                           ` [Lsf] " Peter Zijlstra
  1 sibling, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-12 20:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

Hello, Vivek.

On Thu, Apr 12, 2012 at 04:37:19PM -0400, Vivek Goyal wrote:
> I mean, how are we supposed to set cgroup throttling rules through the cgroup
> interface for network filesystems and for the btrfs global bdi? Using the
> "dev_t" associated with the bdi? I see that all the bdis show up in
> /sys/class/bdi, but how do I know which one I am interested in, or which
> one belongs to the filesystem I want to put a throttling rule on?
> 
> For block devices, we simply use the "major:minor limit" format written to
> a cgroup file, and this configuration will sit in one of the per-queue
> per-cgroup data structures.
> 
> I am assuming that when you say throttling should happen at the bdi, you
> are thinking of maintaining per-cgroup per-bdi data structures, and the user
> is somehow supposed to pass "bdi_maj:bdi_min limit" through cgroup files?
> If yes, how does one identify the filesystem's bdi we want to put rules on?

I think you're worrying way too much.  One of the biggest reasons we
have layers and abstractions is to avoid worrying about everything
from everywhere.  Let the block device implement per-device limits.  Let
writeback work from the backpressure it gets from the relevant IO
channel - the bdi-cgroup combination in this case.

For stacked or combined devices, let the combining layer deal with
piping the congestion information.  If it's a per-file split, the
combined bdi can simply forward information from the matching
underlying device.  If the file is striped / duplicated somehow, the
*only* layer which knows what to do is, and should be, the layer
performing the striping and duplication.  There's no need to worry
about it from blkcg, and if you get the layering correct it isn't
difficult to slice such logic in between.  In fact, most of it
(backpressure propagation) would just happen as part of the usual
buffering between layers.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]     ` <20120404184909.GB29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
                         ` (2 preceding siblings ...)
  2012-04-05 16:38       ` Tejun Heo
@ 2012-04-14 11:53       ` Peter Zijlstra
  3 siblings, 0 replies; 262+ messages in thread
From: Peter Zijlstra @ 2012-04-14 11:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

On Wed, 2012-04-04 at 11:49 -0700, Tejun Heo wrote:
> > - How to handle NFS.
> 
> As said above, maybe through network based bdi pressure propagation,
> Maybe some other special case mechanism.  Unsure but I don't think
> this concern should dictate the whole design. 

NFS has a custom bdi implementation and implements congestion control
based on the number of outstanding writeback pages.

See fs/nfs/write.c:nfs_{set,end}_page_writeback

All !block based filesystems have their own BDI implementation; I'm not
sure about the congestion implementation of anything other than NFS, though.

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]       ` <CAH2r5mvP56D0y4mk5wKrJcj+=OZ0e0Q5No_L+9a8a=GMcEhRew-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-04-14 12:15         ` Peter Zijlstra
  0 siblings, 0 replies; 262+ messages in thread
From: Peter Zijlstra @ 2012-04-14 12:15 UTC (permalink / raw)
  To: Steve French
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Wed, 2012-04-04 at 14:23 -0500, Steve French wrote:
> Current use of bdi is a little hard to understand since
> there are 25+ fields in the structure.  

Filesystems only need a small fraction of those.

In particular,

  backing_dev_info::name	-- string
  backing_dev_info::ra_pages	-- number of read-ahead-pages
  backing_dev_info::capability	-- see BDI_CAP_*
  
One should properly initialize/destroy the thing using:

  bdi_init()/bdi_destroy()


Furthermore, it has hooks into the regular page-writeback stuff:

  test_{set,clear}_page_writeback()/bdi_writeout_inc()
  set_page_dirty()/account_page_dirtied()
  
but also allows filesystems to do custom stuff, see FUSE for example.

The only other bit is the pressure valve, aka
{set,clear}_bdi_congested(), which really is rather broken and of
dubious value.

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]                   ` <20120411154005.GD16692-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2012-04-11 15:45                     ` Vivek Goyal
  2012-04-11 19:22                     ` Jan Kara
@ 2012-04-14 12:25                     ` Peter Zijlstra
  2 siblings, 0 replies; 262+ messages in thread
From: Peter Zijlstra @ 2012-04-14 12:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote:
> 
> Ok, that's good to know. How would we configure this special bdi? I am
> assuming there is no backing device visible in /sys/block/<device>/queue/?
> Same is true for network file systems. 

root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done
ls: cannot access /sys/class/bdi/0:20/: No such file or directory
total 0
drwxr-xr-x  3 root root    0 2012-03-27 23:18 .
drwxr-xr-x 35 root root    0 2012-03-27 23:02 ..
-rw-r--r--  1 root root 4096 2012-04-14 14:22 max_ratio
-rw-r--r--  1 root root 4096 2012-04-14 14:22 min_ratio
drwxr-xr-x  2 root root    0 2012-04-14 14:22 power
-rw-r--r--  1 root root 4096 2012-04-14 14:22 read_ahead_kb
lrwxrwxrwx  1 root root    0 2012-03-27 23:18 subsystem -> ../../../../class/bdi
-rw-r--r--  1 root root 4096 2012-03-27 23:18 uevent

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-12 20:51                           ` Tejun Heo
@ 2012-04-14 14:36                               ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-14 14:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

[-- Attachment #1: Type: text/plain, Size: 4887 bytes --]

On Thu, Apr 12, 2012 at 01:51:48PM -0700, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Thu, Apr 12, 2012 at 04:37:19PM -0400, Vivek Goyal wrote:
> > I mean how are we supposed to put cgroup throttling rules using cgroup
> > interface for network filesystems and for btrfs global bdi. Using "dev_t"
> > associated with bdi? I see that all the bdi's are showing up in
> > /sys/class/bdi, but how do I know which one I am interested in or which
> > one belongs to the filesystem I am interested in putting a throttling rule on.
> > 
> > For block devices, we simply use "major:min limit" format to write to
> > a cgroup file and this configuration will sit in one of the per queue
> > per cgroup data structure.
> > 
> > I am assuming that when you say throttling should happen at bdi, you
> > are thinking of maintaining per cgroup per bdi data structures and user
> > is somehow supposed to pass "bdi_maj:bdi_min  limit" through cgroup files?
> > If yes, how does one map a filesystem's bdi we want to put rules on?
> 
> I think you're worrying way too much.  One of the biggest reasons we
> have layers and abstractions is to avoid worrying about everything
> from everywhere.  Let block device implement per-device limits.  Let
> writeback work from the backpressure it gets from the relevant IO
> channel, bdi-cgroup combination in this case.
> 
> For stacked or combined devices, let the combining layer deal with
> piping the congestion information.  If it's per-file split, the
> combined bdi can simply forward information from the matching
> underlying device.  If the file is striped / duplicated somehow, the
> *only* layer which knows what to do is and should be the layer
> performing the striping and duplication.  There's no need to worry
> about it from blkcg and if you get the layering correct it isn't
> difficult to slice such logic in between.  In fact, most of it
> (backpressure propagation) would just happen as part of the usual
> buffering between layers.

Yeah the backpressure idea would work nicely with all possible
intermediate stacking between the bdi and leaf devices. In my attempt
to do combined IO bandwidth control for

- buffered writes, in balance_dirty_pages()
- direct IO, in the cfq IO scheduler

I have had to dig into the cfq code over the past days to get an idea of
how the two throttling layers can cooperate (and to suffer the pains
arising from the layering violations). It's also rather tricky to get
two previously independent throttling mechanisms to work seamlessly
with each other to provide the desired _unified_ user interface. It
took a lot of reasoning and experimentation to work the basic scheme out...

But here is the first result. The attached graph shows progress of 4
tasks:
- cgroup A: 1 direct dd + 1 buffered dd
- cgroup B: 1 direct dd + 1 buffered dd

The 4 tasks are mostly progressing at the same pace. The top 2
smoother lines are for the buffered dirtiers. The bottom 2 lines are
for the direct writers. As you may notice, the two direct writers
stalled once or twice, which increases the gaps between the
lines. Otherwise, the algorithm is working as expected to distribute
the bandwidth to each task.

The current code's target is to satisfy the more realistic user demand
of distributing bandwidth equally to each cgroup, and inside each
cgroup, distribute bandwidth equally to buffered/direct writes. On top
of which, weights can be specified to change the default distribution.

The implementation involves adding "weight for direct IO" to the cfq
groups and "weight for buffered writes" to the root cgroup. Note that
the current cfq proportional IO controller does not offer explicit control
over the direct:buffered ratio.

When there are both direct/buffered writers in the cgroup,
balance_dirty_pages() will kick in and adjust the weights for cfq to
execute. Note that cfq will continue to send all flusher IOs to the
root cgroup.  balance_dirty_pages() will compute the overall async
weight for it so that in the above test case, the computed weights
will be

- 1000 async weight for the root cgroup (2 buffered dds)
- 500 dio weight for cgroup A (1 direct dd)
- 500 dio weight for cgroup B (1 direct dd)

The second graph shows result for another test case:
- cgroup A, weight 300: 1 buffered cp
- cgroup B, weight 600: 1 buffered dd + 1 direct dd
- cgroup C, weight 300: 1 direct dd
which is also working as expected.

Once the cfq properly grants total async IO share to the flusher,
balance_dirty_pages() will then do its original job of distributing
the buffered write bandwidth among the buffered dd tasks.

It will have to assume that the devices under the same bdi are
"symmetric". It also needs further stats feedback on IOPS or disk time
in order to do IOPS/time-based IO distribution. Anyway, it will be
interesting to see how far this scheme can go. I'll clean up the code
and post it soon.

Thanks,
Fengguang

[-- Attachment #2: balance_dirty_pages-task-bw.png --]
[-- Type: image/png, Size: 72619 bytes --]

[-- Attachment #3: balance_dirty_pages-task-bw.png --]
[-- Type: image/png, Size: 69646 bytes --]

[-- Attachment #4: Type: text/plain, Size: 205 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]                         ` <20120412203719.GL2207-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2012-04-12 20:51                           ` Tejun Heo
@ 2012-04-15 11:37                           ` Peter Zijlstra
  1 sibling, 0 replies; 262+ messages in thread
From: Peter Zijlstra @ 2012-04-15 11:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, 2012-04-12 at 16:37 -0400, Vivek Goyal wrote:
> If yes, how does one map a filesystem's bdi we want to put rules on?
> 
/proc/self/mountinfo has the required bits

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
  2012-04-14 11:53       ` Peter Zijlstra
  (?)
  (?)
@ 2012-04-16  1:25       ` Steve French
  -1 siblings, 0 replies; 262+ messages in thread
From: Steve French @ 2012-04-16  1:25 UTC (permalink / raw)
  To: linux-cifs-u79uwXL29TY76Z2rM5mHXA

This long thread on linux-mm and linux-fsdevel has been discussing
writeback, throttling, cgroups etc.  This post reminded me that we
should look more carefully at the cifs bdi implementation, compare it to
nfs's, and also check what needs to be improved in the bdi
implementation to handle smb2 credits.  It will be interesting to see
if that will help writeback.

On Sat, Apr 14, 2012 at 6:53 AM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
>
> On Wed, 2012-04-04 at 11:49 -0700, Tejun Heo wrote:
> > > - How to handle NFS.
> >
> > As said above, maybe through network based bdi pressure propagation,
> > Maybe some other special case mechanism.  Unsure but I don't think
> > this concern should dictate the whole design.
>
> NFS has a custom bdi implementation and implements congestion control
> based on the number of outstanding writeback pages.
>
> See fs/nfs/write.c:nfs_{set,end}_page_writeback
>
> All !block based filesystems have their own BDI implementation, I'm not
> sure on the congestion implementation of anything other than NFS though.
> _______________________________________________
> Lsf mailing list
> Lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/lsf




--
Thanks,

Steve

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
  2012-04-14 12:25                     ` Peter Zijlstra
  (?)
@ 2012-04-16 12:54                       ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-16 12:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Sat, Apr 14, 2012 at 02:25:14PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote:
> > 
> > Ok, that's good to know. How would we configure this special bdi? I am
> > assuming there is no backing device visible in /sys/block/<device>/queue/?
> > Same is true for network file systems. 
> 
> root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done
> ls: cannot access /sys/class/bdi/0:20/: No such file or directory
> total 0
> drwxr-xr-x  3 root root    0 2012-03-27 23:18 .
> drwxr-xr-x 35 root root    0 2012-03-27 23:02 ..
> -rw-r--r--  1 root root 4096 2012-04-14 14:22 max_ratio
> -rw-r--r--  1 root root 4096 2012-04-14 14:22 min_ratio
> drwxr-xr-x  2 root root    0 2012-04-14 14:22 power
> -rw-r--r--  1 root root 4096 2012-04-14 14:22 read_ahead_kb
> lrwxrwxrwx  1 root root    0 2012-03-27 23:18 subsystem -> ../../../../class/bdi
> -rw-r--r--  1 root root 4096 2012-03-27 23:18 uevent

Ok, got it. So /proc/self/mountinfo has the information about st_dev and
one can use that to reach the associated bdi. Thanks, Peter.

Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]                       ` <20120416125432.GB12776-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-16 13:07                         ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-16 13:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Apr 16, 2012 at 08:54:32AM -0400, Vivek Goyal wrote:
> On Sat, Apr 14, 2012 at 02:25:14PM +0200, Peter Zijlstra wrote:
> > On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote:
> > > 
> > > Ok, that's good to know. How would we configure this special bdi? I am
> > > assuming there is no backing device visible in /sys/block/<device>/queue/?
> > > Same is true for network file systems. 
> > 
> > root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done
> > ls: cannot access /sys/class/bdi/0:20/: No such file or directory
> > total 0
> > drwxr-xr-x  3 root root    0 2012-03-27 23:18 .
> > drwxr-xr-x 35 root root    0 2012-03-27 23:02 ..
> > -rw-r--r--  1 root root 4096 2012-04-14 14:22 max_ratio
> > -rw-r--r--  1 root root 4096 2012-04-14 14:22 min_ratio
> > drwxr-xr-x  2 root root    0 2012-04-14 14:22 power
> > -rw-r--r--  1 root root 4096 2012-04-14 14:22 read_ahead_kb
> > lrwxrwxrwx  1 root root    0 2012-03-27 23:18 subsystem -> ../../../../class/bdi
> > -rw-r--r--  1 root root 4096 2012-03-27 23:18 uevent
> 
> Ok, got it. So /proc/self/mountinfo has the information about st_dev and
> one can use that to reach to associated bdi. Thanks Peter.

Vivek, I noticed these lines in the cfq code

                sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);

Why not use bdi->dev->devt?  The problem is that dev_name() will
return "btrfs-X" for btrfs rather than "major:minor".

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [Lsf] [RFC] writeback and cgroup
  2012-04-16 13:07                         ` Fengguang Wu
  (?)
@ 2012-04-16 14:19                         ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-16 14:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Apr 16, 2012 at 09:07:07PM +0800, Fengguang Wu wrote:
> On Mon, Apr 16, 2012 at 08:54:32AM -0400, Vivek Goyal wrote:
> > On Sat, Apr 14, 2012 at 02:25:14PM +0200, Peter Zijlstra wrote:
> > > On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote:
> > > > 
> > > > Ok, that's good to know. How would we configure this special bdi? I am
> > > > assuming there is no backing device visible in /sys/block/<device>/queue/?
> > > > Same is true for network file systems. 
> > > 
> > > root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done
> > > ls: cannot access /sys/class/bdi/0:20/: No such file or directory
> > > total 0
> > > drwxr-xr-x  3 root root    0 2012-03-27 23:18 .
> > > drwxr-xr-x 35 root root    0 2012-03-27 23:02 ..
> > > -rw-r--r--  1 root root 4096 2012-04-14 14:22 max_ratio
> > > -rw-r--r--  1 root root 4096 2012-04-14 14:22 min_ratio
> > > drwxr-xr-x  2 root root    0 2012-04-14 14:22 power
> > > -rw-r--r--  1 root root 4096 2012-04-14 14:22 read_ahead_kb
> > > lrwxrwxrwx  1 root root    0 2012-03-27 23:18 subsystem -> ../../../../class/bdi
> > > -rw-r--r--  1 root root 4096 2012-03-27 23:18 uevent
> > 
> > Ok, got it. So /proc/self/mountinfo has the information about st_dev and
> > one can use that to reach to associated bdi. Thanks Peter.
> 
> Vivek, I noticed these lines in cfq code
> 
>                 sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> 
> Why not use bdi->dev->devt?  The problem is that dev_name() will
> return "btrfs-X" for btrfs rather than "major:minor".

Sorry, it's not that simple. btrfs reports its fake btrfs_fs_info.bdi
to the upper layers, which is different from the bdis of the
btrfs_fs_info.fs_devices.devices seen by cfq.

It's the fake btrfs bdi that is named "btrfs-X" by this function:

setup_bdi():
        bdi_setup_and_register(bdi, "btrfs", BDI_CAP_MAP_COPY);

This makes btrfs mountinfo hard to interpret, since you cannot
directly get the block device major/minor numbers from it:

35 16 0:26 / /fs/sda3 rw,relatime - btrfs /dev/sda3 rw,noacl

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [RFC] writeback and cgroup
  2012-04-14 14:36                               ` Fengguang Wu
  (?)
  (?)
@ 2012-04-16 14:57                               ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-16 14:57 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote:

[..]
> Yeah the backpressure idea would work nicely with all possible
> intermediate stacking between the bdi and leaf devices. In my attempt
> to do combined IO bandwidth control for
> 
> - buffered writes, in balance_dirty_pages()
> - direct IO, in the cfq IO scheduler
> 
> I have been looking into the cfq code over the past days to get an idea
> of how the two throttling layers can cooperate (and suffer the pains
> arising from the layering violations). It's also rather tricky to get
> two previously independent throttling mechanisms to work seamlessly
> with each other to provide the desired _unified_ user interface. It
> took a lot of reasoning and experiments to work the basic scheme out...
> 
> But here is the first result. The attached graph shows progress of 4
> tasks:
> - cgroup A: 1 direct dd + 1 buffered dd
> - cgroup B: 1 direct dd + 1 buffered dd
> 
> The 4 tasks are mostly progressing at the same pace. The top 2
> smoother lines are for the buffered dirtiers. The bottom 2 lines are
> for the direct writers. As you may notice, the two direct writers are
> somehow stalled for 1-2 times, which increases the gaps between the
> lines. Otherwise, the algorithm is working as expected to distribute
> the bandwidth to each task.
> 
> The current code's target is to satisfy the more realistic user demand
> of distributing bandwidth equally to each cgroup, and inside each
> cgroup, distribute bandwidth equally to buffered/direct writes. On top
> of which, weights can be specified to change the default distribution.
> 
> The implementation involves adding "weight for direct IO" to the cfq
> groups and "weight for buffered writes" to the root cgroup. Note that
> the current cfq proportional IO controller does not offer explicit
> control over the direct:buffered ratio.
> 
> When there are both direct/buffered writers in the cgroup,
> balance_dirty_pages() will kick in and adjust the weights for cfq to
> execute. Note that cfq will continue to send all flusher IOs to the
> root cgroup.  balance_dirty_pages() will compute the overall async
> weight for it so that in the above test case, the computed weights
> will be

I think having separate weights for sync IO groups and async IO is not
very appealing. There should be one notion of group weight, with
bandwidth distributed among groups according to their weight.

Now one can argue that within a group there could be one knob in CFQ
which allows changing the share of sync vs async IO.

Also Tejun and Jan have expressed the desire that once we have figured
out a way to communicate the submitter's context for async IO, we would
like to account that IO in associated cgroup instead of root cgroup (as
we do today).

> 
> - 1000 async weight for the root cgroup (2 buffered dds)
> - 500 dio weight for cgroup A (1 direct dd)
> - 500 dio weight for cgroup B (1 direct dd)
> 
> The second graph shows result for another test case:
> - cgroup A, weight 300: 1 buffered cp
> - cgroup B, weight 600: 1 buffered dd + 1 direct dd
> - cgroup C, weight 300: 1 direct dd
> which is also working as expected.
> 
> Once the cfq properly grants total async IO share to the flusher,
> balance_dirty_pages() will then do its original job of distributing
> the buffered write bandwidth among the buffered dd tasks.
> 
> It will have to assume that the devices under the same bdi are
> "symmetric". It also needs further stats feedback on IOPS or disk time
> in order to do IOPS/time based IO distribution. Anyway it would be
> interesting to see how far this scheme can go. I'll cleanup the code
> and post it soon.

Your proposal relies on a few things.

- Bandwidth needs to be divided equally among sync and async IO.
- Flusher thread async IO will always go to the root cgroup.
- I am not sure how this scheme is going to work when we introduce
  hierarchical blkio cgroups.
- cgroup weights for sync IO seem to be controlled by the user, while
  the root cgroup weight is silently controlled by this async IO
  logic.

Overall this sounds like a very odd design to me. I am not sure what we
are achieving by this. In the current scheme one should be able to just
adjust the weight of the root cgroup through the cgroup interface and
achieve the same results you are showing. So where is the need to change
it dynamically inside the kernel?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [Lsf] [RFC] writeback and cgroup
  2012-04-16 13:07                         ` Fengguang Wu
                                           ` (2 preceding siblings ...)
  (?)
@ 2012-04-16 15:52                         ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-16 15:52 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Apr 16, 2012 at 09:07:07PM +0800, Fengguang Wu wrote:

[..]
> Vivek, I noticed these lines in cfq code
> 
>                 sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> 
> Why not use bdi->dev->devt?  The problem is that dev_name() will
> return "btrfs-X" for btrfs rather than "major:minor".

Isn't bdi->dev->devt 0?  I see the following code.

add_disk()
   bdi_register_dev()
      bdi_register()
         device_create_vargs(MKDEV(0,0))
	      dev->devt = devt = MKDEV(0,0);

So for normal block devices, I think bdi->dev->devt will be zero, which
is probably why we don't use it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
  2012-04-16 15:52                           ` Vivek Goyal
  (?)
@ 2012-04-17  2:14                               ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-17  2:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: ctalbott, rni, andrea, containers, linux-kernel, lsf, linux-mm,
	jmoyer, linux-fsdevel, cgroups

On Mon, Apr 16, 2012 at 11:52:07AM -0400, Vivek Goyal wrote:
> On Mon, Apr 16, 2012 at 09:07:07PM +0800, Fengguang Wu wrote:
> 
> [..]
> > Vivek, I noticed these lines in cfq code
> > 
> >                 sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> > 
> > Why not use bdi->dev->devt?  The problem is that dev_name() will
> > return "btrfs-X" for btrfs rather than "major:minor".
> 
> Isn't bdi->dev->devt 0?  I see following code.
> 
> add_disk()
>    bdi_register_dev()
>       bdi_register()
>          device_create_vargs(MKDEV(0,0))
> 	      dev->devt = devt = MKDEV(0,0);
> 
> So for normal block devices, I think bdi->dev->devt will be zero, that's
> why probably we don't use it.

Yes indeed. I can confirm this with tracing. There are two main cases:

- some filesystems do not have a real device for the bdi.

- add_disk() calls bdi_register_dev() with the devt; however, this
  information is not passed down for some reason.
  device_create_vargs() will only try to create a sysfs dev file if
  the devt is not MKDEV(0,0).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                       ` <20120411170542.GB16008-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
  2012-04-11 17:23                         ` Vivek Goyal
@ 2012-04-17 21:48                         ` Tejun Heo
  1 sibling, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-17 21:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott, rni, andrea, containers, linux-kernel,
	sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel, cgroups,
	Fengguang Wu, Vivek Goyal

Hello,

On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote:
> > The additional feature for buffered throttle (which never went upstream)
> > was synchronous in nature. That is, we were actively putting the writer
> > to sleep on a per-cgroup wait queue in the request queue and waking it
> > up when it can do further IO based on cgroup limits.
>
>   Hmm, but then there would be similar starvation issues as with my simple
> scheme because async IO could always use the whole available bandwidth.
> Mixing of sync & async throttling is really problematic... I'm wondering
> how useful the async throttling is. Because we will block on request
> allocation once there are more than nr_requests pending requests so at that
> point throttling becomes sync anyway.

I haven't thought about the interface too much yet but, with the
synchronous wait at transaction start, we have information both ways -
i.e. the lower layer also knows that there are synchronous waiters.  At
the simplest, not allowing any more async IOs when sync writers exist
should solve the starvation issue.

As for priority inversion through the shared request pool, it is a
problem which needs to be solved regardless of how async IOs are
throttled.  I haven't determined to what extent yet, though.  Different
cgroups definitely need to be on separate pools, but do we also want to
distinguish sync and async, and what about ioprio?  Maybe we need a
hybrid approach with a larger common pool and reserved ones for each
class?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                       ` <20120411170542.GB16008-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
  2012-04-11 17:23                         ` Vivek Goyal
@ 2012-04-17 21:48                         ` Tejun Heo
  1 sibling, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-17 21:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vivek Goyal, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman,
	andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu,
	lizefan, containers, cgroups, ctalbott, rni, lsf

Hello,

On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote:
> > The additional feature for buffered throttle (which never went upstream),
> > was synchronous in nature. That is we were actively putting writer to
> > sleep on a per cgroup wait queue in the request queue and wake it up when
> > it can do further IO based on cgroup limits.
>
>   Hmm, but then there would be similar starvation issues as with my simple
> scheme because async IO could always use the whole available bandwidth.
> Mixing of sync & async throttling is really problematic... I'm wondering
> how useful the async throttling is. Because we will block on request
> allocation once there are more than nr_requests pending requests so at that
> point throttling becomes sync anyway.

I haven't thought about the interface too much yet but, with the
synchronous wait at transaction start, we have information both ways -
ie. lower layer also knows that there are synchrnous waiters.  At the
simplest, not allowing any more async IOs when sync writers exist
should solve the starvation issue.

As for priority inversion through shared request pool, it is a problem
which needs to be solved regardless of how async IOs are throttled.
I'm not determined to which extent yet tho.  Different cgroups
definitely need to be on separate pools but do we also want
distinguish sync and async and what about ioprio?  Maybe we need a
bybrid approach with larger common pool and reserved ones for each
class?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-17 21:48                         ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-17 21:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vivek Goyal, Fengguang Wu, Jens Axboe,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, sjayaraman-IBi9RG/b67k,
	andrea-oIIqvOZpAevzfdHfmsDf5w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	lizefan-hv44wF8Li93QT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hello,

On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote:
> > The additional feature for buffered throttle (which never went upstream),
> > was synchronous in nature. That is we were actively putting writer to
> > sleep on a per cgroup wait queue in the request queue and wake it up when
> > it can do further IO based on cgroup limits.
>
>   Hmm, but then there would be similar starvation issues as with my simple
> scheme because async IO could always use the whole available bandwidth.
> Mixing of sync & async throttling is really problematic... I'm wondering
> how useful the async throttling is. Because we will block on request
> allocation once there are more than nr_requests pending requests so at that
> point throttling becomes sync anyway.

I haven't thought about the interface too much yet but, with the
synchronous wait at transaction start, we have information both ways -
ie. lower layer also knows that there are synchrnous waiters.  At the
simplest, not allowing any more async IOs when sync writers exist
should solve the starvation issue.

As for priority inversion through shared request pool, it is a problem
which needs to be solved regardless of how async IOs are throttled.
I'm not determined to which extent yet tho.  Different cgroups
definitely need to be on separate pools but do we also want
distinguish sync and async and what about ioprio?  Maybe we need a
bybrid approach with larger common pool and reserved ones for each
class?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-17 21:48                         ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-17 21:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vivek Goyal, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman,
	andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu,
	lizefan, containers, cgroups, ctalbott, rni, lsf

Hello,

On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote:
> > The additional feature for buffered throttle (which never went upstream),
> > was synchronous in nature. That is we were actively putting writer to
> > sleep on a per cgroup wait queue in the request queue and wake it up when
> > it can do further IO based on cgroup limits.
>
>   Hmm, but then there would be similar starvation issues as with my simple
> scheme because async IO could always use the whole available bandwidth.
> Mixing of sync & async throttling is really problematic... I'm wondering
> how useful the async throttling is. Because we will block on request
> allocation once there are more than nr_requests pending requests so at that
> point throttling becomes sync anyway.

I haven't thought about the interface too much yet but, with the
synchronous wait at transaction start, we have information both ways -
ie. lower layer also knows that there are synchrnous waiters.  At the
simplest, not allowing any more async IOs when sync writers exist
should solve the starvation issue.

As for priority inversion through shared request pool, it is a problem
which needs to be solved regardless of how async IOs are throttled.
I'm not determined to which extent yet tho.  Different cgroups
definitely need to be on separate pools but do we also want
distinguish sync and async and what about ioprio?  Maybe we need a
bybrid approach with larger common pool and reserved ones for each
class?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                     ` <20120411192231.GF16008-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
  2012-04-12 20:37                         ` Vivek Goyal
@ 2012-04-17 22:01                       ` Tejun Heo
  1 sibling, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-17 22:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott, rni, andrea, containers, linux-kernel,
	sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel, cgroups,
	Fengguang Wu, Vivek Goyal

Hello,

On Wed, Apr 11, 2012 at 09:22:31PM +0200, Jan Kara wrote:
> > So all the metadata IO will happen through the journaling thread and
> > that will be in the root group, which should remain unthrottled. So
> > any journal IO going to disk should remain unthrottled.
>
>   Yes, that is true at least for ext3/ext4 or btrfs. In principle we don't
> have to have the journal thread (as is the case of reiserfs where random
> writer may end up doing commit) but let's not complicate things
> unnecessarily.

Why can't journal entries keep track of the originator so that bios
can be attributed to the originator while committing?  That shouldn't
be too difficult to implement, no?

> > Now, IIRC, fsync problem with throttling was that we had opened a
> > transaction but could not write it back to disk because we had to
> > wait for all the cached data to go to disk (which is throttled). So
> > my question is, can't we first wait for all the data to be flushed
> > to disk and then open a transaction for metadata. metadata will be
> > unthrottled so filesystem will not have to do any tricks like bdi is
> > congested or not.
>
>   Actually that's what's happening. We first do filemap_write_and_wait()
> which syncs all the data and then we go and force transaction commit to
> make sure all metadata got to stable storage. The problem is that writeout
> of data may need to allocate new blocks and that starts a transaction and
> while the transaction is started we may need to do some reads (e.g. of
> bitmaps etc.) which may be throttled and at that moment the whole
> filesystem is blocked. I don't remember the stack traces you showed me so
> I'm not sure if this is what you observed but it's certainly one possible
> scenario. The reason why fsync triggers problems is simply that it's the
> only place where a process normally does a significant amount of writing.
> In most cases the flusher thread / journal thread does it so this effect
> is not visible. And to pre-empt your question, it would be rather hard to
> avoid IO while the transaction is started due to locking.

Probably we should mark all IOs issued inside transaction as META (or
whatever which tells blkcg to avoid throttling it).  We're gonna need
overcharging for metadata writes anyway, so I don't think this will
make too much of a difference.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-06  9:59         ` Fengguang Wu
  (?)
  (?)
@ 2012-04-17 22:38         ` Tejun Heo
  -1 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-17 22:38 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers,
	linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel,
	cgroups, vgoyal

Hello, Fengguang.

On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote:
> Fortunately, the above gap can be easily filled judging from the
> block/cfq IO controller code. By adding some direct IO accounting
> and changing several lines of my patches to make use of the collected
> stats, the semantics of the blkio.throttle.write_bps interfaces can be
> changed from "limit for direct IO" to "limit for direct+buffered IOs".
> Ditto for blkio.weight and blkio.write_iops, as long as some
> iops/device time stats are made available to balance_dirty_pages().
> 
> It would be a fairly *easy* change. :-) It's merely adding some
> accounting code and there is no need to change the block IO
> controlling algorithm at all. I'll do the work of accounting (which
> is basically independent of the IO controlling) and use the new stats
> in balance_dirty_pages().

I don't really understand how this can work.  For hard limits, maybe,
but for proportional IO, you have to know which cgroups have IOs
before assigning the proportions, so blkcg assigning IO bandwidth
without knowing about async writes simply can't work.

For example, let's say cgroups A and B have 2:8 split.  If A has IOs
on queue and B doesn't, blkcg will assign all IO bandwidth to A.  I
can't wrap my head around how writeback is gonna make use of the
resulting stats but let's say it decides it needs to put out some IOs
out for both cgroups.  What happens then?  Do all the async writes go
through the root cgroup, controlled by and affecting the ratio between
rootcg and cgroups A and B?  Or do they have to be accounted as part of
cgroups A and B?  If so, what if the added bandwidth goes over the
limit?  Say we implement overcharging; then I suppose we'll have to
communicate that upwards too, right?

This is still easy.  What about hierarchical propio?  What happens
then?  You can't do hierarchical proportional allocation without
knowing how much IOs are pending for which group.  How is that
information gonna be communicated between blkcg and writeback?  Are we
gonna have two separate hierarchical proportional IO allocators?  How
is that gonna work at all?  If we're gonna have single allocator in
block layer, writeback would have to feed the amount of IOs it may
generate into the allocator, get the resulting allocation and then
issue IO and then block layer again will have to account these to the
originating cgroups.  It's just crazy.

> The only problem I can see now, is that balance_dirty_pages() works
> per-bdi and blkcg works per-device. So the two ends may not match
> nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where
> sdb is shared by lv0 and lv1. However it should be rare situations and
> be much more acceptable than the problems arise from the "push back"
> approach which impacts everyone.

I don't know.  What problems?  AFAICS, the biggest issue is writeback
of different inodes getting mixed resulting in poor performance, but
if you think about it, that's about the frequency of switching cgroups
and a problem which can and should be dealt with from block layer
(e.g. use larger time slice if all the pending IOs are async).

Writeback's duty is generating stream of async writes which can be
served efficiently for the *cgroup* and keeping the buffer filled as
necessary and chaining the backpressure from there to the actual
dirtier.  That's what writeback does without cgroup.  Nothing
fundamental changes with cgroup.  It's just finer grained.

> > No, no, it's not about standing in my way.  As Vivek said in the other
> > reply, it's that the "gap" that you filled was created *because*
> > writeback wasn't cgroup aware and now you're in turn filling that gap
> > by making writeback work around that "gap".  I mean, my mind boggles.
> > Doesn't yours?  I strongly believe everyone's should.
> 
> Heh. It's a hard problem indeed. I felt great pains in the IO-less
> dirty throttling work. I did a lot reasoning about it, and have in
> fact kept cgroup IO controller in mind since its early days. Now I'd
> say it's hands down for it to adapt to the gap between the total IO
> limit and what's carried out by the block IO controller.

You're not providing any valid counter arguments about the issues
being raised about the messed up design.  How is anything "hands down"
here?

> > There's where I'm confused.  How is the said split supposed to work?
> > They aren't independent.  I mean, who gets to decide what and where
> > are those decisions enforced?
> 
> Yeah it's not independent. It's about
> 
> - keep block IO cgroup untouched (in its current algorithm, for
>   throttling direct IO)
> 
> - let balance_dirty_pages() adapt to the throttling target
>   
>         buffered_write_limit = total_limit - direct_IOs

Think about proportional allocation.  You don't have a number until
you know who have pending IOs and how much.

> To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> It's always there doing 1:1 proportional throttling. Then you try to
> kick in to add *double* throttling in block/cfq layer. Now the low
> layer may enforce 10:1 throttling and push balance_dirty_pages() away
> from its balanced state, leading to large fluctuations and program
> stalls.

Just do the same 1:1 inside each cgroup.

>  This can be avoided by telling balance_dirty_pages(): "your
> balance goal is no longer 1:1, but 10:1". With this information
> balance_dirty_pages() will behave right. Then there is the question:
> if balance_dirty_pages() will work just well provided the information,
> why bother doing the throttling at low layer and "push back" the
> pressure all the way up?

Because splitting a resource into two pieces arbitrarily with
different amount of consumptions on each side and then applying the
same proportion on both doesn't mean anything?

> The balance_dirty_pages() is already deeply involved in dirty throttling.
> As you can see from this patchset, the same algorithms can be extended
> trivially to work with cgroup IO limits.
> 
> buffered write IO controller in balance_dirty_pages()
> https://lkml.org/lkml/2012/3/28/275

It is half broken thing with fundamental design flaws which can't be
corrected without complete reimplementation.  I don't know what to
say.

> In the "back pressure" scheme, memcg is a must because only it has all
> the infrastructure to track dirty pages upon which you can apply some
> dirty_limits. Don't tell me you want to account dirty pages in blkcg...

For now, per-inode tracking seems good enough.

> What I can see is, it looks pretty simple and nature to let
> balance_dirty_pages() fill the gap towards a total solution :-)
> 
> - add direct IO accounting in some convenient point of the IO path
>   IO submission or completion point, either is fine.
> 
> - change several lines of the buffered write IO controller to
>   integrate the direct IO rate into the formula to fit the "total
>   IO" limit
> 
> - in future, add more accounting as well as feedback control to make
>   balance_dirty_pages() work with IOPS and disk time

To me, you seem to be not addressing the issues I've been raising at
all and just repeating the same points again and again.  If I'm
misunderstanding something, please point out.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-06  9:59         ` Fengguang Wu
  (?)
@ 2012-04-17 22:38           ` Tejun Heo
  -1 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-17 22:38 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hello, Fengguang.

On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote:
> Fortunately, the above gap can be easily filled judging from the
> block/cfq IO controller code. By adding some direct IO accounting
> and changing several lines of my patches to make use of the collected
> stats, the semantics of the blkio.throttle.write_bps interfaces can be
> changed from "limit for direct IO" to "limit for direct+buffered IOs".
> Ditto for blkio.weight and blkio.write_iops, as long as some
> iops/device time stats are made available to balance_dirty_pages().
> 
> It would be a fairly *easy* change. :-) It's merely adding some
> accounting code and there is no need to change the block IO
> controlling algorithm at all. I'll do the work of accounting (which
> is basically independent of the IO controlling) and use the new stats
> in balance_dirty_pages().

I don't really understand how this can work.  For hard limits, maybe,
but for proportional IO, you have to know which cgroups have IOs
before assigning the proportions, so blkcg assigning IO bandwidth
without knowing async writes simply can't work.

For example, let's say cgroups A and B have 2:8 split.  If A has IOs
on queue and B doesn't, blkcg will assign all IO bandwidth to A.  I
can't wrap my head around how writeback is gonna make use of the
resulting stats, but let's say it decides it needs to put out some IOs
for both cgroups.  What happens then?  Do all the async writes go
through the root cgroup, controlled by and affecting the ratio between
the root cgroup and cgroups A and B?  Or do they have to be accounted as part of
cgroups A and B?  If so, what if the added bandwidth goes over the
limit?  Let's say if we implement overcharge; then, I suppose we'll
have to communicate that upwards too, right?
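The scheduling behavior described above can be sketched in a few lines.  This
is not cfq's actual slice accounting, just the principle that weight only
matters among groups that currently have IOs queued:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of how a proportional IO scheduler divides bandwidth:
 * only groups with IOs actually queued participate in the split, so
 * a group's weight buys it nothing while its queue is empty. */
struct toy_grp { int weight; int nr_queued; };

/* Return group i's bandwidth share, in KB/s, out of `total` KB/s. */
long toy_share(const struct toy_grp *g, size_t n, size_t i, long total)
{
	long active_weight = 0;

	for (size_t j = 0; j < n; j++)
		if (g[j].nr_queued > 0)
			active_weight += g[j].weight;
	if (g[i].nr_queued == 0 || active_weight == 0)
		return 0;
	return total * g[i].weight / active_weight;
}
```

With B idle, its 8/10 weight buys it nothing and A gets the whole device; the
moment B queues IO, the 2:8 split reappears.  That is exactly why the
allocator has to know about pending async writes to include them in the split.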

This is still easy.  What about hierarchical propio?  What happens
then?  You can't do hierarchical proportional allocation without
knowing how many IOs are pending for which group.  How is that
information gonna be communicated between blkcg and writeback?  Are we
gonna have two separate hierarchical proportional IO allocators?  How
is that gonna work at all?  If we're gonna have a single allocator in
the block layer, writeback would have to feed the amount of IO it may
generate into the allocator, get the resulting allocation, issue the
IO, and then the block layer will again have to account it to the
originating cgroups.  It's just crazy.

> The only problem I can see now, is that balance_dirty_pages() works
> per-bdi and blkcg works per-device. So the two ends may not match
> nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where
> sdb is shared by lv0 and lv1. However, these should be rare situations
> and much more acceptable than the problems arising from the "push back"
> approach, which impacts everyone.

I don't know.  What problems?  AFAICS, the biggest issue is writeback
of different inodes getting mixed, resulting in poor performance, but
if you think about it, that's about the frequency of switching cgroups
and a problem which can and should be dealt with from block layer
(e.g. use larger time slice if all the pending IOs are async).

Writeback's duty is generating a stream of async writes which can be
served efficiently for the *cgroup*, keeping the buffer filled as
necessary, and chaining the backpressure from there to the actual
dirtier.  That's what writeback does without cgroup.  Nothing
fundamental changes with cgroup.  It's just finer grained.

> > No, no, it's not about standing in my way.  As Vivek said in the other
> > reply, it's that the "gap" that you filled was created *because*
> > writeback wasn't cgroup aware and now you're in turn filling that gap
> > by making writeback work around that "gap".  I mean, my mind boggles.
> > Doesn't yours?  I strongly believe everyone's should.
> 
> Heh. It's a hard problem indeed. I felt great pains in the IO-less
> dirty throttling work. I did a lot reasoning about it, and have in
> fact kept cgroup IO controller in mind since its early days. Now I'd
> say it's hands down for it to adapt to the gap between the total IO
> limit and what's carried out by the block IO controller.

You're not providing any valid counterarguments to the issues
being raised about the messed-up design.  How is anything "hands down"
here?

> > There's where I'm confused.  How is the said split supposed to work?
> > They aren't independent.  I mean, who gets to decide what and where
> > are those decisions enforced?
> 
> Yeah it's not independent. It's about
> 
> - keep block IO cgroup untouched (in its current algorithm, for
>   throttling direct IO)
> 
> - let balance_dirty_pages() adapt to the throttling target
>   
>         buffered_write_limit = total_limit - direct_IOs

Think about proportional allocation.  You don't have a number until
you know who has pending IOs and how much.
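For contrast, the adaptation the quoted proposal amounts to could be sketched
like this (illustrative names only, not kernel symbols).  The objection above
is that under proportional allocation there is no fixed `total_bps_limit`
number to subtract from until the allocator knows which groups have IOs
pending:

```c
#include <assert.h>

/* Sketch of the proposed adaptation: balance_dirty_pages() would aim
 * buffered-write throughput at whatever is left of the cgroup's total
 * bps limit after the observed direct IO rate.  Hypothetical helper;
 * units are bytes per second. */
long buffered_write_target(long total_bps_limit, long direct_io_bps)
{
	long target = total_bps_limit - direct_io_bps;

	return target > 0 ? target : 0;	/* direct IO may eat the whole budget */
}
```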

> To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> It's always there doing 1:1 proportional throttling. Then you try to
> kick in to add *double* throttling in block/cfq layer. Now the low
> layer may enforce 10:1 throttling and push balance_dirty_pages() away
> from its balanced state, leading to large fluctuations and program
> stalls.

Just do the same 1:1 inside each cgroup.

>  This can be avoided by telling balance_dirty_pages(): "your
> balance goal is no longer 1:1, but 10:1". With this information
> balance_dirty_pages() will behave right. Then there is the question:
> if balance_dirty_pages() will work just well provided the information,
> why bother doing the throttling at low layer and "push back" the
> pressure all the way up?

Because splitting a resource into two pieces arbitrarily, with
different amounts of consumption on each side, and then applying the
same proportion on both doesn't mean anything?
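A worked example of why: give cgroups A and B a 2:8 weight, split the
device's 100000 KB/s arbitrarily into a 50000 KB/s direct half and a
50000 KB/s buffered half, and suppose only A issues direct IO (all numbers
made up for illustration):

```c
#include <assert.h>

/* Why throttling two arbitrary halves of the device's bandwidth with
 * the same weights does not yield those weights overall.  A is alone
 * in the direct half, so it takes all of it; the buffered half splits
 * 2:8 between A and B. */
long combined_share_a(void)
{
	long direct_a   = 50000;		/* A alone in the direct half */
	long buffered_a = 50000 * 2 / 10;	/* 2:8 split of buffered half */

	return direct_a + buffered_a;		/* 60000 of 100000 total */
}
```

A walks away with 60% of the device despite a 20% weight; applying the same
proportion independently to each half says nothing about the overall split.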

> The balance_dirty_pages() is already deeply involved in dirty throttling.
> As you can see from this patchset, the same algorithms can be extended
> trivially to work with cgroup IO limits.
> 
> buffered write IO controller in balance_dirty_pages()
> https://lkml.org/lkml/2012/3/28/275

It is a half-broken thing with fundamental design flaws which can't be
corrected without a complete reimplementation.  I don't know what to
say.

> In the "back pressure" scheme, memcg is a must because only it has all
> the infrastructure to track dirty pages upon which you can apply some
> dirty_limits. Don't tell me you want to account dirty pages in blkcg...

For now, per-inode tracking seems good enough.

> What I can see is, it looks pretty simple and natural to let
> balance_dirty_pages() fill the gap towards a total solution :-)
> 
> - add direct IO accounting at some convenient point of the IO path
>   (IO submission or completion point; either is fine)
> 
> - change several lines of the buffered write IO controller to
>   integrate the direct IO rate into the formula to fit the "total
>   IO" limit
> 
> - in future, add more accounting as well as feedback control to make
>   balance_dirty_pages() work with IOPS and disk time

To me, you seem not to be addressing the issues I've been raising at
all and just repeating the same points again and again.  If I'm
misunderstanding something, please point out.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                       ` <20120417220106.GF19975-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-18  6:30                         ` Jan Kara
  0 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-18  6:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers,
	linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel,
	cgroups, Fengguang Wu, Vivek Goyal

  Hello,

On Tue 17-04-12 15:01:06, Tejun Heo wrote:
> On Wed, Apr 11, 2012 at 09:22:31PM +0200, Jan Kara wrote:
> > > So all the metadata IO will happen thorough journaling thread and that
> > > will be in root group which should remain unthrottled. So any journal
> > > IO going to disk should remain unthrottled.
> >
> >   Yes, that is true at least for ext3/ext4 or btrfs. In principle we don't
> > have to have the journal thread (as is the case of reiserfs where random
> > writer may end up doing commit) but let's not complicate things
> > unnecessarily.
> 
> Why can't journal entries keep track of the originator so that bios
> can be attributed to the originator while committing?  That shouldn't
> be too difficult to implement, no?
  I think I was just describing the current state but yes, in future we
can track which cgroup first attached a buffer to a transaction.
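Such tracking could look roughly like the sketch below; none of these names
are real jbd2 fields, it only illustrates the "first attacher wins"
attribution being discussed:

```c
#include <assert.h>

/* Hypothetical per-journal-buffer bookkeeping: remember which cgroup
 * first attached the buffer to the running transaction, so that
 * commit-time bios can be attributed back to that originator. */
struct jh_sketch {
	unsigned long	b_blocknr;	/* block this buffer covers */
	int		b_origin_cgid;	/* cgroup id of first dirtier, 0 = none */
};

void jh_attach(struct jh_sketch *jh, int current_cgid)
{
	if (jh->b_origin_cgid == 0)	/* only the first attach counts */
		jh->b_origin_cgid = current_cgid;
}
```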

> > > Now, IIRC, fsync problem with throttling was that we had opened a
> > > transaction but could not write it back to disk because we had to
> > > wait for all the cached data to go to disk (which is throttled). So
> > > my question is, can't we first wait for all the data to be flushed
> > > to disk and then open a transaction for metadata. metadata will be
> > > unthrottled so filesystem will not have to do any tricks like bdi is
> > > congested or not.
> >
> >   Actually that's what's happening. We first do filemap_write_and_wait()
> > which syncs all the data and then we go and force transaction commit to
> > make sure all metadata got to stable storage. The problem is that writeout
> > of data may need to allocate new blocks and that starts a transaction and
> > while the transaction is started we may need to do some reads (e.g. of
> > bitmaps etc.) which may be throttled and at that moment the whole
> > filesystem is blocked. I don't remember the stack traces you showed me so
> > I'm not sure if this is what you observed, but it's certainly one possible
> > scenario. The reason why fsync triggers problems is simply that it's the
> > only place where a process normally does a significant amount of writing. In
> > most cases the flusher thread / journal thread do it, so this effect is not
> > visible. And to preempt your question, it would be rather hard to avoid IO
> > while the transaction is running, due to locking.
> 
> Probably we should mark all IOs issued inside a transaction as META (or
> whatever tells blkcg to avoid throttling them).  We're gonna need
> overcharging for metadata writes anyway, so I don't think this will
> make too much of a difference.
  Agreed.

								Honza

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [RFC] writeback and cgroup
  2012-04-06  9:59         ` Fengguang Wu
                           ` (4 preceding siblings ...)
  (?)
@ 2012-04-18  6:57         ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-18  6:57 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers,
	linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups,
	Tejun Heo, linux-fsdevel, vgoyal

On Fri 06-04-12 02:59:34, Wu Fengguang wrote:
...
> > > > Let's please keep the layering clear.  IO limitations will be applied
> > > > at the block layer and pressure will be formed there and then
> > > > propagated upwards eventually to the originator.  Sure, exposing the
> > > > whole information might result in better behavior for certain
> > > > workloads, but down the road, say, in three or five years, devices
> > > > which can be shared without worrying too much about seeks might be
> > > > commonplace and we could be swearing at a disgusting structural mess,
> > > > and sadly various cgroup support seems to be a prominent source of
> > > > such design failures.
> > > 
> > > Super fast storage is coming, which will make us regret making the
> > > IO path overly complex.  Spinning disks are not going away anytime soon.
> > > I doubt Google is willing to afford the disk seek costs on its
> > > millions of disks, or has the patience to wait until all of the
> > > spinning disks are switched to SSDs years from now (if that ever happens).
> > 
> > This is new.  Let's keep the damn employer out of the discussion.
> > While the area I work on is affected by my employment (writeback isn't
> > even my area BTW), I'm not gonna do something adverse to upstream even
> > if it's beneficial to google and I'm much more likely to do something
> > which may hurt google a bit if it's gonna benefit upstream.
> > 
> > As for the faster / newer storage argument, that is *exactly* why we
> > want to keep the layering proper.  Writeback works from the pressure
> > from the IO stack.  If IO technology changes, we update the IO stack
> > and writeback still works from the pressure.  It may need to be
> > adjusted but the principles don't change.
> 
> To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> It's always there doing 1:1 proportional throttling. Then you try to
> kick in and add *double* throttling in the block/cfq layer. Now the low
> layer may enforce 10:1 throttling and push balance_dirty_pages() away
> from its balanced state, leading to large fluctuations and program
> stalls.  This can be avoided by telling balance_dirty_pages(): "your
> balance goal is no longer 1:1, but 10:1". With this information
> balance_dirty_pages() will behave correctly. Then there is the question:
> if balance_dirty_pages() works just as well given that information,
> why bother doing the throttling at the low layer and "pushing back" the
> pressure all the way up?
  Fengguang, maybe we should first agree on some basics:
  The two main goals of balance_dirty_pages() are (and always have been,
AFAIK) to limit the amount of dirty pages in memory and to keep enough
dirty pages in memory to allow for efficient writeback. A secondary goal is
to keep the amount of dirty pages somewhat fair among bdis and processes.
Agreed?

Thus a shift to trying to control *IO throughput* (or even just buffered
write throughput) from balance_dirty_pages() is a fundamental shift in the
goals of balance_dirty_pages(), not just a tweak (although technically,
it might be relatively easy to do for buffered writes given the current
implementation).

...
> > Well, I tried and I hope some of it got through.  I also wrote a lot
> > of questions, mainly regarding how what you have in mind is supposed
> > to work through what path.  Maybe I'm just not seeing what you're
> > seeing but I just can't see where all the IOs would go through and
> > come together.  Can you please elaborate more on that?
> 
> What I can see is, it looks pretty simple and natural to let
> balance_dirty_pages() fill the gap towards a total solution :-)
> 
> - add direct IO accounting at some convenient point of the IO path;
>   the IO submission or completion point, either is fine.
> 
> - change several lines of the buffered write IO controller to
>   integrate the direct IO rate into the formula to fit the "total
>   IO" limit
> 
> - in the future, add more accounting as well as feedback control to make
>   balance_dirty_pages() work with IOPS and disk time
  Sorry Fengguang but I also think this is a wrong way to go.
balance_dirty_pages() must primarily control the amount of dirty pages.
Trying to bend it to control IO throughput by including direct IO and
reads in the accounting will just make the logic even more complex than it
already is.

								Honza

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [RFC] writeback and cgroup
       [not found]           ` <20120418065720.GA21485-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
@ 2012-04-18  7:58             ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-18  7:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott, rni, andrea, containers, linux-kernel,
	sjayaraman, lsf, linux-mm, jmoyer, cgroups, Tejun Heo,
	linux-fsdevel, vgoyal

On Wed, Apr 18, 2012 at 08:57:20AM +0200, Jan Kara wrote:
> On Fri 06-04-12 02:59:34, Wu Fengguang wrote:
> ...
> > > > > Let's please keep the layering clear.  IO limitations will be applied
> > > > > at the block layer and pressure will be formed there and then
> > > > > propagated upwards eventually to the originator.  Sure, exposing the
> > > > > whole information might result in better behavior for certain
> > > > > workloads, but down the road, say, in three or five years, devices
> > > > > which can be shared without worrying too much about seeks might be
> > > > > commonplace and we could be swearing at a disgusting structural mess,
> > > > > and sadly various cgroup support seems to be a prominent source of
> > > > > such design failures.
> > > > 
> > > > Super fast storage is coming, which will make us regret making the
> > > > IO path overly complex.  Spinning disks are not going away anytime soon.
> > > > I doubt Google is willing to afford the disk seek costs on its
> > > > millions of disks, or has the patience to wait until all of the
> > > > spinning disks are switched to SSDs years from now (if that ever happens).
> > > 
> > > This is new.  Let's keep the damn employer out of the discussion.
> > > While the area I work on is affected by my employment (writeback isn't
> > > even my area BTW), I'm not gonna do something adverse to upstream even
> > > if it's beneficial to google and I'm much more likely to do something
> > > which may hurt google a bit if it's gonna benefit upstream.
> > > 
> > > As for the faster / newer storage argument, that is *exactly* why we
> > > want to keep the layering proper.  Writeback works from the pressure
> > > from the IO stack.  If IO technology changes, we update the IO stack
> > > and writeback still works from the pressure.  It may need to be
> > > adjusted but the principles don't change.
> > 
> > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > It's always there doing 1:1 proportional throttling. Then you try to
> > kick in and add *double* throttling in the block/cfq layer. Now the low
> > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > from its balanced state, leading to large fluctuations and program
> > stalls.  This can be avoided by telling balance_dirty_pages(): "your
> > balance goal is no longer 1:1, but 10:1". With this information
> > balance_dirty_pages() will behave correctly. Then there is the question:
> > if balance_dirty_pages() works just as well given that information,
> > why bother doing the throttling at the low layer and "pushing back" the
> > pressure all the way up?
>   Fengguang, maybe we should first agree on some basics:
>   The two main goals of balance_dirty_pages() are (and always have been,
> AFAIK) to limit the amount of dirty pages in memory and to keep enough
> dirty pages in memory to allow for efficient writeback. A secondary goal
> is to keep the amount of dirty pages somewhat fair among bdis and
> processes. Agreed?

Agreed. In fact, before the IO-less change, balance_dirty_pages() had
little explicit control over the dirty rate or fairness.

> Thus a shift to trying to control *IO throughput* (or even just buffered
> write throughput) from balance_dirty_pages() is a fundamental shift in the
> goals of balance_dirty_pages(), not just a tweak (although technically,
> it might be relatively easy to do for buffered writes given the current
> implementation).

Yes, it has been a big shift to rate-based dirty control.

> ...
> > > Well, I tried and I hope some of it got through.  I also wrote a lot
> > > of questions, mainly regarding how what you have in mind is supposed
> > > to work through what path.  Maybe I'm just not seeing what you're
> > > seeing but I just can't see where all the IOs would go through and
> > > come together.  Can you please elaborate more on that?
> > 
> > What I can see is, it looks pretty simple and natural to let
> > balance_dirty_pages() fill the gap towards a total solution :-)
> > 
> > - add direct IO accounting at some convenient point of the IO path;
> >   the IO submission or completion point, either is fine.
> > 
> > - change several lines of the buffered write IO controller to
> >   integrate the direct IO rate into the formula to fit the "total
> >   IO" limit
> > 
> > - in the future, add more accounting as well as feedback control to make
> >   balance_dirty_pages() work with IOPS and disk time
>   Sorry Fengguang but I also think this is a wrong way to go.
> balance_dirty_pages() must primarily control the amount of dirty pages.
> Trying to bend it to control IO throughput by including direct IO and
> reads in the accounting will just make the logic even more complex than it
> already is.

Right, I have been adding too much complexity to balance_dirty_pages().
The control algorithms are pretty hard to understand and get right for
all cases.

OK, I'll post results of my experiments up to now, answer some
questions and take a comfortable break. Phooo..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]           ` <20120418065720.GA21485-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
  2012-04-18  7:58             ` Fengguang Wu
@ 2012-04-18  7:58             ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-18  7:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 18, 2012 at 08:57:20AM +0200, Jan Kara wrote:
> On Fri 06-04-12 02:59:34, Wu Fengguang wrote:
> ...
> > > > > Let's please keep the layering clear.  IO limitations will be applied
> > > > > at the block layer and pressure will be formed there and then
> > > > > propagated upwards eventually to the originator.  Sure, exposing the
> > > > > whole information might result in better behavior for certain
> > > > > workloads, but down the road, say, in three or five years, devices
> > > > > which can be shared without worrying too much about seeks might be
> > > > > commonplace and we could be swearing at a disgusting structural mess,
> > > > > and sadly various cgroup support seems to be a prominent source of
> > > > > such design failures.
> > > > 
> > > > Super fast storages are coming which will make us regret to make the
> > > > IO path over complex.  Spinning disks are not going away anytime soon.
> > > > I doubt Google is willing to afford the disk seek costs on its
> > > > millions of disks and has the patience to wait until switching all of
> > > > the spin disks to SSD years later (if it will ever happen).
> > > 
> > > This is new.  Let's keep the damn employer out of the discussion.
> > > While the area I work on is affected by my employment (writeback isn't
> > > even my area BTW), I'm not gonna do something adverse to upstream even
> > > if it's beneficial to google and I'm much more likely to do something
> > > which may hurt google a bit if it's gonna benefit upstream.
> > > 
> > > As for the faster / newer storage argument, that is *exactly* why we
> > > want to keep the layering proper.  Writeback works from the pressure
> > > from the IO stack.  If IO technology changes, we update the IO stack
> > > and writeback still works from the pressure.  It may need to be
> > > adjusted but the principles don't change.
> > 
> > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > It's always there doing 1:1 proportional throttling. Then you try to
> > kick in to add *double* throttling in block/cfq layer. Now the low
> > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > from its balanced state, leading to large fluctuations and program
> > stalls.  This can be avoided by telling balance_dirty_pages(): "your
> > balance goal is no longer 1:1, but 10:1". With this information
> > balance_dirty_pages() will behave right. Then there is the question:
> > if balance_dirty_pages() will work just well provided the information,
> > why bother doing the throttling at low layer and "push back" the
> > pressure all the way up?
>   Fengguang, maybe we should first agree on some basics:
>   The two main goals of balance_dirty_pages() are (and always have been
> AFAIK) to limit amount of dirty pages in memory and keep enough dirty pages
> in memory to allow for efficient writeback. Secondary goals are to also
> keep amount of dirty pages somewhat fair among bdis and processes. Agreed?

Agreed. In fact, before the IO-less change, balance_dirty_pages() had
no much explicit control over the dirty rate and fairness.

> Thus shift to trying to control *IO throughput* (or even just buffered
> write throughput) from balance_dirty_pages() is a fundamental shift in the
> goals of balance_dirty_pages(), not just some tweak (although technically,
> it might be relatively easy to do for buffered writes given the current
> implementation).

Yes, it has been a big shift to the rate-based dirty control.

> ...
> > > Well, I tried and I hope some of it got through.  I also wrote a lot
> > > of questions, mainly regarding how what you have in mind is supposed
> > > to work through what path.  Maybe I'm just not seeing what you're
> > > seeing but I just can't see where all the IOs would go through and
> > > come together.  Can you please elaborate more on that?
> > 
> > What I can see is, it looks pretty simple and natural to let
> > balance_dirty_pages() fill the gap towards a total solution :-)
> > 
> > - add direct IO accounting in some convenient point of the IO path
> >   IO submission or completion point, either is fine.
> > 
> > - change several lines of the buffered write IO controller to
> >   integrate the direct IO rate into the formula to fit the "total
> >   IO" limit
> > 
> > - in future, add more accounting as well as feedback control to make
> >   balance_dirty_pages() work with IOPS and disk time
>   Sorry Fengguang but I also think this is a wrong way to go.
> balance_dirty_pages() must primarily control the amount of dirty pages.
> Trying to bend it to control IO throughput by including direct IO and
> reads in the accounting will just make the logic even more complex than it
> already is.

Right, I have been adding too much complexity to balance_dirty_pages().
The control algorithms are pretty hard to understand and get right for
all cases.

OK, I'll post results of my experiments up to now, answer some
questions and take a comfortable break. Phooo..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                         ` <20120417214831.GE19975-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-18 18:18                           ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-18 18:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Tue, Apr 17, 2012 at 02:48:31PM -0700, Tejun Heo wrote:
[..]

> As for priority inversion through shared request pool, it is a problem
> which needs to be solved regardless of how async IOs are throttled.
> I'm not sure to what extent yet, though.  Different cgroups
> definitely need to be on separate pools, but do we also want to
> distinguish sync and async, and what about ioprio?  Maybe we need a
> hybrid approach with a larger common pool and reserved ones for each
> class?

Currently we have a global pool with separate limits for sync and
async, and no consideration of ioprio. I think to keep it simple we
can just extend the same notion to a per-cgroup pool with internal
limits on sync/async requests, to make sure sync IO does not get
serialized behind async IO. Personally I am not too worried about
async IO prio. It has never worked.
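A minimal sketch of that idea (illustrative model only, not kernel code;
all names here are hypothetical): one request pool per cgroup, with
separate sync/async limits so async requests cannot exhaust the pool and
serialize sync IO behind them.

```python
class RequestPool:
    """Models one cgroup's request pool with per-class limits."""
    def __init__(self, sync_limit, async_limit):
        self.limits = {"sync": sync_limit, "async": async_limit}
        self.in_flight = {"sync": 0, "async": 0}

    def try_get_request(self, is_sync):
        cls = "sync" if is_sync else "async"
        if self.in_flight[cls] >= self.limits[cls]:
            return False          # a real caller would sleep / back off here
        self.in_flight[cls] += 1
        return True

    def put_request(self, is_sync):
        cls = "sync" if is_sync else "async"
        self.in_flight[cls] -= 1

# One pool per cgroup instead of a single global pool:
pools = {"A": RequestPool(sync_limit=64, async_limit=64),
         "B": RequestPool(sync_limit=64, async_limit=64)}

# Even if cgroup A floods async requests, its own sync allocations
# and cgroup B's allocations are unaffected:
for _ in range(200):
    pools["A"].try_get_request(is_sync=False)     # saturates A's async slots
assert pools["A"].in_flight["async"] == 64
assert pools["A"].try_get_request(is_sync=True)   # A's sync still succeeds
assert pools["B"].try_get_request(is_sync=False)  # B unaffected
```

The point of the per-class limits is exactly the priority inversion
mentioned above: with a shared pool, a burst of async writeback requests
could consume every slot and block sync readers.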

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]           ` <20120417223854.GG19975-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-19 14:23             ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-19 14:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Hi Tejun,

On Tue, Apr 17, 2012 at 03:38:54PM -0700, Tejun Heo wrote:
> Hello, Fengguang.
> 
> On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote:
> > Fortunately, the above gap can be easily filled judging from the
> > block/cfq IO controller code. By adding some direct IO accounting
> > and changing several lines of my patches to make use of the collected
> > stats, the semantics of the blkio.throttle.write_bps interfaces can be
> > changed from "limit for direct IO" to "limit for direct+buffered IOs".
> > Ditto for blkio.weight and blkio.write_iops, as long as some
> > iops/device time stats are made available to balance_dirty_pages().
> > 
> > It would be a fairly *easy* change. :-) It's merely adding some
> > accounting code and there is no need to change the block IO
> > controlling algorithm at all. I'll do the work of accounting (which
> > is basically independent of the IO controlling) and use the new stats
> > in balance_dirty_pages().
> 
> I don't really understand how this can work.  For hard limits, maybe,

Yeah, hard limits are the easiest.

> but for proportional IO, you have to know which cgroups have IOs
> before assigning the proportions, so blkcg assigning IO bandwidth
> without knowing async writes simply can't work.
> 
> For example, let's say cgroups A and B have 2:8 split.  If A has IOs
> on queue and B doesn't, blkcg will assign all IO bandwidth to A.  I
> can't wrap my head around how writeback is gonna make use of the
> resulting stats but let's say it decides it needs to put out some IOs
> out for both cgroups.  What happens then?  Do all the async writes go
> through the root cgroup controlled by and affecting the ratio between
> rootcg and cgroup A and B?  Or do they have to be accounted as part of
> cgroups A and B?  If so, what if the added bandwidth goes over the
> limit?  Let's say if we implement overcharge; then, I suppose we'll
> have to communicate that upwards too, right?

The trick is to do the throttling for buffered writes at page dirty
time, when balance_dirty_pages() knows exactly what cgroup the dirtier
task belongs to, the dirty rate and whether or not it's an aggressive
dirtier. The cgroup's direct IO rate can also be measured. The only
missing information is whether it's a non-aggressive direct writer
(only cfq may know about that). Now I'm simply assuming direct writers
are all aggressive.

So if A and B have a 2:8 split and A only submits async IO while B only
submits direct IO, no cfqg will exist for A at all. cfq will be serving
B and the root cgroup in an interleaved fashion. In the patch I just
posted, blkcg_update_dirty_ratelimit() will transfer A's weight of 2 to
the root cgroup for use by the flusher. In the end the flusher gets
weight 2 and B gets weight 8. Here we need to distinguish the weight
assigned by the user from the weight after the async/sync adjustment.
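The weight adjustment described above can be sketched as follows
(hypothetical illustration, not the actual blkcg_update_dirty_ratelimit()
patch): cgroups that issue only buffered writes have no cfqg of their
own, so their user-assigned weight is credited to the flusher in the
root cgroup.

```python
def effective_weights(user_weights, does_direct_io):
    """user_weights: {cgroup: weight}; does_direct_io: {cgroup: bool}.

    Returns the weights after the async/sync adjustment: cgroups doing
    direct IO keep their weight (cfq serves their cfqg), while the
    weight of async-only cgroups is transferred to the root flusher.
    """
    adjusted = {"flusher": 0}
    for cg, w in user_weights.items():
        if does_direct_io[cg]:
            adjusted[cg] = w          # cfq serves this cgroup's own cfqg
        else:
            adjusted["flusher"] += w  # weight moves to the root flusher
    return adjusted

# A (weight 2) submits only async IO, B (weight 8) only direct IO:
w = effective_weights({"A": 2, "B": 8}, {"A": False, "B": True})
assert w == {"flusher": 2, "B": 8}
```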

The other missing information is the real cost when the dirtied pages
eventually hit the disk after perhaps dozens of seconds.  For that
part I'm assuming simple dd at this time and balance_dirty_pages()
is now splitting out the flusher's overall writeout progress to the
dirtier tasks' dirty ratelimit based on bandwidth fairness.

> This is still easy.  What about hierarchical propio?  What happens
> then?  You can't do hierarchical proportional allocation without
> knowing how much IOs are pending for which group.  How is that
> information gonna be communicated between blkcg and writeback?  Are we
> gonna have two separate hierarchical proportional IO allocators?  How
> is that gonna work at all?  If we're gonna have single allocator in
> block layer, writeback would have to feed the amount of IOs it may
> generate into the allocator, get the resulting allocation and then
> issue IO and then block layer again will have to account these to the
> originating cgroups.  It's just crazy.

No, I have no idea how to do a hierarchical proportional IO controller
without physically splitting up the async IO streams. It's pretty hard
and I'd better break out before it drives me crazy.

So in the following discussion, let's assume cfq will move async
requests from the current root cgroup to individual IO issuer's cfqgs
and schedule service for the async streams there. And thus the need to
create "backpressure" for balance_dirty_pages() to eventually throttle
the individual dirtier tasks.

That said, I still don't think we've come up with any satisfactory
solutions. It's a hard problem after all.

> > The only problem I can see now, is that balance_dirty_pages() works
> > per-bdi and blkcg works per-device. So the two ends may not match
> > nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where
> > sdb is shared by lv0 and lv1. However it should be rare situations and
> > be much more acceptable than the problems arise from the "push back"
> > approach which impacts everyone.
> 
> I don't know.  What problems?  AFAICS, the biggest issue is writeback
> of different inodes getting mixed resulting in poor performance, but
> if you think about it, that's about the frequency of switching cgroups
> and a problem which can and should be dealt with from block layer
> (e.g. use larger time slice if all the pending IOs are async).

Yeah, increasing the time slice would help that case. In general it's
not merely about the frequency of switching cgroups, once we take the
hard disk's writeback cache into account.  Think about some inodes with
async IO: A1, A2,
A3, .., and inodes with sync IO: D1, D2, D3, ..., all from different
cgroups. So when the root cgroup holds all async inodes, the cfq may
schedule IO interleavely like this

        A1,    A1,    A1,    A2,    A1,    A2,    ...
           D1,    D2,    D3,    D4,    D5,    D6, ...

Now it becomes

        A1,    A2,    A3,    A4,    A5,    A6,    ...
           D1,    D2,    D3,    D4,    D5,    D6, ...

The difference is that it's now switching the async inodes each time.
At cfq level, the seek costs look the same, however the disk's
writeback cache may help merge the data chunks from the same inode A1.
Well, it may cost some latency for spin disks. But how about SSD? It
can run deeper queue and benefit from large writes.

> Writeback's duty is generating stream of async writes which can be
> served efficiently for the *cgroup* and keeping the buffer filled as
> necessary and chaining the backpressure from there to the actual
> dirtier.  That's what writeback does without cgroup.  Nothing
> fundamental changes with cgroup.  It's just finer grained.

Believe me, physically partitioning the dirty pages and async IO
streams comes at a big cost. It won't scale well in many ways.

For one thing, splitting the request queues will give rise to more
PG_writeback pages.  Those pages have been the biggest source of
latency issues in various parts of the system.

It's not uncommon for me to see filesystems sleep on PG_writeback
pages during heavy writeback, within some lock or transaction, which in
turn stalls many tasks that try to do IO or merely dirty some page in
memory. Random writes are especially susceptible to such stalls. The
stable page feature also vastly increases the chance of stalls by
locking the writeback pages.

Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
the case of direct reclaim, it means blocking random tasks that are
allocating memory in the system.

PG_writeback pages are much worse than PG_dirty pages in that they are
not movable. This makes a big difference for high-order page allocations.
To make room for a 2MB huge page, vmscan has the option to migrate
PG_dirty pages, but for PG_writeback it has no better choices than to
wait for IO completion.

The difficulty of THP allocation goes up *exponentially* with the
number of PG_writeback pages. Assume PG_writeback pages are randomly
distributed in the physical memory space. Then we have the formula

        P(reclaimable for THP) = (1 - P(hit PG_writeback))^512

That's the probability for a contiguous range of 512 pages (one 2MB
huge page) to be free of PG_writeback, so that it's immediately
reclaimable for use as a transparent huge page. This ruby script shows
us the concrete numbers.

irb> 1.upto(10) { |i| j=i/1000.0; printf "%.3f\t\t\t%.3f\n", j, (1-j)**512 }

        P(hit PG_writeback)     P(reclaimable for THP)
        0.001                   0.599
        0.002                   0.359
        0.003                   0.215
        0.004                   0.128
        0.005                   0.077
        0.006                   0.046
        0.007                   0.027
        0.008                   0.016
        0.009                   0.010
        0.010                   0.006

The numbers show that when the PG_writeback pages go up from 0.1% to
1% of system memory, the THP reclaim success ratio drops quickly from
60% to 0.6%. It indicates that in order to use THP without constantly
running into stalls, the reasonable PG_writeback ratio is <= 0.1%.
Going beyond that threshold, it quickly becomes intolerable.
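For reference, the table above can be reproduced with a small commented
script (equivalent to the irb one-liner, under the same assumption of
independently, randomly placed writeback pages):

```python
def p_reclaimable(p_writeback, run_length=512):
    """Probability that a contiguous run of `run_length` pages
    (512 x 4KB = one 2MB THP) contains no PG_writeback page."""
    return (1.0 - p_writeback) ** run_length

# Same sweep as the table: P(hit PG_writeback) from 0.1% to 1.0%
for i in range(1, 11):
    p = i / 1000.0
    print("%.3f\t%.3f" % (p, p_reclaimable(p)))
```

Running it prints 0.599 for a 0.1% writeback ratio and 0.006 for 1%,
matching the table.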

That makes a limit of 256MB writeback pages for a mem=256GB system.
Looking at the real vmstat:nr_writeback numbers in dd write tests:

JBOD-12SSD-thresh=8G/ext4-1dd-1-3.3.0/vmstat-end:nr_writeback 217009
JBOD-12SSD-thresh=8G/ext4-10dd-1-3.3.0/vmstat-end:nr_writeback 198335
JBOD-12SSD-thresh=8G/xfs-1dd-1-3.3.0/vmstat-end:nr_writeback 306026
JBOD-12SSD-thresh=8G/xfs-10dd-1-3.3.0/vmstat-end:nr_writeback 315099
JBOD-12SSD-thresh=8G/btrfs-1dd-1-3.3.0/vmstat-end:nr_writeback 1216058
JBOD-12SSD-thresh=8G/btrfs-10dd-1-3.3.0/vmstat-end:nr_writeback 895335

Oops btrfs has 4GB writeback pages -- which asks for some bug fixing.
Even ext4's 800MB still looks way too high, but that's ~1s worth of
data per queue (or 130ms worth of data for the high performance Intel
SSD, which is perhaps in danger of queue underruns?). So this system
would require 512GB memory to comfortably run KVM instances with THP
support.

Judging from the above numbers, we can hardly afford to split up the
IO queues and proliferate writeback pages.

It's worth noting that running multiple flusher threads per bdi means
not only more disk seeks on spinning disks and smaller IO sizes on
SSDs, but also lock contention and cacheline bouncing for
metadata-heavy workloads and fast storage.

To give some concrete examples on how much CPU overheads can be saved
by reducing multiple IO submitters, here are some summaries for the
IO-less dirty throttling gains. Tests show that it yields huge
benefits for reducing IO seeks as well as CPU overheads.

For example, the fs_mark benchmark on a 12-drive software RAID0 goes
from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock
contention". (by Dave Chinner)

- "CPU usage has dropped by ~55%", "it certainly appears that most of
  the CPU time saving comes from the removal of contention on the
  inode_wb_list_lock"
  (IMHO at least 10% comes from the reduction of cacheline bouncing,
  because the new code is able to call much less frequently into
  balance_dirty_pages() and hence access the _global_ page states)

- the user space "App overhead" is reduced by 20%, by avoiding the
  cacheline pollution by the complex writeback code path

- "for a ~5% throughput reduction", "the number of write IOs have
  dropped by ~25%", and the elapsed time reduced from 41:42.17 to
  40:53.23.

And for simple dd tests

- "throughput for a _single_ large dd (100GB) increase from ~650MB/s
  to 700MB/s"

- "On a simple test of 100 dd, it reduces the CPU %system time from
  30% to 3%, and improves IO throughput from 38MB/s to 42MB/s."

> > > No, no, it's not about standing in my way.  As Vivek said in the other
> > > reply, it's that the "gap" that you filled was created *because*
> > > writeback wasn't cgroup aware and now you're in turn filling that gap
> > > by making writeback work around that "gap".  I mean, my mind boggles.
> > > Doesn't yours?  I strongly believe everyone's should.
> > 
> > Heh. It's a hard problem indeed. I felt great pains in the IO-less
> > dirty throttling work. I did a lot reasoning about it, and have in
> > fact kept cgroup IO controller in mind since its early days. Now I'd
> > say it's hands down for it to adapt to the gap between the total IO
> > limit and what's carried out by the block IO controller.
> 
> You're not providing any valid counter arguments about the issues
> being raised about the messed up design.  How is anything "hands down"
> here?

Yeah, sadly it turns out not to be "hands down" when it comes to the
proportional async/sync splits, and it's even prohibitive when it comes
to the hierarchical support..

> > > There's where I'm confused.  How is the said split supposed to work?
> > > They aren't independent.  I mean, who gets to decide what and where
> > > are those decisions enforced?
> > 
> > Yeah it's not independent. It's about
> > 
> > - keep block IO cgroup untouched (in its current algorithm, for
> >   throttling direct IO)
> > 
> > - let balance_dirty_pages() adapt to the throttling target
> >   
> >         buffered_write_limit = total_limit - direct_IOs
> 
> Think about proportional allocation.  You don't have a number until
> you know who have pending IOs and how much.

We have the IO rate. The above formula actually works on "rates",
which is good enough for calculating the ratelimit for buffered
writes. We don't have to know every transient state of the pending
IOs: the direct IOs are handled by cfq based on cfqg weight, and for
async IOs there are plenty of dirty pages for buffering/tolerating
small errors in the dirty rate control.
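To make the rate arithmetic concrete, here is a minimal sketch of the
"buffered_write_limit = total_limit - direct_IOs" formula applied to
measured rates. All names are illustrative, not actual kernel symbols:

```python
def buffered_write_ratelimit(total_limit_bps, direct_io_rate_bps, nr_dirtiers):
    """buffered_write_limit = total_limit - direct_IOs, expressed as
    rates, then split equally among the dirtier tasks. The pool of
    dirty pages buffers any small error in these estimates."""
    buffered_limit = max(total_limit_bps - direct_io_rate_bps, 0)
    return buffered_limit / max(nr_dirtiers, 1)

# e.g. a 100 MB/s total limit, 30 MB/s of measured direct IO, 2 dirtiers:
print(buffered_write_ratelimit(100 << 20, 30 << 20, 2) / (1 << 20))  # 35.0
```

Note this only needs the measured direct-IO rate, not the transient
state of the pending IOs.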

> > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > It's always there doing 1:1 proportional throttling. Then you try to
> > kick in to add *double* throttling in block/cfq layer. Now the low
> > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > from its balanced state, leading to large fluctuations and program
> > stalls.
> 
> Just do the same 1:1 inside each cgroup.

Sure. But the ratio mismatch I'm talking about is inter-cgroup.
For example, suppose there are only 2 dd tasks doing buffered writes
in the system. Now consider the mismatch when cfq is dispatching their
IO requests at 10:1 weights, while balance_dirty_pages() is throttling
the dd tasks at a 1:1 equal split because it's not aware of the cgroup
weights.

What will happen in the end? The 1:1 ratio imposed by
balance_dirty_pages() will take effect and the dd tasks will progress
at the same pace. The cfq weights will be defeated because the async
queue for the second dd (and cgroup) constantly runs empty.
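The outcome of this 2-dd example can be shown with a toy model, under
the assumption that cfq's proportional scheduling is work-conserving
(spare capacity is never left idle); names are hypothetical:

```python
def effective_rates(disk_bps, bdp_shares, cfq_weights):
    """Toy model: balance_dirty_pages() splits the dirty rate by
    bdp_shares (1:1 when it is unaware of cgroup weights), and a
    work-conserving cfq drains whatever is supplied as long as the
    total fits the disk -- so cfq_weights never get to bind."""
    supply = [disk_bps * s / sum(bdp_shares) for s in bdp_shares]
    # total supply fits the disk: every async queue is fully drained,
    # the lighter queue simply runs empty and the weights are defeated
    assert sum(supply) <= disk_bps
    return supply

# cfq wants 10:1, but the 1:1 throttling upstream wins:
print(effective_rates(100, [1, 1], [10, 1]))   # [50.0, 50.0]
# if balance_dirty_pages() is told the 10:1 goal, the split is restored
# (roughly 90.9 vs 9.1):
print(effective_rates(100, [10, 1], [10, 1]))
```

The second call illustrates why handing the weight information up to
balance_dirty_pages() is enough in this toy setting.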

> >  This can be avoided by telling balance_dirty_pages(): "your
> > balance goal is no longer 1:1, but 10:1". With this information
> > balance_dirty_pages() will behave right. Then there is the question:
> > if balance_dirty_pages() will work just well provided the information,
> > why bother doing the throttling at low layer and "push back" the
> > pressure all the way up?
> 
> Because splitting a resource into two pieces arbitrarily with
> different amount of consumptions on each side and then applying the
> same proportion on both doesn't mean anything?

Sorry, I don't quite catch your words here.

> > The balance_dirty_pages() is already deeply involved in dirty throttling.
> > As you can see from this patchset, the same algorithms can be extended
> > trivially to work with cgroup IO limits.
> > 
> > buffered write IO controller in balance_dirty_pages()
> > https://lkml.org/lkml/2012/3/28/275
> 
> It is half broken thing with fundamental design flaws which can't be
> corrected without complete reimplementation.  I don't know what to
> say.

I'm fully aware of that, and so have been exploring new ways out :)

> > In the "back pressure" scheme, memcg is a must because only it has all
> > the infrastructure to track dirty pages upon which you can apply some
> > dirty_limits. Don't tell me you want to account dirty pages in blkcg...
> 
> For now, per-inode tracking seems good enough.

There are actually two directions of information passing.

1) Pass the dirtier ownership down to the bio. For this part, it's
   mostly enough to do lightweight per-inode tracking.

2) Pass the backpressure up, from cfq (IO dispatch) to the flusher (IO
   submit) as well as to balance_dirty_pages() (to actually throttle
   the dirtier tasks). The flusher naturally works at inode
   granularity. However, balance_dirty_pages() is about limiting dirty
   pages. For this part, it needs to know the total number of dirty
   pages and the writeout bandwidth for each cgroup in order to do
   proper dirty throttling, and to maintain a proper number of dirty
   pages to avoid the queue-underrun issue explained in the 2-dd
   example above.
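A sketch of the per-cgroup state this second direction implies, and
how it could feed a per-cgroup ratelimit. All names are hypothetical,
and the feedback term is only a crude stand-in for the kernel's
position-ratio control, not its actual algorithm:

```python
from dataclasses import dataclass

@dataclass
class CgroupDirtyState:
    nr_dirty: int        # dirty pages currently owned by the cgroup
    dirty_limit: int     # target number of dirty pages for the cgroup
    writeout_bw: float   # measured writeback bandwidth, bytes/s
    nr_dirtiers: int     # tasks actively dirtying in the cgroup

def dirty_ratelimit(st):
    """Split the cgroup's writeout bandwidth among its dirtiers, then
    scale the result to steer nr_dirty toward dirty_limit (above the
    limit -> slow down, below -> speed up), clamped to damp large
    fluctuations."""
    base = st.writeout_bw / max(st.nr_dirtiers, 1)
    pos = st.dirty_limit / max(st.nr_dirty, 1)
    return base * min(max(pos, 0.25), 4.0)

# balanced case: dirtying matches writeout, split between 2 tasks
print(dirty_ratelimit(CgroupDirtyState(1000, 1000, 100e6, 2)))  # 50000000.0
```

The point is only that both pieces of state (dirty page count and
writeout bandwidth) must be available per cgroup for any such formula
to exist.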

> > What I can see is, it looks pretty simple and nature to let
> > balance_dirty_pages() fill the gap towards a total solution :-)
> > 
> > - add direct IO accounting in some convenient point of the IO path
> >   IO submission or completion point, either is fine.
> > 
> > - change several lines of the buffered write IO controller to
> >   integrate the direct IO rate into the formula to fit the "total
> >   IO" limit
> > 
> > - in future, add more accounting as well as feedback control to make
> >   balance_dirty_pages() work with IOPS and disk time
> 
> To me, you seem to be not addressing the issues I've been raising at
> all and just repeating the same points again and again.  If I'm
> misunderstanding something, please point out.

Hopefully the renewed patch can address some of your questions. It's a
pity that I didn't think about the hierarchical requirement at the
time; otherwise the complexity of the calculations would still look
manageable.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread


from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock
contention". (by Dave Chinner)

- "CPU usage has dropped by ~55%", "it certainly appears that most of
  the CPU time saving comes from the removal of contention on the
  inode_wb_list_lock"
  (IMHO at least 10% comes from the reduction of cacheline bouncing,
  because the new code is able to call much less frequently into
  balance_dirty_pages() and hence access the _global_ page states)

- the user space "App overhead" is reduced by 20%, by avoiding the
  cacheline pollution by the complex writeback code path

- "for a ~5% throughput reduction", "the number of write IOs have
  dropped by ~25%", and the elapsed time reduced from 41:42.17 to
  40:53.23.

And for simple dd tests

- "throughput for a _single_ large dd (100GB) increase from ~650MB/s
  to 700MB/s"

- "On a simple test of 100 dd, it reduces the CPU %system time from
  30% to 3%, and improves IO throughput from 38MB/s to 42MB/s."

> > > No, no, it's not about standing in my way.  As Vivek said in the other
> > > reply, it's that the "gap" that you filled was created *because*
> > > writeback wasn't cgroup aware and now you're in turn filling that gap
> > > by making writeback work around that "gap".  I mean, my mind boggles.
> > > Doesn't yours?  I strongly believe everyone's should.
> > 
> > Heh. It's a hard problem indeed. I felt great pains in the IO-less
> > dirty throttling work. I did a lot reasoning about it, and have in
> > fact kept cgroup IO controller in mind since its early days. Now I'd
> > say it's hands down for it to adapt to the gap between the total IO
> > limit and what's carried out by the block IO controller.
> 
> You're not providing any valid counter arguments about the issues
> being raised about the messed up design.  How is anything "hands down"
> here?

Yeah sadly, it turns out to be not "hands down" when it comes to the
proportional async/sync splits, and it's even prohibiting when comes
to the hierarchical support..

> > > There's where I'm confused.  How is the said split supposed to work?
> > > They aren't independent.  I mean, who gets to decide what and where
> > > are those decisions enforced?
> > 
> > Yeah it's not independent. It's about
> > 
> > - keep block IO cgroup untouched (in its current algorithm, for
> >   throttling direct IO)
> > 
> > - let balance_dirty_pages() adapt to the throttling target
> >   
> >         buffered_write_limit = total_limit - direct_IOs
> 
> Think about proportional allocation.  You don't have a number until
> you know who have pending IOs and how much.

We have the IO rate. The above formula is actually working on "rates".
That's good enough for calculating the ratelimit for buffered writes.
We don't have to know every transient states of the pending IOs.
Because the direct IOs are handled by cfq based on cfqg weight and 
for async IOs, there are plenty of dirty pages for
buffering/tolerating small errors in the dirty rate control.

> > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > It's always there doing 1:1 proportional throttling. Then you try to
> > kick in to add *double* throttling in block/cfq layer. Now the low
> > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > from its balanced state, leading to large fluctuations and program
> > stalls.
> 
> Just do the same 1:1 inside each cgroup.

Sure. But the ratio mismatch I'm talking about is inter-cgroup.
For example there are only 2 dd tasks doing buffered writes in the
system. Now consider the mismatch that cfq is dispatching their IO
requests at 10:1 weights, while balance_dirty_pages() is throttling
the dd tasks at 1:1 equal split because it's not aware of the cgroup
weights.

What will happen in the end? The 1:1 ratio imposed by
balance_dirty_pages() will take effect and the dd tasks will progress
at the same pace. The cfq weights will be defeated because the async
queue for the second dd (and cgroup) constantly runs empty.

> >  This can be avoided by telling balance_dirty_pages(): "your
> > balance goal is no longer 1:1, but 10:1". With this information
> > balance_dirty_pages() will behave right. Then there is the question:
> > if balance_dirty_pages() will work just well provided the information,
> > why bother doing the throttling at low layer and "push back" the
> > pressure all the way up?
> 
> Because splitting a resource into two pieces arbitrarily with
> different amount of consumptions on each side and then applying the
> same proportion on both doesn't mean anything?

Sorry, I don't quite catch your words here.

> > The balance_dirty_pages() is already deeply involved in dirty throttling.
> > As you can see from this patchset, the same algorithms can be extended
> > trivially to work with cgroup IO limits.
> > 
> > buffered write IO controller in balance_dirty_pages()
> > https://lkml.org/lkml/2012/3/28/275
> 
> It is half broken thing with fundamental design flaws which can't be
> corrected without complete reimplementation.  I don't know what to
> say.

I'm fully aware of that, and so have been exploring new ways out :)

> > In the "back pressure" scheme, memcg is a must because only it has all
> > the infrastructure to track dirty pages upon which you can apply some
> > dirty_limits. Don't tell me you want to account dirty pages in blkcg...
> 
> For now, per-inode tracking seems good enough.

There are actually two directions of information passing.

1) pass the dirtier ownership down to bio. For this part, it's mostly
   enough to do the light weight per-inode tracking.

2) pass the backpressure up, from cfq (IO dispatch) to flusher (IO
submit) as well as to balance_dirty_pages() (to actually throttle the
dirty tasks). The flusher naturally works on inode granularities.
However balance_dirty_pages() is about limiting dirty pages. For this
part, it needs to know the total number of dirty pages and writeout
bandwidth for each cgroup in order to do proper dirty throttling. And
to maintain proper number of dirty pages to avoid the queue underrun
issue explained in the above 2-dd example.

> > What I can see is, it looks pretty simple and nature to let
> > balance_dirty_pages() fill the gap towards a total solution :-)
> > 
> > - add direct IO accounting in some convenient point of the IO path
> >   IO submission or completion point, either is fine.
> > 
> > - change several lines of the buffered write IO controller to
> >   integrate the direct IO rate into the formula to fit the "total
> >   IO" limit
> > 
> > - in future, add more accounting as well as feedback control to make
> >   balance_dirty_pages() work with IOPS and disk time
> 
> To me, you seem to be not addressing the issues I've been raising at
> all and just repeating the same points again and again.  If I'm
> misunderstanding something, please point out.

Hopefully the renewed patch can dismiss some of your questions. It's a
pity that I didn't thought about the hierarchical requirement at the
time. Otherwise the complexity of calculations still looks manageable.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-19 14:23             ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-19 14:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hi Tejun,

On Tue, Apr 17, 2012 at 03:38:54PM -0700, Tejun Heo wrote:
> Hello, Fengguang.
> 
> On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote:
> > Fortunately, the above gap can be easily filled judging from the
> > block/cfq IO controller code. By adding some direct IO accounting
> > and changing several lines of my patches to make use of the collected
> > stats, the semantics of the blkio.throttle.write_bps interfaces can be
> > changed from "limit for direct IO" to "limit for direct+buffered IOs".
> > Ditto for blkio.weight and blkio.write_iops, as long as some
> > iops/device time stats are made available to balance_dirty_pages().
> > 
> > It would be a fairly *easy* change. :-) It's merely adding some
> > accounting code and there is no need to change the block IO
> > controlling algorithm at all. I'll do the work of accounting (which
> > is basically independent of the IO controlling) and use the new stats
> > in balance_dirty_pages().
> 
> I don't really understand how this can work.  For hard limits, maybe,

Yeah, hard limits are the easiest.

> but for proportional IO, you have to know which cgroups have IOs
> before assigning the proportions, so blkcg assigning IO bandwidth
> without knowing async writes simply can't work.
> 
> For example, let's say cgroups A and B have 2:8 split.  If A has IOs
> on queue and B doesn't, blkcg will assign all IO bandwidth to A.  I
> can't wrap my head around how writeback is gonna make use of the
> resulting stats but let's say it decides it needs to put out some IOs
> out for both cgroups.  What happens then?  Do all the async writes go
> through the root cgroup controlled by and affecting the ratio between
> rootcg and cgroup A and B?  Or do they have to be accounted as part of
> cgroups A and B?  If so, what if the added bandwidth goes over the
> limit?  Let's say if we implement overcharge; then, I suppose we'll
> have to communicate that upwards too, right?

The trick is to do the throttling for buffered writes at page dirty
time, when balance_dirty_pages() knows exactly which cgroup the
dirtier task belongs to, its dirty rate, and whether it's an
aggressive dirtier. The cgroup's direct IO rate can also be measured.
The only missing information is whether a task is a non-aggressive
direct writer (only cfq may know about that). For now I'm simply
assuming all direct writers are aggressive.

So if A and B have a 2:8 split, and A only submits async IO while B
only submits direct IO, no cfqg will exist for A at all. cfq will
serve B and the root cgroup alternately. In the patch I just posted,
blkcg_update_dirty_ratelimit() will transfer A's weight of 2 to the
root cgroup for use by the flusher. In the end the flusher gets
weight 2 and B gets weight 8. Here we need to distinguish between the
weight assigned by the user and the weight after the async/sync
adjustment.

The other missing information is the real cost incurred when the
dirtied pages eventually hit the disk, perhaps dozens of seconds
later. For that part I'm assuming simple dd workloads for now, and
balance_dirty_pages() splits the flusher's overall writeout progress
into the dirtier tasks' dirty ratelimits based on bandwidth fairness.

> This is still easy.  What about hierarchical propio?  What happens
> then?  You can't do hierarchical proportional allocation without
> knowing how much IOs are pending for which group.  How is that
> information gonna be communicated between blkcg and writeback?  Are we
> gonna have two separate hierarchical proportional IO allocators?  How
> is that gonna work at all?  If we're gonna have single allocator in
> block layer, writeback would have to feed the amount of IOs it may
> generate into the allocator, get the resulting allocation and then
> issue IO and then block layer again will have to account these to the
> originating cgroups.  It's just crazy.

No, I have no idea how to do a hierarchical proportional IO
controller without physically splitting up the async IO streams.
It's pretty hard, and I'd better step back before it drives me crazy.

So in the following discussion, let's assume cfq will move async
requests from the current root cgroup to the individual IO issuers'
cfqgs and schedule service for the async streams there, hence the
need to create "backpressure" for balance_dirty_pages() to eventually
throttle the individual dirtier tasks.

That said, I still don't think we've come up with any satisfactory
solution. It's a hard problem after all.

> > The only problem I can see now, is that balance_dirty_pages() works
> > per-bdi and blkcg works per-device. So the two ends may not match
> > nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where
> > sdb is shared by lv0 and lv1. However it should be rare situations and
> > be much more acceptable than the problems arise from the "push back"
> > approach which impacts everyone.
> 
> I don't know.  What problems?  AFAICS, the biggest issue is writeback
> of different inodes getting mixed resulting in poor performance, but
> if you think about it, that's about the frequency of switching cgroups
> and a problem which can and should be dealt with from block layer
> (e.g. use larger time slice if all the pending IOs are async).

Yeah, increasing the time slice would help that case. In general
it's not merely the frequency of switching cgroups, if we take the
hard disk's writeback cache into account. Think about some inodes
with async IO: A1, A2, A3, ..., and inodes with sync IO: D1, D2, D3,
..., all from different cgroups. When the root cgroup holds all the
async inodes, cfq may schedule IO in an interleaved fashion like this

        A1,    A1,    A1,    A2,    A1,    A2,    ...
           D1,    D2,    D3,    D4,    D5,    D6, ...

Now it becomes

        A1,    A2,    A3,    A4,    A5,    A6,    ...
           D1,    D2,    D3,    D4,    D5,    D6, ...

The difference is that it now switches async inodes each time. At the
cfq level the seek costs look the same; however, the disk's writeback
cache may help merge the data chunks from the same inode A1. Well, it
may cost some latency on spinning disks. But how about SSDs? They can
run deeper queues and benefit from large writes.

> Writeback's duty is generating stream of async writes which can be
> served efficiently for the *cgroup* and keeping the buffer filled as
> necessary and chaining the backpressure from there to the actual
> dirtier.  That's what writeback does without cgroup.  Nothing
> fundamental changes with cgroup.  It's just finer grained.

Believe me, physically partitioning the dirty pages and async IO
streams comes at a big cost. It won't scale well in many ways.

For instance, splitting the request queues will give rise to many
more PG_writeback pages. Those pages have been the biggest source of
latency issues in various parts of the system.

It's not uncommon for me to see filesystems sleep on PG_writeback
pages during heavy writeback, within some lock or transaction, which
in turn stalls many tasks that try to do IO or merely dirty some page
in memory. Random writes are especially susceptible to such stalls.
The stable-pages feature also vastly increases the chance of stalls
by locking the writeback pages.

Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
the case of direct reclaim, it means blocking random tasks that are
allocating memory in the system.

PG_writeback pages are much worse than PG_dirty pages in that they
are not movable. This makes a big difference for high-order page
allocations. To make room for a 2MB huge page, vmscan has the option
of migrating PG_dirty pages, but for PG_writeback pages it has no
better choice than to wait for IO completion.

The difficulty of THP allocation goes up *exponentially* with the
number of PG_writeback pages. Assume PG_writeback pages are randomly
distributed in the physical memory space. Then we have the formula

        P(reclaimable for THP) = (1 - P(hit PG_writeback))^512

That's the probability that a contiguous range of 512 pages (one 2MB
huge page) is entirely free of PG_writeback, so that it's immediately
reclaimable for use as a transparent huge page. This ruby script
shows the concrete numbers.

irb> 1.upto(10) { |i| j=i/1000.0; printf "%.3f\t\t\t%.3f\n", j, (1-j)**512 }

        P(hit PG_writeback)     P(reclaimable for THP)
        0.001                   0.599
        0.002                   0.359
        0.003                   0.215
        0.004                   0.128
        0.005                   0.077
        0.006                   0.046
        0.007                   0.027
        0.008                   0.016
        0.009                   0.010
        0.010                   0.006

The numbers show that when the PG_writeback pages go up from 0.1% to
1% of system memory, the THP reclaim success ratio drops quickly from
60% to 0.6%. This indicates that in order to use THP without
constantly running into stalls, the PG_writeback ratio needs to stay
at or below 0.1%. Beyond that threshold it quickly becomes
intolerable.

That makes a limit of 256MB of writeback pages for a mem=256GB
system. Looking at the real vmstat:nr_writeback numbers in dd write
tests:

JBOD-12SSD-thresh=8G/ext4-1dd-1-3.3.0/vmstat-end:nr_writeback 217009
JBOD-12SSD-thresh=8G/ext4-10dd-1-3.3.0/vmstat-end:nr_writeback 198335
JBOD-12SSD-thresh=8G/xfs-1dd-1-3.3.0/vmstat-end:nr_writeback 306026
JBOD-12SSD-thresh=8G/xfs-10dd-1-3.3.0/vmstat-end:nr_writeback 315099
JBOD-12SSD-thresh=8G/btrfs-1dd-1-3.3.0/vmstat-end:nr_writeback 1216058
JBOD-12SSD-thresh=8G/btrfs-10dd-1-3.3.0/vmstat-end:nr_writeback 895335

Oops, btrfs has ~4GB of writeback pages -- which calls for some bug
fixing. Even ext4's ~800MB still looks way too high, though that's
~1s worth of data per queue (or 130ms worth of data for the
high-performance Intel SSDs, which are perhaps in danger of queue
underruns?). So this system would require 512GB of memory to
comfortably run KVM instances with THP support.
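
For reference, the byte figures discussed here follow directly from
the page counts above (4KB pages, 12 disks in the JBOD):

```python
PAGE_SIZE = 4096  # bytes

nr_writeback = {
    "ext4-1dd":   217009,
    "ext4-10dd":  198335,
    "xfs-1dd":    306026,
    "xfs-10dd":   315099,
    "btrfs-1dd":  1216058,
    "btrfs-10dd": 895335,
}

for test, pages in nr_writeback.items():
    mb = pages * PAGE_SIZE / 2**20
    print(f"{test:12s} {mb:8.0f} MB  ({mb / 12:5.0f} MB per disk)")
# btrfs peaks at ~4.6GB of PG_writeback pages; ext4's ~800MB works
# out to roughly 70MB, i.e. about 1s worth of writeback, per disk
```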

Judging from the above numbers, we can hardly afford to split up the
IO queues and proliferate writeback pages.

It's worth noting that running multiple flusher threads per bdi
means not only more disk seeks on spinning disks and smaller IO sizes
on SSDs, but also lock contention and cacheline bouncing for
metadata-heavy workloads on fast storage.

To give some concrete examples of how much CPU overhead can be saved
by reducing the number of IO submitters, here are some summaries of
the IO-less dirty throttling gains. Tests show that it yields huge
benefits in reduced IO seeks as well as CPU overhead.

For example, the fs_mark benchmark on a 12-drive software RAID0 goes
from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock
contention". (by Dave Chinner)

- "CPU usage has dropped by ~55%", "it certainly appears that most of
  the CPU time saving comes from the removal of contention on the
  inode_wb_list_lock"
  (IMHO at least 10% comes from the reduction of cacheline bouncing,
  because the new code is able to call much less frequently into
  balance_dirty_pages() and hence access the _global_ page states)

- the user space "App overhead" is reduced by 20%, by avoiding the
  cacheline pollution by the complex writeback code path

- "for a ~5% throughput reduction", "the number of write IOs have
  dropped by ~25%", and the elapsed time reduced from 41:42.17 to
  40:53.23.

And for simple dd tests

- "throughput for a _single_ large dd (100GB) increase from ~650MB/s
  to 700MB/s"

- "On a simple test of 100 dd, it reduces the CPU %system time from
  30% to 3%, and improves IO throughput from 38MB/s to 42MB/s."

> > > No, no, it's not about standing in my way.  As Vivek said in the other
> > > reply, it's that the "gap" that you filled was created *because*
> > > writeback wasn't cgroup aware and now you're in turn filling that gap
> > > by making writeback work around that "gap".  I mean, my mind boggles.
> > > Doesn't yours?  I strongly believe everyone's should.
> > 
> > Heh. It's a hard problem indeed. I felt great pains in the IO-less
> > dirty throttling work. I did a lot reasoning about it, and have in
> > fact kept cgroup IO controller in mind since its early days. Now I'd
> > say it's hands down for it to adapt to the gap between the total IO
> > limit and what's carried out by the block IO controller.
> 
> You're not providing any valid counter arguments about the issues
> being raised about the messed up design.  How is anything "hands down"
> here?

Yeah, sadly it turns out not to be "hands down" when it comes to the
proportional async/sync splits, and it's even prohibitive when it
comes to hierarchical support.

> > > There's where I'm confused.  How is the said split supposed to work?
> > > They aren't independent.  I mean, who gets to decide what and where
> > > are those decisions enforced?
> > 
> > Yeah it's not independent. It's about
> > 
> > - keep block IO cgroup untouched (in its current algorithm, for
> >   throttling direct IO)
> > 
> > - let balance_dirty_pages() adapt to the throttling target
> >   
> >         buffered_write_limit = total_limit - direct_IOs
> 
> Think about proportional allocation.  You don't have a number until
> you know who have pending IOs and how much.

We have the IO rate. The above formula actually works on "rates",
which is good enough for calculating the ratelimit for buffered
writes. We don't have to know every transient state of the pending
IOs, because the direct IOs are handled by cfq based on cfqg weight,
and for async IOs there are plenty of dirty pages for
buffering/tolerating small errors in the dirty rate control.
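
As a sketch of that rate arithmetic (the function and its parameters
are illustrative, not from the patch; the kernel tracks per-cgroup
dirty ratelimits rather than a single helper like this):

```python
def buffered_write_ratelimit(total_bps_limit, measured_direct_bps, n_dirtiers):
    """buffered_write_limit = total_limit - direct_IOs, computed on
    measured *rates*; small measurement errors are absorbed by the
    pool of dirty pages, as argued above.  The aggregate buffered
    rate is then split among the cgroup's dirtier tasks."""
    buffered_bps = max(total_bps_limit - measured_direct_bps, 0)
    return buffered_bps / n_dirtiers if n_dirtiers else 0.0

# 100 MB/s total limit, 30 MB/s of measured direct IO, 2 buffered writers:
print(buffered_write_ratelimit(100 << 20, 30 << 20, 2) / 2**20)  # 35.0
```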

> > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > It's always there doing 1:1 proportional throttling. Then you try to
> > kick in to add *double* throttling in block/cfq layer. Now the low
> > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > from its balanced state, leading to large fluctuations and program
> > stalls.
> 
> Just do the same 1:1 inside each cgroup.

Sure. But the ratio mismatch I'm talking about is inter-cgroup.
For example, suppose there are only 2 dd tasks doing buffered writes
in the system, and consider the mismatch when cfq is dispatching
their IO requests at 10:1 weights while balance_dirty_pages() is
throttling the dd tasks at a 1:1 equal split, because it's not aware
of the cgroup weights.

What will happen in the end? The 1:1 ratio imposed by
balance_dirty_pages() will take effect and the dd tasks will progress
at the same pace. The cfq weights will be defeated because the async
queue for the second dd (and cgroup) constantly runs empty.
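
The defeat of the cfq weights can be seen in a toy steady-state
model: each async queue is refilled at the equal rate imposed by
balance_dirty_pages(), a queue cannot be drained faster than it
fills, and a work-conserving cfq shifts any unused service to the
still-backlogged queue (the numbers and simplifications here are
mine, for illustration only):

```python
def steady_state_rates(weights, disk_bw):
    """Steady-state drain rate per dd when balance_dirty_pages()
    splits the dirtying 1:1 while cfq offers a work-conserving
    weighted split of the disk bandwidth."""
    fill = disk_bw / 2                        # 1:1 dirtying split
    offer = [disk_bw * w / sum(weights) for w in weights]
    # a queue can't be drained faster than it is refilled
    serve = [min(fill, o) for o in offer]
    leftover = disk_bw - sum(serve)
    # work-conserving: unused service shifts to backlogged queues
    return [min(fill, s + leftover) if s < fill else s for s in serve]

# cfq weights 10:1, balance_dirty_pages() throttling 1:1, 100 MB/s disk:
print(steady_state_rates([10, 1], 100))
# both dds end up progressing at ~50 MB/s -- the weights are defeated
```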

> >  This can be avoided by telling balance_dirty_pages(): "your
> > balance goal is no longer 1:1, but 10:1". With this information
> > balance_dirty_pages() will behave right. Then there is the question:
> > if balance_dirty_pages() will work just well provided the information,
> > why bother doing the throttling at low layer and "push back" the
> > pressure all the way up?
> 
> Because splitting a resource into two pieces arbitrarily with
> different amount of consumptions on each side and then applying the
> same proportion on both doesn't mean anything?

Sorry, I don't quite follow you here.

> > The balance_dirty_pages() is already deeply involved in dirty throttling.
> > As you can see from this patchset, the same algorithms can be extended
> > trivially to work with cgroup IO limits.
> > 
> > buffered write IO controller in balance_dirty_pages()
> > https://lkml.org/lkml/2012/3/28/275
> 
> It is half broken thing with fundamental design flaws which can't be
> corrected without complete reimplementation.  I don't know what to
> say.

I'm fully aware of that, and so have been exploring new ways out :)

> > In the "back pressure" scheme, memcg is a must because only it has all
> > the infrastructure to track dirty pages upon which you can apply some
> > dirty_limits. Don't tell me you want to account dirty pages in blkcg...
> 
> For now, per-inode tracking seems good enough.

There are actually two directions of information passing.

1) Pass the dirtier ownership down to the bio. For this part, it's
   mostly enough to do lightweight per-inode tracking.

2) Pass the backpressure up, from cfq (IO dispatch) to the flusher
   (IO submission) as well as to balance_dirty_pages() (which
   actually throttles the dirtier tasks). The flusher naturally works
   at inode granularity. However, balance_dirty_pages() is about
   limiting dirty pages; for this part it needs to know the total
   number of dirty pages and the writeout bandwidth of each cgroup in
   order to do proper dirty throttling, and to maintain a proper
   number of dirty pages to avoid the queue-underrun issue explained
   in the 2-dd example above.
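
A schematic of the throttling inputs listed in point 2, loosely in
the spirit of balance_dirty_pages()'s position control but much
simplified (the linear pos_ratio below is my illustration, not the
kernel's actual control curve):

```python
def cgroup_dirty_ratelimit(nr_dirty, dirty_limit, writeout_bw, n_dirtiers):
    """Per-task dirty ratelimit for one cgroup: scale the cgroup's
    measured writeout bandwidth by how far its dirty page count sits
    below its limit, then split the result among its dirtier tasks.
    Dirtying stops entirely once the limit is reached."""
    if nr_dirty >= dirty_limit:
        return 0.0
    pos_ratio = 1.0 - nr_dirty / dirty_limit  # crude position control
    return writeout_bw * pos_ratio / n_dirtiers

# cgroup writing out at 80 MB/s, halfway to its dirty limit, 4 dirtiers:
print(cgroup_dirty_ratelimit(nr_dirty=2**15, dirty_limit=2**16,
                             writeout_bw=80.0, n_dirtiers=4))  # 10.0
```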

> > What I can see is, it looks pretty simple and natural to let
> > balance_dirty_pages() fill the gap towards a total solution :-)
> > 
> > - add direct IO accounting at some convenient point in the IO
> >   path; either the submission or the completion point is fine
> > 
> > - change several lines of the buffered write IO controller to
> >   integrate the direct IO rate into the formula to fit the "total
> >   IO" limit
> > 
> > - in future, add more accounting as well as feedback control to make
> >   balance_dirty_pages() work with IOPS and disk time
> 
> To me, you seem to be not addressing the issues I've been raising at
> all and just repeating the same points again and again.  If I'm
> misunderstanding something, please point out.

Hopefully the renewed patch can address some of your questions. It's
a pity that I didn't think about the hierarchical requirement at the
time; otherwise, the complexity of the calculations would still look
manageable.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-19 14:23             ` Fengguang Wu
                               ` (2 preceding siblings ...)
  (?)
@ 2012-04-19 18:31             ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-19 18:31 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Thu, Apr 19, 2012 at 10:23:43PM +0800, Fengguang Wu wrote:

Hi Fengguang,

[..]
> > I don't know.  What problems?  AFAICS, the biggest issue is writeback
> > of different inodes getting mixed resulting in poor performance, but
> > if you think about it, that's about the frequency of switching cgroups
> > and a problem which can and should be dealt with from block layer
> > (e.g. use larger time slice if all the pending IOs are async).
> 
> Yeah, increasing the time slice would help that case. In general
> it's not merely the frequency of switching cgroups, if we take the
> hard disk's writeback cache into account. Think about some inodes
> with async IO: A1, A2, A3, ..., and inodes with sync IO: D1, D2, D3,
> ..., all from different cgroups. When the root cgroup holds all the
> async inodes, cfq may schedule IO in an interleaved fashion like this
> 
>         A1,    A1,    A1,    A2,    A1,    A2,    ...
>            D1,    D2,    D3,    D4,    D5,    D6, ...
> 
> Now it becomes
> 
>         A1,    A2,    A3,    A4,    A5,    A6,    ...
>            D1,    D2,    D3,    D4,    D5,    D6, ...
> 
> The difference is that it now switches async inodes each time. At the
> cfq level the seek costs look the same; however, the disk's writeback
> cache may help merge the data chunks from the same inode A1. Well, it
> may cost some latency on spinning disks. But how about SSDs? They can
> run deeper queues and benefit from large writes.

Not sure what the point is here. Many things seem to be mixed up.

If we start putting async queues in separate groups (in an attempt
to provide fairness/service differentiation), then how much IO we
dispatch from one async inode will directly depend on the slice time
of that cgroup/queue. So if you want longer dispatches from the same
async inode, increasing the slice time will help.

Also, the elevator merge logic increases the size of async IO
requests anyway, and big requests are submitted to the device.

If you are expecting every dispatch cycle to keep dispatching
requests from the same inode, no, that's not possible. Too large a
slice length in the presence of sync IO is not good either. So if you
are looking for high throughput at the cost of fairness, you can
switch to the mode where all async queues are put in the single root
group. (Note: you will have to switch between cgroups reasonably fast
so that all the cgroups are able to do some writeout in a given time
window.)

The writeback logic also submits a certain amount of writes from one
inode and then switches to the next inode in an attempt to provide
fairness. The same thing should be directly controllable via CFQ's
notion of time slice, that is, continue to dispatch async IO from a
cgroup/inode for an extended duration before switching. So what's the
difference? One can achieve equivalent behavior at either layer
(writeback/CFQ).

> 
> > Writeback's duty is generating stream of async writes which can be
> > served efficiently for the *cgroup* and keeping the buffer filled as
> > necessary and chaining the backpressure from there to the actual
> > dirtier.  That's what writeback does without cgroup.  Nothing
> > fundamental changes with cgroup.  It's just finer grained.
> 
> Believe me, physically partitioning the dirty pages and async IO
> streams comes at big costs. It won't scale well in many ways.
> 
> For one instance, splitting the request queues will give rise to
> PG_writeback pages.  Those pages have been the biggest source of
> latency issues in the various parts of the system.

So PG_writeback pages are the ones which have been submitted for IO? Even
now we generate PG_writeback pages across multiple inodes as we submit
those pages for IO. By keeping the number of request descriptors per
group low, we can build back pressure early, and hence per inode/group
we will not have too many PG_Writeback pages. IOW, the number of
PG_Writeback pages will be controllable by the number of request
descriptors. So how does the situation become worse if CFQ puts them in
separate cgroups?
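The request-descriptor back pressure described here can be sketched as a
bounded pool (a toy model; the class and the pool size are made up for
illustration):

```python
from collections import deque

# Toy model of per-group request descriptors: a small pool builds back
# pressure early, capping the number of in-flight (PG_writeback) pages.
class RequestQueue:
    def __init__(self, nr_descriptors):
        self.capacity = nr_descriptors
        self.inflight = deque()

    def submit(self, page):
        if len(self.inflight) >= self.capacity:
            return False             # congested: submitter must back off
        self.inflight.append(page)   # page is now PG_writeback
        return True

    def complete(self):
        if self.inflight:
            self.inflight.popleft()  # IO finished, PG_writeback cleared

q = RequestQueue(nr_descriptors=4)
accepted = sum(q.submit(page) for page in range(10))
print(accepted)   # -> 4: at most 4 pages in flight before back pressure
```

The smaller the per-group descriptor pool, the earlier submitters are
pushed back and the fewer PG_writeback pages can exist per group at once.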

> It's worth to note that running multiple flusher threads per bdi means
> not only disk seeks for spin disks, smaller IO size for SSD, but also
> lock contentions and cache bouncing for metadata heavy workloads and
> fast storage.

But we could still have a single flusher per bdi and just check the
write congestion state of each group and back off if it is congested.

So a single thread will still be doing IO submission. It's just that it
will submit IO from multiple inodes/cgroups, which can cause additional
seeks. That's the fairness tradeoff. What I am not able to understand is
how you avoid this tradeoff by implementing things in the writeback
layer. To achieve more fairness among groups, even a flusher thread will
have to switch faster among cgroups/inodes.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread



* Re: [RFC] writeback and cgroup
  2012-04-19 14:23             ` Fengguang Wu
  (?)
@ 2012-04-19 20:26               ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-19 20:26 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

On Thu 19-04-12 22:23:43, Wu Fengguang wrote:
> For one instance, splitting the request queues will give rise to
> PG_writeback pages.  Those pages have been the biggest source of
> latency issues in the various parts of the system.
  Well, if we allow more requests to be in flight in total, then yes, the
number of PG_Writeback pages can be higher as well.

> It's not uncommon for me to see filesystems sleep on PG_writeback
> pages during heavy writeback, within some lock or transaction, which in
> turn stall many tasks that try to do IO or merely dirty some page in
> memory. Random writes are especially susceptible to such stalls. The
> stable page feature also vastly increases the chances of stalls by
> locking the writeback pages. 
> 
> Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
> the case of direct reclaim, it means blocking random tasks that are
> allocating memory in the system.
> 
> PG_writeback pages are much worse than PG_dirty pages in that they are
> not movable. This makes a big difference for high-order page allocations.
> To make room for a 2MB huge page, vmscan has the option to migrate
> PG_dirty pages, but for PG_writeback it has no better choices than to
> wait for IO completion.
> 
> The difficulty of THP allocation goes up *exponentially* with the
> number of PG_writeback pages. Assume PG_writeback pages are randomly
> distributed in the physical memory space. Then we have formula
> 
>         P(reclaimable for THP) = 1 - P(hit PG_writeback)^256
  Well, this implicitly assumes that PG_Writeback pages are scattered
across memory uniformly at random. I'm not sure to what extent this is
true... Also, as a nitpick, this isn't really exponential growth, since
the exponent is fixed (256 - actually it should be 512, right?). It's just
a polynomial with a big exponent. But sure, growth in the number of
PG_Writeback pages will cause a relatively steep drop in the number of
available huge pages.
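For a concrete feel of how steep that polynomial is, here is a quick
calculation (assuming, as the quoted mail does, a 2MB block of contiguous
4KB pages and uniformly random PG_Writeback placement; the exponent 512
is the page count per 2MB block):

```python
# Probability that a 2MB-aligned block of 512 4KB pages contains no
# PG_writeback page, when such pages land uniformly at random with
# density p -- the model from the quoted mail.
def p_reclaimable(p, pages_per_thp=512):
    return (1 - p) ** pages_per_thp

for p in (0.0001, 0.001, 0.01):
    print(f"p={p}: P(block reclaimable) = {p_reclaimable(p):.4f}")
```

Even a 1% density of PG_Writeback pages leaves well under 1% of 2MB
blocks immediately reclaimable under this model, so the drop is steep
even though it is polynomial rather than exponential.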

...
> It's worth to note that running multiple flusher threads per bdi means
> not only disk seeks for spin disks, smaller IO size for SSD, but also
> lock contentions and cache bouncing for metadata heavy workloads and
> fast storage.
  Well, this heavily depends on the particular implementation (and the
chosen data structures). But yes, we should keep that in mind.

...
> > > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > > It's always there doing 1:1 proportional throttling. Then you try to
> > > kick in to add *double* throttling in block/cfq layer. Now the low
> > > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > > from its balanced state, leading to large fluctuations and program
> > > stalls.
> > 
> > Just do the same 1:1 inside each cgroup.
> 
> Sure. But the ratio mismatch I'm talking about is inter-cgroup.
> For example there are only 2 dd tasks doing buffered writes in the
> system. Now consider the mismatch that cfq is dispatching their IO
> requests at 10:1 weights, while balance_dirty_pages() is throttling
> the dd tasks at 1:1 equal split because it's not aware of the cgroup
> weights.
> 
> What will happen in the end? The 1:1 ratio imposed by
> balance_dirty_pages() will take effect and the dd tasks will progress
> at the same pace. The cfq weights will be defeated because the async
> queue for the second dd (and cgroup) constantly runs empty.
  Yup. This just shows that you have to have per-cgroup dirty limits. Once
you have those, things start working again.
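The mismatch in the quoted scenario can be sketched numerically (a toy
model; the weights, per-tick rates and tick count are invented): with a
single equal-split dirty throttle, the higher-weight queue constantly
runs empty and the achieved IO ratio collapses to 1:1.

```python
# Toy model of the quoted scenario: cfq would like to serve cgroup A
# vs B at 10:1, but balance_dirty_pages() throttles the two dirtiers
# equally, so A's async queue runs empty and the achieved IO ratio
# collapses to 1:1.  All units are invented.
def run(weights, ticks=10000):
    queue = [0] * len(weights)   # dirty pages queued per cgroup
    done = [0] * len(weights)    # pages written out per cgroup
    for _ in range(ticks):
        for i in range(len(weights)):
            queue[i] += 1        # equal-split throttle: 1 page/tick each
        for i, w in enumerate(weights):
            take = min(queue[i], w)   # cfq dispatches only what is queued
            queue[i] -= take
            done[i] += take
    return done

print(run([10, 1]))   # -> [10000, 10000]: the 10:1 weights are defeated
```

Giving each cgroup its own dirty limit, throttled in proportion to its
IO weight, would keep the high-weight queue filled so the dispatch ratio
can actually take effect.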

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-19 20:26               ` Jan Kara
  0 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-19 20:26 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Tejun Heo, Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman,
	andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu,
	lizefan, containers, cgroups, ctalbott, rni, lsf

On Thu 19-04-12 22:23:43, Wu Fengguang wrote:
> For one instance, splitting the request queues will give rise to
> PG_writeback pages.  Those pages have been the biggest source of
> latency issues in the various parts of the system.
  Well, if we allow more requests to be in flight in total then yes, number
of PG_Writeback pages can be higher as well.

> It's not uncommon for me to see filesystems sleep on PG_writeback
> pages during heavy writeback, within some lock or transaction, which in
> turn stall many tasks that try to do IO or merely dirty some page in
> memory.