* [RFC] writeback and cgroup
@ 2012-04-03 18:36 ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-03 18:36 UTC (permalink / raw)
  To: Fengguang Wu, Jan Kara, vgoyal, Jens Axboe
  Cc: linux-mm, sjayaraman, andrea, jmoyer, linux-fsdevel,
	linux-kernel, kamezawa.hiroyu, lizefan, containers, cgroups,
	ctalbott, rni, lsf

Hello, guys.

So, during LSF, Fengguang, Jan and I had a chance to sit down and talk
about how to add cgroup support to writeback.  Here's what I got from it.

Fengguang's opinion is that the throttling algorithm implemented in
writeback is good enough and blkcg parameters can be exposed to
writeback such that those limits can be applied from writeback.  As
for reads and direct IOs, Fengguang opined that the algorithm can
easily be extended to cover those cases, and IIUC all IOs, whether
buffered writes, reads or direct IOs, can eventually go through the
writeback layer, which will be the one layer controlling all IOs.

Unfortunately, I don't agree with that at all.  I think it's a gross
layering violation and lacks any longterm design.  We have a
well-working model of applying and propagating resource pressure - we
apply the pressure where the resource exists and propagate the back
pressure through buffers to upper layers up to the originator.  Think
about networking: the pressure exists or is applied at the in/egress
points and gets propagated through socket buffers, eventually
throttling the originator.

Writeback, without cgroup, isn't different.  It constitutes a part of
the pressure propagation chain anchored at the IO device.  IO devices
these days generate very high pressure, which gets propagated through
the IO sched and buffered requests, which in turn creates pressure at
writeback.  Here, the buffering happens in page cache and pressure at
writeback increases the amount of dirty page cache.  Propagating this
IO pressure to the dirtying task is one of the biggest
responsibilities of the writeback code, and this is the underlying
design of the whole thing.
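
To make that concrete, here is a very rough sketch of the existing,
non-cgroup feedback loop (illustrative only - the two dirty-accounting
helpers are made up, and the real balance_dirty_pages() is far more
involved):

    /*
     * Illustrative sketch, not actual code: the dirtying task keeps
     * going until dirty pages exceed the limit, then sleeps so that
     * it's throttled to roughly the device's writeout rate.
     */
    static void balance_dirty_pages_sketch(void)
    {
            /* global_dirty_pages()/global_dirty_limit() are made up */
            while (global_dirty_pages() > global_dirty_limit()) {
                    /*
                     * The device can't keep up; sleeping here is the
                     * back pressure reaching the originator.
                     */
                    io_schedule_timeout(msecs_to_jiffies(10));
            }
    }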

IIUC, without cgroup, the current writeback code works more or less
like this.  Throwing in cgroup doesn't really change the fundamental
design.  Instead of a single pipe going down, we just have multiple
pipes to the same device, each of which should be treated separately.
Of course, a spinning disk can't be divided that easily and the pipes'
performance characteristics will be inter-dependent, but the place to
solve that problem is where the problem is, the block layer.

We may have to look for optimizations and expose some details to
improve the overall behavior and such optimizations may require some
deviation from the fundamental design, but such optimizations should
be justified and such deviations kept at minimum, so, no, I don't
think we're gonna expose blkcg / block / elevator parameters
directly to writeback.  Unless someone can *really* convince me
otherwise, I'll be vetoing any change toward that direction.

Let's please keep the layering clear.  IO limitations will be applied
at the block layer and pressure will be formed there and then
propagated upwards eventually to the originator.  Sure, exposing the
whole information might result in better behavior for certain
workloads, but down the road, say, in three or five years, devices
which can be shared without worrying too much about seeks might be
commonplace and we could be swearing at a disgusting structural mess,
and sadly various cgroup support seems to be a prominent source of
such design failures.

IMHO, treating each cgroup - device/bdi pair as a separate device should
suffice as the underlying design.  After all, blkio cgroup support's
ultimate goal is dividing the IO resource into separate bins.
Implementation details might change as underlying technology changes
and we learn more about how to do it better but that is the goal which
we'll always try to keep close to.  Writeback should (be able to)
treat them as separate devices.  We surely will need adjustments and
optimizations to make things work at least somewhat reasonably but
that is the baseline.

In the discussion, the following obstacles to such an implementation
were identified.

* There are a lot of cases where IOs are issued by a task which isn't
  the originator.  I.e., writeback issues IOs for pages which are
  dirtied by some other tasks.  So, by the time an IO reaches the
  block layer, we don't know which cgroup the IO belongs to.

  Recently, the block layer has grown support to attach a task to a bio
  which causes the bio to be handled as if it were issued by the
  associated task regardless of the actual issuing task.  It currently
  only allows attaching %current to a bio - bio_associate_current() -
  but changing it to support other tasks is trivial.

  We'll need to update the async issuers to tag the IOs they issue but
  the mechanism is already there.
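
  As a rough sketch (bio_associate_current() exists today; the
  bio_associate_task() extension and the helper around it are made up
  for illustration), the issue path could look something like:

      /*
       * Illustrative only: charge the bio to the task which dirtied
       * the page rather than to the flusher thread submitting it.
       */
      static void wb_submit_bio(struct bio *bio, struct task_struct *dirtier)
      {
              if (dirtier == current)
                      bio_associate_current(bio);
              else
                      bio_associate_task(bio, dirtier);  /* hypothetical */

              submit_bio(WRITE, bio);
      }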

* There's a single request pool shared by all issuers per request
  queue.  This can lead to priority inversion among cgroups.  Note
  that the problem also exists without cgroups.  A lower-ioprio issuer
  may be holding a request, holding back a higher-prio issuer.

  We'll need to make request allocation cgroup (and hopefully ioprio)
  aware.  Probably in the form of separate request pools.  This will
  take some work but I don't think this will be too challenging.  I'll
  work on it.
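
  A rough sketch of the direction (everything below is made up, not
  actual code):

      /*
       * Illustrative only: one request_list per (request_queue, blkcg)
       * pair instead of a single shared one, so that one cgroup
       * exhausting its requests can't starve the others.
       */
      struct blkcg_request_pool {               /* hypothetical */
              struct request_list     rl;       /* free requests + counts */
              wait_queue_head_t       wait[2];  /* per-direction sleepers */
      };

      static struct request_list *request_list_for(struct request_queue *q,
                                                   struct bio *bio)
      {
              struct blkcg_request_pool *pool;

              pool = blkcg_pool_lookup(q, bio); /* hypothetical */
              return pool ? &pool->rl : &q->rq; /* fall back to shared pool */
      }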

* cfq cgroup policy throws all async IOs, which all buffered writes
  are, into the shared cgroup regardless of the actual cgroup.  This
  behavior is, I believe, mostly historical and changing it isn't
  difficult.  Probably only a few tens of lines of changes.  This may
  cause significant changes to actual IO behavior with cgroups, though.  I
  personally think the previous behavior was too wrong to keep (the
  weight was completely ignored for buffered writes) but we may want
  to introduce a switch to toggle between the two behaviors.

  Note that blk-throttle doesn't have this problem.
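
  The change would roughly be along these lines (layout approximated,
  not actual code):

      /*
       * Illustrative only: move the async queue pointers, which today
       * live in the per-device cfq_data and are shared by everyone,
       * into the per-cgroup cfq_group so that async IO is weighted per
       * cgroup just like sync IO.
       */
      struct cfq_group_sketch {
              /* ... existing per-group state: weight, service trees ... */

              /* one async queue per ioprio class/level, now per group */
              struct cfq_queue        *async_cfqq[2][IOPRIO_BE_NR];
              struct cfq_queue        *async_idle_cfqq;
      };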

* Unlike dirty data pages, metadata tends to have strict ordering
  requirements and thus is susceptible to priority inversion.  Two
  solutions were suggested - 1. allow overdrawing for metadata writes so
  that low prio metadata writes don't block the whole FS, 2. provide
  an interface to query and wait for bdi-cgroup congestion which can
  be called from FS metadata paths to throttle metadata operations
  before they enter the stream of ordered operations.

  I think a combination of the above two should be enough to solve
  the problem.  I *think* the second can be implemented as part of the
  cgroup-aware request allocation update.  The first one needs a bit
  more thinking, but there can be easier interim solutions (e.g. throw
  META writes to the head of the cgroup queue or just plain ignore
  cgroup limits for META writes) for now.
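
  For the second one, the filesystem-visible side could be as simple as
  something like this (both helpers are made up; the existing
  congestion_wait() machinery is the obvious model):

      /*
       * Illustrative only: neither helper exists today.  Throttle a
       * metadata operation *before* it joins the ordered stream, based
       * on how congested this cgroup's slice of the bdi is.
       */
      static void fs_throttle_metadata(struct backing_dev_info *bdi)
      {
              while (bdi_cgroup_congested(bdi, current))
                      bdi_cgroup_wait_congested(bdi, current, HZ / 10);
      }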

* I'm sure there are a lot of design choices to be made in the
  writeback implementation, but IIUC Jan seems to agree that the
  simplest would be to simply treat different cgroup-bdi pairs as
  completely separate, which shouldn't add too much complexity to the
  already intricate writeback code.

So, I think we have something which sounds like a plan, which at least
I can agree with and seems doable without adding a lot of complexity.

Jan, Fengguang, I'm pretty sure I missed some stuff from writeback's
side, and IIUC Fengguang doesn't agree too much with this approach, so
please voice your opinions & comments.

Thank you.

--
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread


* Re: [RFC] writeback and cgroup
@ 2012-04-04 14:51   ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-04 14:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:

Hi Tejun,

Thanks for the RFC and looking into this issue. Few thoughts inline.

[..]
> IIUC, without cgroup, the current writeback code works more or less
> like this.  Throwing in cgroup doesn't really change the fundamental
> design.  Instead of a single pipe going down, we just have multiple
> pipes to the same device, each of which should be treated separately.
> Of course, a spinning disk can't be divided that easily and their
> performance characteristics will be inter-dependent, but the place to
> solve that problem is where the problem is, the block layer.

How do you take care of throttling IO in the NFS case in this model?
The current throttling logic is tied to a block device, and in the case
of NFS there is no block device.

[..]
> In the discussion, for such implementation, the following obstacles
> were identified.
> 
> * There are a lot of cases where IOs are issued by a task which isn't
>   the originator.  I.e., writeback issues IOs for pages which are
>   dirtied by some other tasks.  So, by the time an IO reaches the
>   block layer, we don't know which cgroup the IO belongs to.
> 
>   Recently, block layer has grown support to attach a task to a bio
>   which causes the bio to be handled as if it were issued by the
>   associated task regardless of the actual issuing task.  It currently
>   only allows attaching %current to a bio - bio_associate_current() -
>   but changing it to support other tasks is trivial.
> 
>   We'll need to update the async issuers to tag the IOs they issue but
>   the mechanism is already there.

Most likely this tagging will take place in "struct page" and I am not
sure if we will be allowed to grow the size of "struct page" for this reason.

> 
> * There's a single request pool shared by all issuers per a request
>   queue.  This can lead to priority inversion among cgroups.  Note
>   that problem also exists without cgroups.  Lower ioprio issuer may
>   be holding a request holding back highprio issuer.
> 
>   We'll need to make request allocation cgroup (and hopefully ioprio)
>   aware.  Probably in the form of separate request pools.  This will
>   take some work but I don't think this will be too challenging.  I'll
>   work on it.

This should be doable.  I had implemented it long back with a single
request pool but internal limits for each group, i.e. block a task in
the group once the group has enough pending requests allocated from the
pool.  But separate request pools should work equally well.

It just conflicts a bit with the current definition of q->nr_requests,
which specifies the total number of outstanding requests on the queue.
Once you make the pools per group, I guess this limit will have to be
transformed into a per-group upper limit.
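
A sketch of what I mean by internal limits (names below are invented
for illustration):

    /*
     * Illustrative only: keep the single shared pool but block an
     * allocating task once its group already holds its share of
     * q->nr_requests.  The blkio_group field and helper are made up.
     */
    static bool group_may_alloc_request(struct request_queue *q,
                                        struct blkio_group *blkg)
    {
            unsigned int share = q->nr_requests / nr_active_groups(q);

            return blkg->nr_allocated < share;
    }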

> 
> * cfq cgroup policy throws all async IOs, which all buffered writes
>   are, into the shared cgroup regardless of the actual cgroup.  This
>   behavior is, I believe, mostly historical and changing it isn't
>   difficult.  Prolly only few tens of lines of changes.  This may
>   cause significant changes to actual IO behavior with cgroups tho.  I
>   personally think the previous behavior was too wrong to keep (the
>   weight was completely ignored for buffered writes) but we may want
>   to introduce a switch to toggle between the two behaviors.

I had kept all buffered writes in the same cgroup (root cgroup) for a
few reasons.

- Because of the single request descriptor pool for writes, one writer
  gets backlogged behind another anyway.  So creating separate async
  queues per group is not going to help.

- The writeback logic was not cgroup aware, so it might not send enough
  IO from each writer to maintain parallelism.  Creating separate async
  queues did not make sense till that was fixed.

- As you said, it is also historical.  We prioritize READS at the
  expense of writes.  Now, by putting buffered/async writes in a
  separate group, we might end up prioritizing one group's async writes
  over another group's synchronous reads.  How many people really want
  that behavior?  To me, keeping service differentiation among the sync
  IO matters most.  Even if all async IO is treated the same, I guess
  not many people would care.

> 
>   Note that blk-throttle doesn't have this problem.

I am not sure what you are trying to say here.  But primarily
blk-throttle will throttle reads and direct IO.  Buffered writes will
go to the root cgroup, which is typically unthrottled.

> 
> * Unlike dirty data pages, metadata tends to have strict ordering
>   requirements and thus is susceptible to priority inversion.  Two
>   solutions were suggested - 1. allow overdrawing for metadata writes so
>   that low prio metadata writes don't block the whole FS, 2. provide
>   an interface to query and wait for bdi-cgroup congestion which can
>   be called from FS metadata paths to throttle metadata operations
>   before they enter the stream of ordered operations.

That will probably mean changing the order of operations as well.
IIUC, in the case of fsync (ordered mode), we open a metadata
transaction first, then try to flush all the cached data and then flush
the metadata.  So if fsync is throttled, all the metadata operations
behind it will get serialized for ext3/ext4.

So you seem to be suggesting that we change the design so that a
metadata operation is not thrown into the ordered stream till we have
finished writing all the data back to disk?  I am not a filesystem
developer, so I don't know how feasible this change is.

This is just one of the points.  In the past, while talking to Dave
Chinner, he mentioned that in XFS, if two cgroups fall into the same
allocation group, there are cases where the IO of one cgroup can get
serialized behind the other's.

In general, the core of the issue is that filesystems are not cgroup
aware, and if you do throttling below the filesystems, then invariably
one or another serialization issue will come up, and I am concerned
that we will be constantly fixing those serialization issues.  Or the
design point could be so central to the filesystem's design that it
can't be changed.

In general, if you do throttling deeper in the stack and build back
pressure, then all the layers sitting above should be cgroup aware to
avoid problems.  Two layers identified so far are writeback and
filesystems.  Is it really worth the complexity?  How about doing
throttling in higher layers, when IO is entering the kernel, and
keeping the proportional IO logic at the lowest level, so that the
current mechanism of building pressure continues to work?

Why split it that way?  Proportional IO logic is work conserving, so
even if some serialization happens, that situation should clear up
pretty soon: IO from other cgroups will dry up, IO from the group
causing the serialization will make progress, and at most we will lose
fairness for a certain duration.

With throttling, the limits come from the user and one can set really
low artificial limits.  So even if the underlying resources are free,
the IO from a throttled cgroup might not make any progress, in turn
choking every other cgroup which is serialized behind it.

So in general, throttling at the block layer and building back pressure
is fine.  I am concerned about two cases.

- How to handle NFS.
- Do filesystem developers agree with this approach and are they willing
  to address any serialization issues arising due to this design.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
@ 2012-04-04 15:36     ` Steve French
  0 siblings, 0 replies; 262+ messages in thread
From: Steve French @ 2012-04-04 15:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, ctalbott, rni, andrea, containers, linux-kernel, lsf,
	linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups

On Wed, Apr 4, 2012 at 9:51 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
>
> Hi Tejun,
>
> Thanks for the RFC and looking into this issue. Few thoughts inline.
>
> [..]
>> IIUC, without cgroup, the current writeback code works more or less
>> like this.  Throwing in cgroup doesn't really change the fundamental
>> design.  Instead of a single pipe going down, we just have multiple
>> pipes to the same device, each of which should be treated separately.
>> Of course, a spinning disk can't be divided that easily and their
>> performance characteristics will be inter-dependent, but the place to
>> solve that problem is where the problem is, the block layer.
>
> How do you take care of throttling IO to NFS case in this model? Current
> throttling logic is tied to block device and in case of NFS, there is no
> block device.

Similarly, smb2 gets congestion info (the number of "credits") returned
from the server on every response - but I'm not sure why congestion
control is tied to the block device when this would create problems for
network file systems.

-- 
Thanks,

Steve


^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-03 18:36 ` Tejun Heo
@ 2012-04-04 17:51     ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-04 17:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers,
	linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, linux-fsdevel,
	cgroups, vgoyal

Hi Tejun,

On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
> Hello, guys.
> 
> So, during LSF, I, Fengguang and Jan had a chance to sit down and talk
> about how to cgroup support to writeback.  Here's what I got from it.
> 
> Fengguang's opinion is that the throttling algorithm implemented in
> writeback is good enough and blkcg parameters can be exposed to
> writeback such that those limits can be applied from writeback.  As
> for reads and direct IOs, Fengguang opined that the algorithm can
> easily be extended to cover those cases and IIUC all IOs, whether
> buffered writes, reads or direct IOs can eventually all go through
> writeback layer which will be the one layer controlling all IOs.
 
Yeah, it should be trivial to apply the balance_dirty_pages()
throttling algorithm to reads/direct IOs.  However, up to now I don't
see much added value in *duplicating* the current block IO controller
functionality, assuming the current users and developers are happy
with it.

I did the buffered write IO controller mainly to fill the gap.  If I
happen to stand in your way, sorry, that's not my intention.  It's a
pity and a surprise that Google, as a big user, does not buy into this
simple solution.  You may prefer more comprehensive controls which may
not be easily achievable with the simple scheme.  However, the
complexities and overheads involved in throttling the flusher IOs
really upset me.

The sweet split point would be for balance_dirty_pages() to do
cgroup-aware buffered write throttling and leave other IOs to the
current blkcg.  For this to work well as a total solution for end
users, I hope we can cooperate and figure out ways for the two
throttling entities to work well with each other.

What I'm interested is, what's Google and other users' use schemes in
practice. What's their desired interfaces. Whether and how the
combined bdp+blkcg throttling can fulfill the goals.

> Unfortunately, I don't agree with that at all.  I think it's a gross
> layering violation and lacks any longterm design.  We have a well
> working model of applying and propagating resource pressure - we apply
> the pressure where the resource exists and propagates the back
> pressure through buffers to upper layers upto the originator.  Think
> about network, the pressure exists or is applied at the in/egress
> points which gets propagated through socket buffers and eventually
> throttles the originator.
> 
> Writeback, without cgroup, isn't different.  It consists a part of the
> pressure propagation chain anchored at the IO device.  IO devices
> these days generate very high pressure, which gets propgated through
> the IO sched and buffered requests, which in turn creates pressure at
> writeback.  Here, the buffering happens in page cache and pressure at
> writeback increases the amount of dirty page cache.  Propagating this
> IO pressure to the dirtying task is one of the biggest
> responsibililties of the writeback code, and this is the underlying
> design of the whole thing.
> 
> IIUC, without cgroup, the current writeback code works more or less
> like this.  Throwing in cgroup doesn't really change the fundamental
> design.  Instead of a single pipe going down, we just have multiple
> pipes to the same device, each of which should be treated separately.
> Of course, a spinning disk can't be divided that easily and their
> performance characteristics will be inter-dependent, but the place to
> solve that problem is where the problem is, the block layer.
> 
> We may have to look for optimizations and expose some details to
> improve the overall behavior and such optimizations may require some
> deviation from the fundamental design, but such optimizations should
> be justified and such deviations kept at minimum, so, no, I don't
> think we're gonna be expose blkcg / block / elevator parameters
> directly to writeback.  Unless someone can *really* convince me
> otherwise, I'll be vetoing any change toward that direction.
> 
> Let's please keep the layering clear.  IO limitations will be applied
> at the block layer and pressure will be formed there and then
> propagated upwards eventually to the originator.  Sure, exposing the
> whole information might result in better behavior for certain
> workloads, but down the road, say, in three or five years, devices
> which can be shared without worrying too much about seeks might be
> commonplace and we could be swearing at a disgusting structural mess,
> and sadly various cgroup support seems to be a prominent source of
> such design failures.

Super fast storages are coming which will make us regret to make the
IO path over complex.  Spinning disks are not going away anytime soon.
I doubt Google is willing to afford the disk seek costs on its
millions of disks and has the patience to wait until switching all of
the spin disks to SSD years later (if it will ever happen).

Sorry, I won't buy in the layering arguments and analog to networking.
Yeah network is a good way to show your "push back" idea, however
writeback has its own metadata, seeking, etc. problems.

I'd prefer we base our discussions on real things like complexities,
overheads, performance as well as user demands.

It's obvious that your below proposal involves a lot of complexities,
overheads, and will hurt performance. It basically involves

- running concurrent flusher threads for cgroups, which adds back the
  disk seeks and lock contentions. And still has problems with sync
  and shared inodes.

- splitting device queue for cgroups, possibly scaling up the pool of
  writeback pages (and locked pages in the case of stable pages) which
  could stall random processes in the system

- the mess of metadata handling

- unnecessarily coupled with memcg, in order to take advantage of the
  per-memcg dirty limits for balance_dirty_pages() to actually convert
  the "pushed back" dirty pages pressure into lowered dirty rate. Why
  the hell the users *have to* setup memcg (suffering from all the
  inconvenience and overheads) in order to do IO throttling?  Please,
  this is really ugly! And the "back pressure" may constantly push the
  memcg dirty pages to the limits. I'm not going to support *miss use*
  of per-memcg dirty limits like this!

I cannot believe you would keep overlooking all the problems without
good reasons. Please do tell us the reasons that matter.

Thanks,
Fengguang

> IMHO, treating cgroup - device/bdi pair as a separate device should
> suffice as the underlying design.  After all, blkio cgroup support's
> ultimate goal is dividing the IO resource into separate bins.
> Implementation details might change as underlying technology changes
> and we learn more about how to do it better but that is the goal which
> we'll always try to keep close to.  Writeback should (be able to)
> treat them as separate devices.  We surely will need adjustments and
> optimizations to make things work at least somewhat reasonably but
> that is the baseline.
> 
> In the discussion, for such implementation, the following obstacles
> were identified.
> 
> * There are a lot of cases where IOs are issued by a task which isn't
>   the originiator.  ie. Writeback issues IOs for pages which are
>   dirtied by some other tasks.  So, by the time an IO reaches the
>   block layer, we don't know which cgroup the IO belongs to.
> 
>   Recently, block layer has grown support to attach a task to a bio
>   which causes the bio to be handled as if it were issued by the
>   associated task regardless of the actual issuing task.  It currently
>   only allows attaching %current to a bio - bio_associate_current() -
>   but changing it to support other tasks is trivial.
> 
>   We'll need to update the async issuers to tag the IOs they issue but
>   the mechanism is already there.
> 
> * There's a single request pool shared by all issuers per a request
>   queue.  This can lead to priority inversion among cgroups.  Note
>   that problem also exists without cgroups.  Lower ioprio issuer may
>   be holding a request holding back highprio issuer.
> 
>   We'll need to make request allocation cgroup (and hopefully ioprio)
>   aware.  Probably in the form of separate request pools.  This will
>   take some work but I don't think this will be too challenging.  I'll
>   work on it.
> 
> * cfq cgroup policy throws all async IOs, which all buffered writes
>   are, into the shared cgroup regardless of the actual cgroup.  This
>   behavior is, I believe, mostly historical and changing it isn't
>   difficult.  Prolly only few tens of lines of changes.  This may
>   cause significant changes to actual IO behavior with cgroups tho.  I
>   personally think the previous behavior was too wrong to keep (the
>   weight was completely ignored for buffered writes) but we may want
>   to introduce a switch to toggle between the two behaviors.
> 
>   Note that blk-throttle doesn't have this problem.
> 
> * Unlike dirty data pages, metadata tends to have strict ordering
>   requirements and thus is susceptible to priority inversion.  Two
>   solutions were suggested - 1. allow overdrawl for metadata writes so
>   that low prio metadata writes don't block the whole FS, 2. provide
>   an interface to query and wait for bdi-cgroup congestion which can
>   be called from FS metadata paths to throttle metadata operations
>   before they enter the stream of ordered operations.
> 
>   I think combination of the above two should be enough for solving
>   the problem.  I *think* the second can be implemented as part of
>   cgroup aware request allocation update.  The first one needs a bit
>   more thinking but there can be easier interim solutions (e.g. throw
>   META writes to the head of the cgroup queue or just plain ignore
>   cgroup limits for META writes) for now.
> 
> * I'm sure there are a lot of design choices to be made in the
>   writeback implementation but IIUC Jan seems to agree that the
>   simplest would be simply deal different cgroup-bdi pairs as
>   completely separate which shouldn't add too much complexity to the
>   already intricate writeback code.
> 
> So, I think we have something which sounds like a plan, which at least
> I can agree with and seems doable without adding a lot of complexity.
> 
> Jan, Fengguang, I'm pretty sure I missed some stuff from writeback's
> side and IIUC Fengguang doesn't agree with this approach too much, so
> please voice your opinions & comments.
> 
> Thank you.
> 
> --
> tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-04 17:51     ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-04 17:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hi Tejun,

On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
> Hello, guys.
> 
> So, during LSF, I, Fengguang and Jan had a chance to sit down and talk
> about how to cgroup support to writeback.  Here's what I got from it.
> 
> Fengguang's opinion is that the throttling algorithm implemented in
> writeback is good enough and blkcg parameters can be exposed to
> writeback such that those limits can be applied from writeback.  As
> for reads and direct IOs, Fengguang opined that the algorithm can
> easily be extended to cover those cases and IIUC all IOs, whether
> buffered writes, reads or direct IOs can eventually all go through
> writeback layer which will be the one layer controlling all IOs.
 
Yeah, it should be trivial to apply the balance_dirty_pages()
throttling algorithm to reads and direct IOs. However, so far I don't
see much added value in *duplicating* the current block IO
controller's functionality, assuming its current users and developers
are happy with it.

I did the buffered write IO controller mainly to fill the gap.  If I
happen to stand in your way, sorry, that was not my intention.  It's a
pity and a surprise that Google, as a big user, does not buy into this
simple solution. You may prefer more comprehensive controls which may
not be easily achievable with the simple scheme. However, the
complexity and overhead involved in throttling the flusher IOs really
upset me.

The sweet split point would be for balance_dirty_pages() to do cgroup
aware buffered write throttling and leave other IOs to the current
blkcg. For this to work well as a total solution for end users, I hope
we can cooperate and figure out ways for the two throttling entities
to work well with each other.
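
As a very rough sketch of what the buffered-write half of that split
could look like (illustrative only; blkcg_buffered_write_bps() and
task_dirty_rate_bps() are made-up helpers, not existing interfaces):

#include <linux/sched.h>
#include <linux/math64.h>
#include <linux/writeback.h>

#define MAX_DIRTY_PAUSE		(HZ / 5)	/* assumed cap on one pause */

/*
 * Hypothetical: pause the dirtying task so that its cgroup's
 * buffered-write rate converges to the configured limit, instead of
 * throttling the flusher's IO further down at the block layer.
 */
static void cgroup_aware_dirty_pause(unsigned long pages_dirtied)
{
	u64 limit = blkcg_buffered_write_bps(current);	/* made up */
	u64 rate = task_dirty_rate_bps(current);	/* made up */
	unsigned long pause;

	if (!limit || rate <= limit)
		return;

	/* sleep long enough for the dirtied bytes to fit under the limit */
	pause = div64_u64((u64)pages_dirtied * PAGE_SIZE * HZ, limit);
	pause = min_t(unsigned long, pause, MAX_DIRTY_PAUSE);

	__set_current_state(TASK_KILLABLE);
	io_schedule_timeout(pause);
}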

What I'm interested in is what Google's and other users' usage
schemes look like in practice, what their desired interfaces are, and
whether and how the combined bdp+blkcg throttling can fulfill those
goals.

> Unfortunately, I don't agree with that at all.  I think it's a gross
> layering violation and lacks any longterm design.  We have a well
> working model of applying and propagating resource pressure - we apply
> the pressure where the resource exists and propagates the back
> pressure through buffers to upper layers upto the originator.  Think
> about network, the pressure exists or is applied at the in/egress
> points which gets propagated through socket buffers and eventually
> throttles the originator.
> 
> Writeback, without cgroup, isn't different.  It consists a part of the
> pressure propagation chain anchored at the IO device.  IO devices
> these days generate very high pressure, which gets propgated through
> the IO sched and buffered requests, which in turn creates pressure at
> writeback.  Here, the buffering happens in page cache and pressure at
> writeback increases the amount of dirty page cache.  Propagating this
> IO pressure to the dirtying task is one of the biggest
> responsibililties of the writeback code, and this is the underlying
> design of the whole thing.
> 
> IIUC, without cgroup, the current writeback code works more or less
> like this.  Throwing in cgroup doesn't really change the fundamental
> design.  Instead of a single pipe going down, we just have multiple
> pipes to the same device, each of which should be treated separately.
> Of course, a spinning disk can't be divided that easily and their
> performance characteristics will be inter-dependent, but the place to
> solve that problem is where the problem is, the block layer.
> 
> We may have to look for optimizations and expose some details to
> improve the overall behavior and such optimizations may require some
> deviation from the fundamental design, but such optimizations should
> be justified and such deviations kept at minimum, so, no, I don't
> think we're gonna be expose blkcg / block / elevator parameters
> directly to writeback.  Unless someone can *really* convince me
> otherwise, I'll be vetoing any change toward that direction.
> 
> Let's please keep the layering clear.  IO limitations will be applied
> at the block layer and pressure will be formed there and then
> propagated upwards eventually to the originator.  Sure, exposing the
> whole information might result in better behavior for certain
> workloads, but down the road, say, in three or five years, devices
> which can be shared without worrying too much about seeks might be
> commonplace and we could be swearing at a disgusting structural mess,
> and sadly various cgroup support seems to be a prominent source of
> such design failures.

Super fast storage is coming, which will make us regret making the IO
path overly complex.  Spinning disks are not going away anytime soon,
either.  I doubt Google is willing to afford the disk seek costs on
its millions of disks, or has the patience to wait until all of the
spinning disks are switched to SSDs years from now (if that ever
happens).

Sorry, I won't buy into the layering arguments and the analogy to
networking.  Yes, networking is a good way to illustrate your "push
back" idea; however, writeback has its own metadata, seeking, etc.
problems.

I'd prefer we base our discussions on real things like complexity,
overhead and performance, as well as user demands.

It's obvious that your proposal below involves a lot of complexity
and overhead, and will hurt performance. It basically involves

- running concurrent flusher threads for cgroups, which adds back the
  disk seeks and lock contention, and still has problems with sync
  and shared inodes.

- splitting the device queue among cgroups, possibly scaling up the
  pool of writeback pages (and locked pages in the case of stable
  pages), which could stall random processes in the system

- the mess of metadata handling

- unnecessary coupling with memcg, in order to take advantage of the
  per-memcg dirty limits for balance_dirty_pages() to actually convert
  the "pushed back" dirty page pressure into a lowered dirty rate. Why
  the hell do users *have to* set up memcg (suffering all of its
  inconvenience and overhead) in order to do IO throttling?  Please,
  this is really ugly! And the "back pressure" may constantly push the
  memcg dirty pages to their limits. I'm not going to support *misuse*
  of per-memcg dirty limits like this!

I cannot believe you would keep overlooking all the problems without
good reasons. Please do tell us the reasons that matter.

Thanks,
Fengguang

> IMHO, treating cgroup - device/bdi pair as a separate device should
> suffice as the underlying design.  After all, blkio cgroup support's
> ultimate goal is dividing the IO resource into separate bins.
> Implementation details might change as underlying technology changes
> and we learn more about how to do it better but that is the goal which
> we'll always try to keep close to.  Writeback should (be able to)
> treat them as separate devices.  We surely will need adjustments and
> optimizations to make things work at least somewhat reasonably but
> that is the baseline.
> 
> In the discussion, for such implementation, the following obstacles
> were identified.
> 
> * There are a lot of cases where IOs are issued by a task which isn't
>   the originiator.  ie. Writeback issues IOs for pages which are
>   dirtied by some other tasks.  So, by the time an IO reaches the
>   block layer, we don't know which cgroup the IO belongs to.
> 
>   Recently, block layer has grown support to attach a task to a bio
>   which causes the bio to be handled as if it were issued by the
>   associated task regardless of the actual issuing task.  It currently
>   only allows attaching %current to a bio - bio_associate_current() -
>   but changing it to support other tasks is trivial.
> 
>   We'll need to update the async issuers to tag the IOs they issue but
>   the mechanism is already there.
> 
> * There's a single request pool shared by all issuers per a request
>   queue.  This can lead to priority inversion among cgroups.  Note
>   that problem also exists without cgroups.  Lower ioprio issuer may
>   be holding a request holding back highprio issuer.
> 
>   We'll need to make request allocation cgroup (and hopefully ioprio)
>   aware.  Probably in the form of separate request pools.  This will
>   take some work but I don't think this will be too challenging.  I'll
>   work on it.
> 
> * cfq cgroup policy throws all async IOs, which all buffered writes
>   are, into the shared cgroup regardless of the actual cgroup.  This
>   behavior is, I believe, mostly historical and changing it isn't
>   difficult.  Prolly only few tens of lines of changes.  This may
>   cause significant changes to actual IO behavior with cgroups tho.  I
>   personally think the previous behavior was too wrong to keep (the
>   weight was completely ignored for buffered writes) but we may want
>   to introduce a switch to toggle between the two behaviors.
> 
>   Note that blk-throttle doesn't have this problem.
> 
> * Unlike dirty data pages, metadata tends to have strict ordering
>   requirements and thus is susceptible to priority inversion.  Two
>   solutions were suggested - 1. allow overdrawl for metadata writes so
>   that low prio metadata writes don't block the whole FS, 2. provide
>   an interface to query and wait for bdi-cgroup congestion which can
>   be called from FS metadata paths to throttle metadata operations
>   before they enter the stream of ordered operations.
> 
>   I think combination of the above two should be enough for solving
>   the problem.  I *think* the second can be implemented as part of
>   cgroup aware request allocation update.  The first one needs a bit
>   more thinking but there can be easier interim solutions (e.g. throw
>   META writes to the head of the cgroup queue or just plain ignore
>   cgroup limits for META writes) for now.
> 
> * I'm sure there are a lot of design choices to be made in the
>   writeback implementation but IIUC Jan seems to agree that the
>   simplest would be simply deal different cgroup-bdi pairs as
>   completely separate which shouldn't add too much complexity to the
>   already intricate writeback code.
> 
> So, I think we have something which sounds like a plan, which at least
> I can agree with and seems doable without adding a lot of complexity.
> 
> Jan, Fengguang, I'm pretty sure I missed some stuff from writeback's
> side and IIUC Fengguang doesn't agree with this approach too much, so
> please voice your opinions & comments.
> 
> Thank you.
> 
> --
> tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-04 17:51     ` Fengguang Wu
@ 2012-04-04 18:35       ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-04 18:35 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:

[..]
> The sweet split point would be for balance_dirty_pages() to do cgroup
> aware buffered write throttling and leave other IOs to the current
> blkcg. For this to work well as a total solution for end users, I hope
> we can cooperate and figure out ways for the two throttling entities
> to work well with each other.

Throttling read + direct IO higher up has a few issues too. Users
will not like that a task gets blocked when it tries to submit a read
from a throttled group. The current async behavior works well: we
queue up the bio from the task in the throttled group and let the
task do other things. The same is true for AIO, where we would not
like to block in bio submission.
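
To make the contrast concrete, here is a much simplified, purely
illustrative sketch (the io_group structure is a toy, not blk-throttle's
internals) of the "queue the bio and let the task continue" behavior that
block-level throttling preserves for async submitters:

#include <linux/bio.h>

/* Illustrative only: a toy per-cgroup group, not blk-throttle's internals. */
struct io_group {
	u64		bps_limit;	/* configured bytes per second */
	u64		bytes_disp;	/* bytes dispatched in the current slice */
	u64		slice_bytes;	/* allowance for the current slice */
	struct bio_list	queued;		/* held-back bios; submitter keeps running */
};

/* Returns true if the bio was parked for later dispatch by a worker. */
static bool group_throttle_bio(struct io_group *grp, struct bio *bio)
{
	if (grp->bytes_disp + bio->bi_size <= grp->slice_bytes) {
		grp->bytes_disp += bio->bi_size;
		return false;		/* within budget, dispatch immediately */
	}
	/* over budget: queue the bio instead of blocking the submitter */
	bio_list_add(&grp->queued, bio);
	return true;
}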

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]   ` <20120404145134.GC12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2012-04-04 15:36     ` Steve French
@ 2012-04-04 18:49     ` Tejun Heo
  2012-04-07  8:00     ` Jan Kara
  2 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-04 18:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hey, Vivek.

On Wed, Apr 04, 2012 at 10:51:34AM -0400, Vivek Goyal wrote:
> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
> > IIUC, without cgroup, the current writeback code works more or less
> > like this.  Throwing in cgroup doesn't really change the fundamental
> > design.  Instead of a single pipe going down, we just have multiple
> > pipes to the same device, each of which should be treated separately.
> > Of course, a spinning disk can't be divided that easily and their
> > performance characteristics will be inter-dependent, but the place to
> > solve that problem is where the problem is, the block layer.
> 
> How do you take care of throttling IO in the NFS case in this model? The
> current throttling logic is tied to the block device, and in the case of
> NFS there is no block device.

In principle, I don't think it has to be any different.  A
filesystem's interface to the underlying device is through the bdi.
If a fs is block backed, block pressure should be propagated through
the bdi, which should be mostly trivial.  If a fs is network backed,
we can implement a mechanism for network-backed bdis, so that they can
relay the pressure from the server side to the local fs users.

That said, network filesystems often show different behaviors and use
different mechanisms for various reasons and it wouldn't be too
surprising if something different would fit them better here or we
might need something supplemental to the usual mechanism.

> [..]
> > In the discussion, for such implementation, the following obstacles
> > were identified.
> > 
> > * There are a lot of cases where IOs are issued by a task which isn't
> >   the originiator.  ie. Writeback issues IOs for pages which are
> >   dirtied by some other tasks.  So, by the time an IO reaches the
> >   block layer, we don't know which cgroup the IO belongs to.
> > 
> >   Recently, block layer has grown support to attach a task to a bio
> >   which causes the bio to be handled as if it were issued by the
> >   associated task regardless of the actual issuing task.  It currently
> >   only allows attaching %current to a bio - bio_associate_current() -
> >   but changing it to support other tasks is trivial.
> > 
> >   We'll need to update the async issuers to tag the IOs they issue but
> >   the mechanism is already there.
> 
> Most likely this tagging will take place in "struct page" and I am not
> sure if we will be allowed to grow size of "struct page" for this reason.

With memcg enabled, we are already doing that and IIUC Jan and
Fengguang think that using inode granularity should be good enough for
writeback blaming.
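
A sketch of where that points (assuming a bio_associate_task() variant of
the existing bio_associate_current(), plus a hypothetical
inode_dirty_owner() helper returning whoever writeback blames for the
inode):

#include <linux/bio.h>
#include <linux/fs.h>

/*
 * Hypothetical flusher path: the bio gets tagged with the task (and
 * thus cgroup) blamed for dirtying the inode, so the block layer
 * charges the IO to the right group even though the flusher submits it.
 */
static void writeback_submit_blamed(struct inode *inode, struct bio *bio,
				    int rw)
{
	struct task_struct *owner = inode_dirty_owner(inode);	/* made up */

	if (owner)
		bio_associate_task(bio, owner);		/* assumed extension */
	submit_bio(rw, bio);
}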

> > * There's a single request pool shared by all issuers per a request
> >   queue.  This can lead to priority inversion among cgroups.  Note
> >   that problem also exists without cgroups.  Lower ioprio issuer may
> >   be holding a request holding back highprio issuer.
> > 
> >   We'll need to make request allocation cgroup (and hopefully ioprio)
> >   aware.  Probably in the form of separate request pools.  This will
> >   take some work but I don't think this will be too challenging.  I'll
> >   work on it.
> 
> This should be doable. I had implemented it long back with single request
> pool but internal limits for each group. That is block the task in the
> group if group has enough pending requests allocated from the pool. But
> separate request pool should work equally well. 
> 
> Just that it conflits a bit with current definition of q->nr_requests.
> Which specifies number of total outstanding requests on the queue. Once
> you make the pool per queue, I guess this limit will have to be
> transformed into per group upper limit.

I'm not sure about the details yet.  I *think* the suckiest part is
the actual allocation.  We're deferring the cgroup - request_queue
association until actual usage and depending on atomic allocations to
create those associations on the IO path.  Doing the same for requests
might not be too pleasant.  Hmm....  allocation failure handling on
that path is already broken BTW.  Maybe we just need to get the
fallback behavior properly working.  Unsure.
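
Very roughly, and only as an illustration of the idea rather than the
eventual implementation, "separate request pools" could look like a small
per-(cgroup, queue) structure taking over the role of the single shared
pool and q->nr_requests:

#include <linux/blkdev.h>
#include <linux/mempool.h>
#include <linux/wait.h>

/*
 * Illustrative only; none of these names are real.  Each cgroup -
 * request_queue association gets its own small request pool and its
 * own limit, playing the role the shared pool and q->nr_requests play
 * today.
 */
struct blkcg_request_pool {
	mempool_t		*pool;		/* requests reserved for this group */
	int			nr_requests;	/* per-group upper limit */
	int			count;		/* currently allocated */
	wait_queue_head_t	wait;		/* tasks waiting for a free request */
};

/* Hypothetical allocation path: a full group only blocks its own tasks. */
static struct request *blkcg_get_request(struct blkcg_request_pool *rp,
					 gfp_t gfp_mask)
{
	if (rp->count >= rp->nr_requests)
		return NULL;	/* caller sleeps on rp->wait, not on other groups */
	rp->count++;
	return mempool_alloc(rp->pool, gfp_mask);
}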

> > * cfq cgroup policy throws all async IOs, which all buffered writes
> >   are, into the shared cgroup regardless of the actual cgroup.  This
> >   behavior is, I believe, mostly historical and changing it isn't
> >   difficult.  Prolly only few tens of lines of changes.  This may
> >   cause significant changes to actual IO behavior with cgroups tho.  I
> >   personally think the previous behavior was too wrong to keep (the
> >   weight was completely ignored for buffered writes) but we may want
> >   to introduce a switch to toggle between the two behaviors.
> 
> I had kept all buffered writes in in same cgroup (root cgroup) for few
> reasons.
> 
> - Because of single request descriptor pool for writes, anyway one writer
>   gets backlogged behind other. So creating separate async queues per
>   group is not going to help.
> 
> - Writeback logic was not cgroup aware. So it might not send enough IO
>   from each writer to maintain parallelism. So creating separate async
>   queues did not make sense till that was fixed.

Yeah, the above is why I find the "buffered writes need separate
controls because cfq doesn't distinguish async writes" argument very
ironic.  We introduce one quirk to compensate for shortcomings in the
other part, and then later work around that quirk in that other part?
I mean, that's just twisted.

> - As you said, it is historical also. We prioritize READS at the expense
>   of writes. Now by putting buffered/async writes in a separate group, we
>   might end up prioritizing a group's async write over another group's
>   synchronous read. How many people really want that behavior? To me
>   keeping service differentiation among the sync IO matters most. Even
>   if all async IO is treated same, I guess not many people might care.

While segregation of async IOs may not matter in some cases, it does
matter to many other use cases, so it seems to me that we hard-coded
that optimization decision without thinking too much about it.  For a
lot of container-type use cases, the current implementation is nearly
useless (I know of cases where people are explicitly patching in
separate async queues).  At the same time, switching the default
behavior *may* disturb some of the current users and that's why I'm
thinking about having a switch for the new behavior.
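
For illustration only (these are simplified stand-in structures with
made-up fields, not the actual cfq patch), the change essentially boils
down to looking async queues up per group instead of per device, behind a
toggle:

#include <linux/ioprio.h>
#include <linux/types.h>

struct toy_cfq_queue;				/* stand-in for cfq_queue */

struct toy_async_queues {
	struct toy_cfq_queue *q[IOPRIO_BE_NR];	/* one per ioprio level */
};

struct toy_cfq_group {
	struct toy_async_queues async;		/* per-cgroup async queues */
};

struct toy_cfq_data {
	bool group_async;			/* hypothetical toggle */
	struct toy_async_queues async;		/* shared per-device queues */
};

/*
 * With group_async enabled, a group's weight also covers its buffered
 * writes; disabled, the old shared-per-device behavior is kept.
 */
static struct toy_cfq_queue **async_slot(struct toy_cfq_data *cfqd,
					 struct toy_cfq_group *cfqg,
					 int ioprio)
{
	struct toy_async_queues *aq;

	aq = cfqd->group_async ? &cfqg->async : &cfqd->async;
	return &aq->q[ioprio];
}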

> >   Note that blk-throttle doesn't have this problem.
> 
> I am not sure what are you trying to say here. But primarily blk-throttle
> will throttle read and direct IO. Buffered writes will go to root cgroup
> which is typically unthrottled.

Ooh, my bad then.  Anyway, the same applies to blk-throttle then.
Our current implementation essentially collapses in the face of a
write-heavy workload.

> > * Unlike dirty data pages, metadata tends to have strict ordering
> >   requirements and thus is susceptible to priority inversion.  Two
> >   solutions were suggested - 1. allow overdrawl for metadata writes so
> >   that low prio metadata writes don't block the whole FS, 2. provide
> >   an interface to query and wait for bdi-cgroup congestion which can
> >   be called from FS metadata paths to throttle metadata operations
> >   before they enter the stream of ordered operations.
> 
> So that probably will mean changing the order of operations also. IIUC, 
> in case of fsync (ordered mode), we opened a meta data transaction first,
> then tried to flush all the cached data and then flush metadata. So if
> fsync is throttled, all the metadata operations behind it will get 
> serialized for ext3/ext4.
> 
> So you seem to be suggesting that we change the design so that metadata
> operation does not thrown into ordered stream till we have finished
> writing all the data back to disk? I am not a filesystem developer, so
> I don't know how feasible this change is.

Jan explained it to me and I don't think it requires extensive
changes to the filesystems.  IIUC, filesystems would just block tasks
creating a journal entry while their matching bdi is congested, and
that's the extent of the necessary change.
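
For example, a sketch of that idea (bdi_write_congested() and
congestion_wait() already exist; a cgroup-aware variant of the check would
be the new part):

#include <linux/backing-dev.h>

/*
 * Sketch: a task waits here *before* it joins the ordered journal
 * stream, so a throttled cgroup cannot block everyone else behind the
 * journal.
 */
static void wait_for_bdi_room(struct backing_dev_info *bdi)
{
	while (bdi_write_congested(bdi))
		congestion_wait(BLK_RW_ASYNC, HZ / 10);

	/* only now open/join the journal transaction */
}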

> This is just one of the points. In the past while talking to Dave Chinner,
> he mentioned that in XFS, if two cgroups fall into same allocation group
> then there were cases where IO of one cgroup can get serialized behind
> other.
> 
> In general, the core of the issue is that filesystems are not cgroup aware
> and if you do throttling below filesystems, then invariably one or other
> serialization issue will come up and I am concerned that we will be constantly
> fixing those serialization issues. Or the design point could be so central
> to filesystem design that it can't be changed.

So, the idea is to avoid allowing any congested cgroup to enter the
serialized journal.  As there's a time gap until the journal commit,
the bdi might be congested by commit time.  In that case, META writes
get to overdraw cgroup limits to avoid causing priority inversion.  I
think we should be able to get most of this working with a bdi
congestion check at the front and a limit bypass for META for now.  We
can worry about overdrawing later.

> In general, if you do throttling deeper in the stack and build back
> pressure, then all the layers sitting above should be cgroup aware
> to avoid problems. Two layers identified so far are writeback and
> filesystems. Is it really worth the complexity. How about doing 
> throttling in higher layers when IO is entering the kernel and
> keep proportional IO logic at the lowest level and current mechanism
> of building pressure continues to work?

First, I just don't think it's the right design.  It's a rather
abstract statement, but I want to emphasize that having the "right"
design, in the sense that we look at the overall picture and put
configs, controls and other logic where their roles say they belong,
tends to make long-term development and maintenance much easier in
ways which may not be immediately foreseeable, for both technical and
social reasons.  Logical structuring and layering keep us sane and
make newcomers' lives at least bearable.

Secondly, I don't think it'll be a lot of added complexity.  We
*need* to fix all of the said shortcomings in the block layer for
proper cgroup support anyway, right?  Propagating that support upwards
doesn't take too much code.  Other than the metadata thing, it mostly
just requires updates to the writeback code such that it deals with
bdi-cgroup combinations instead of individual bdis.  They'll surely
require some adjustments but we're not gonna be burdening the main
paths with cgroup awareness.  cgroup support would just make the
existing implementation work on finer grained domains.
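
In other words, something along these lines as the writeback-visible unit
(purely a sketch; the structure and naming are made up):

#include <linux/backing-dev.h>
#include <linux/cgroup.h>
#include <linux/list.h>

/*
 * Illustrative only: writeback would iterate over and account against
 * per-(bdi, cgroup) domains rather than whole bdis, each with its own
 * dirty state, much as if each pair were a separate device.
 */
struct bdi_cgroup_domain {
	struct backing_dev_info	*bdi;		/* the underlying bdi */
	struct cgroup		*cgrp;		/* owning cgroup */
	unsigned long		nr_dirty;	/* dirty pages in this domain */
	unsigned long		nr_writeback;	/* pages under writeback */
	struct list_head	b_dirty;	/* dirty inodes of this pair */
};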

Thirdly, I don't see how writeback can control all the IOs.  I mean,
what about reads or direct IOs?  It's not like IO devices have
separate channels for those different types of IOs.  They interact
heavily.  Let's say we have iops/bps limitation applied on top of
proportional IO distribution or a device holds two partitions and one
of them is being used for direct IO w/o filesystems.  How would that
work?  I think the question goes even deeper, what do the separate
limits even mean?  Does the IO sched have to calculate allocation of
IO resource to different types of IOs and then give a "number" to
writeback which in turn enforces that limit?  How does the elevator
know what number to give?  Is the number iops or bps or weight?  If
the iosched doesn't know how much write workload exists, how does it
distribute the surplus buffered writeback resource across different
cgroups?  If so, what makes the limit actually enforceable (due to
inaccuracies in estimation, fluctuation in workload, delay in
enforcement in different layers and whatnot) except for block layer
applying the limit *again* on the resulting stream of combined IOs?

Fourthly, having clear layering usually means much more flexibility.
The assumptions about IO characteristics that we have are still mostly
based on devices with spindles, probably because they're still causing
the most pain.  The assumptions keep changing, and if we get the
layering right, we can mostly deal with changes at the layers
concerned, ie. in the block layer.  Maybe we'll have a different
iosched or cfq will evolve to cover the new cases, but the required
adaptation would be logical, and while upper layers might need some
adjustments, they wouldn't need any major overhaul.  They'd still be
working from IO back pressure.

So, the above are the reasons why I don't like the idea of splitting
IO control across multiple layers, well, the ones that I can think of
right now anyway.  I'm currently feeling rather strongly about them,
in the sense of "oh no, this is about to be messed up", but maybe I'm
just not seeing what Fengguang is seeing.  I'll keep discussing there.

> So in general throttling at block layer and building back pressure is
> fine. I am concerned about two cases.
> 
> - How to handle NFS.

As said above, maybe through network-based bdi pressure propagation,
maybe through some other special-case mechanism.  I'm unsure, but I
don't think this concern should dictate the whole design.

> - Do filesystem developers agree with this approach and are they willing
>   to address any serialization issues arising due to this design.

Jan, can you please fill in?  Did I understand it correctly?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]   ` <20120404145134.GC12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2012-04-04 15:36     ` Steve French
@ 2012-04-04 18:49     ` Tejun Heo
  2012-04-07  8:00     ` Jan Kara
  2 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-04 18:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hey, Vivek.

On Wed, Apr 04, 2012 at 10:51:34AM -0400, Vivek Goyal wrote:
> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
> > IIUC, without cgroup, the current writeback code works more or less
> > like this.  Throwing in cgroup doesn't really change the fundamental
> > design.  Instead of a single pipe going down, we just have multiple
> > pipes to the same device, each of which should be treated separately.
> > Of course, a spinning disk can't be divided that easily and their
> > performance characteristics will be inter-dependent, but the place to
> > solve that problem is where the problem is, the block layer.
> 
> How do you take care of thorottling IO to NFS case in this model? Current
> throttling logic is tied to block device and in case of NFS, there is no
> block device.

On principle, I don't think it has be any different.  Filesystems's
interface to the underlying device is through bdi.  If a fs is block
backed, block pressure should be propagated through bdi, which should
be mostly trivial.  If a fs is network backed, we can implement a
mechanism for network backed bdis, so that they can relay the pressure
from the server side to the local fs users.

That said, network filesystems often show different behaviors and use
different mechanisms for various reasons and it wouldn't be too
surprising if something different would fit them better here or we
might need something supplemental to the usual mechanism.

> [..]
> > In the discussion, for such implementation, the following obstacles
> > were identified.
> > 
> > * There are a lot of cases where IOs are issued by a task which isn't
> >   the originiator.  ie. Writeback issues IOs for pages which are
> >   dirtied by some other tasks.  So, by the time an IO reaches the
> >   block layer, we don't know which cgroup the IO belongs to.
> > 
> >   Recently, block layer has grown support to attach a task to a bio
> >   which causes the bio to be handled as if it were issued by the
> >   associated task regardless of the actual issuing task.  It currently
> >   only allows attaching %current to a bio - bio_associate_current() -
> >   but changing it to support other tasks is trivial.
> > 
> >   We'll need to update the async issuers to tag the IOs they issue but
> >   the mechanism is already there.
> 
> Most likely this tagging will take place in "struct page" and I am not
> sure if we will be allowed to grow size of "struct page" for this reason.

With memcg enabled, we are already doing that and IIUC Jan and
Fengguang think that using inode granularity should be good enough for
writeback blaming.

> > * There's a single request pool shared by all issuers per a request
> >   queue.  This can lead to priority inversion among cgroups.  Note
> >   that problem also exists without cgroups.  Lower ioprio issuer may
> >   be holding a request holding back highprio issuer.
> > 
> >   We'll need to make request allocation cgroup (and hopefully ioprio)
> >   aware.  Probably in the form of separate request pools.  This will
> >   take some work but I don't think this will be too challenging.  I'll
> >   work on it.
> 
> This should be doable. I had implemented it long back with single request
> pool but internal limits for each group. That is block the task in the
> group if group has enough pending requests allocated from the pool. But
> separate request pool should work equally well. 
> 
> Just that it conflits a bit with current definition of q->nr_requests.
> Which specifies number of total outstanding requests on the queue. Once
> you make the pool per queue, I guess this limit will have to be
> transformed into per group upper limit.

I'm not sure about the details yet.  I *think* the suckiest part is
the actual allocation part.  We're deferring cgroup - request_queue
association until actual usage and depending on atomic allocations to
create those associations on IO path.  Doing the same for requests
might not be too pleasant.  Hmm....  allocation failure handling on
that path is already broken BTW.  Maybe we just need to get the
fallback behavior properly working.  Unsure.

> > * cfq cgroup policy throws all async IOs, which all buffered writes
> >   are, into the shared cgroup regardless of the actual cgroup.  This
> >   behavior is, I believe, mostly historical and changing it isn't
> >   difficult.  Prolly only few tens of lines of changes.  This may
> >   cause significant changes to actual IO behavior with cgroups tho.  I
> >   personally think the previous behavior was too wrong to keep (the
> >   weight was completely ignored for buffered writes) but we may want
> >   to introduce a switch to toggle between the two behaviors.
> 
> I had kept all buffered writes in in same cgroup (root cgroup) for few
> reasons.
> 
> - Because of single request descriptor pool for writes, anyway one writer
>   gets backlogged behind other. So creating separate async queues per
>   group is not going to help.
> 
> - Writeback logic was not cgroup aware. So it might not send enough IO
>   from each writer to maintain parallelism. So creating separate async
>   queues did not make sense till that was fixed.

Yeah, the above are why I find "buffered writes need separate controls
because cfq doesn't distinguish async writes" argument very ironic.
We introduce one quirk to compensate for shortages in the other part
and then later we work that around in that other part for that quirk?
I mean, that's just twisted.

> - As you said, it is historical also. We prioritize READS at the expense
>   of writes. Now by putting buffered/async writes in a separate group, we
>   will might end up prioritizing a group's async write over other group's
>   synchronous read. How many people really want that behavior? To me
>   keeping service differentiation among the sync IO matters most. Even
>   if all async IO is treated same, I guess not many people might care.

While segregation of async IOs may not matter in some cases, it does
matter to many other use cases, so it seems to me that we hard coded
that optimization decision without thinking too much about it.  For a
lot of container type use cases, the current implementation is nearly
useless (I know of cases where people are explicitly patching for
separate async queues).  At the same time, switching the default
behavior *may* disturb some of the current users and that's why I'm
thinking abut having a switch for the new behavior.

> >   Note that blk-throttle doesn't have this problem.
> 
> I am not sure what are you trying to say here. But primarily blk-throttle
> will throttle read and direct IO. Buffered writes will go to root cgroup
> which is typically unthrottled.

Ooh, my bad then.  Anyways, then the same applies to blk-throttle.
Our current implementation essentially collapses in the face of a
write-heavy workload.

> > * Unlike dirty data pages, metadata tends to have strict ordering
> >   requirements and thus is susceptible to priority inversion.  Two
> >   solutions were suggested - 1. allow overdraw for metadata writes so
> >   that low prio metadata writes don't block the whole FS, 2. provide
> >   an interface to query and wait for bdi-cgroup congestion which can
> >   be called from FS metadata paths to throttle metadata operations
> >   before they enter the stream of ordered operations.
> 
> So that probably will mean changing the order of operations also. IIUC,
> in the case of fsync (ordered mode), we open a metadata transaction first,
> then try to flush all the cached data and then flush the metadata. So if
> fsync is throttled, all the metadata operations behind it will get
> serialized for ext3/ext4.
> 
> So you seem to be suggesting that we change the design so that a metadata
> operation is not thrown into the ordered stream till we have finished
> writing all the data back to disk? I am not a filesystem developer, so
> I don't know how feasible this change is.

Jan explained it to me and I don't think it requires extensive changes
to the filesystem.  IIUC, filesystems would just block tasks creating a
journal entry while their matching bdi is congested, and that's the
extent of the necessary change.

> This is just one of the points. In the past while talking to Dave Chinner,
> he mentioned that in XFS, if two cgroups fall into the same allocation group
> then there were cases where the IO of one cgroup can get serialized behind
> the other.
> 
> In general, the core of the issue is that filesystems are not cgroup aware
> and if you do throttling below filesystems, then invariably one or another
> serialization issue will come up and I am concerned that we will be constantly
> fixing those serialization issues. Or the design point could be so central
> to filesystem design that it can't be changed.

So, the idea is to avoid allowing any congested cgroup to enter the
serialized journal.  As there's a time gap until journal commit, the bdi
might be congested by the commit time.  In that case, META writes get
to overdraw cgroup limits to avoid causing priority inversion.  I
think we should be able to get most of it working with a bdi congestion
check at the front and a limit bypass for META for now.  We can worry
about overdrawing later.
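
To put a rough shape on that - all of the names below are made up
(blkcg_wait_congested(), fs_begin_buffered_write(), throtl_should_throttle());
only REQ_META and sb->s_bdi exist today - so read it as a sketch of where
the two pieces would sit, not as a proposed interface:

#include <linux/backing-dev.h>
#include <linux/blk_types.h>
#include <linux/fs.h>

/* hypothetical: sleep while the current task's cgroup is congested on @bdi */
void blkcg_wait_congested(struct backing_dev_info *bdi);

/*
 * Filesystem side: throttle *before* joining the ordered stream, i.e.
 * before any journal handle or transaction credits are held, so a
 * congested cgroup waits here instead of inside the shared journal.
 */
static int fs_begin_buffered_write(struct super_block *sb)
{
        blkcg_wait_congested(sb->s_bdi);
        /* only now open the journal handle / reserve credits (fs specific) */
        return 0;
}

/*
 * Throttling side: IO tagged REQ_META gets through even when the group
 * is over its limit - overdraw instead of priority inversion.
 */
static bool throtl_should_throttle(struct bio *bio, bool group_over_limit)
{
        if (bio->bi_rw & REQ_META)
                return false;
        return group_over_limit;
}

How much we end up leaning on the front check versus the META bypass is
the part we can figure out later.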

> In general, if you do throttling deeper in the stack and build back
> pressure, then all the layers sitting above should be cgroup aware
> to avoid problems. Two layers identified so far are writeback and
> filesystems. Is it really worth the complexity? How about doing
> throttling in higher layers when IO is entering the kernel, and
> keeping the proportional IO logic at the lowest level so that the
> current mechanism of building pressure continues to work?

First, I just don't think it's the right design.  It's a rather
abstract statement, but I want to emphasize that having the "right"
design - looking at the overall picture and putting configs, controls
and other logic where their roles say they belong - tends to make
long-term development and maintenance much easier, in ways which may
not be immediately foreseeable, for both technical and social reasons.
Logical structuring and layering keep us sane and make newcomers'
lives at least bearable.

Secondly, I don't think it'll be a lot of added complexity.  We *need*
to fix all the said shortcomings in the block layer for proper cgroup
support anyway, right?  Propagating that support upwards doesn't take
too much code.  Other than the metadata thing, it mostly just requires
updates to the writeback code such that it deals with bdi-cgroup
combinations instead of individual bdis.  Those paths will surely require
some adjustments, but we're not gonna be burdening the main paths with
cgroup awareness.  cgroup support would just make the existing
implementation work on finer-grained domains.

Thirdly, I don't see how writeback can control all the IOs.  I mean,
what about reads or direct IOs?  It's not like IO devices have
separate channels for those different types of IOs.  They interact
heavily.  Let's say we have iops/bps limitation applied on top of
proportional IO distribution or a device holds two partitions and one
of them is being used for direct IO w/o filesystems.  How would that
work?  I think the question goes even deeper: what do the separate
limits even mean?  Does the IO sched have to calculate allocation of
IO resource to different types of IOs and then give a "number" to
writeback which in turn enforces that limit?  How does the elevator
know what number to give?  Is the number iops or bps or weight?  If
the iosched doesn't know how much write workload exists, how does it
distribute the surplus buffered writeback resource across different
cgroups?  If so, what makes the limit actually enforceable (given
inaccuracies in estimation, fluctuation in workload, delay in
enforcement in different layers and whatnot), except for the block layer
applying the limit *again* on the resulting stream of combined IOs?

Fourthly, having clear layering usually means much more flexibility.
The assumptions about IO characteristics that we have are still mostly
based on devices with spindles, probably because they're still causing
the most pain.  The assumptions keep changing, and if we get
the layering correct, we can mostly deal with changes at the layers
concerning them - ie. in the block layer.  Maybe we'll have a
different iosched, or cfq can evolve to cover the new cases, but
the required adaptation would be logical, and while upper layers might
need some adjustments they wouldn't need any major overhaul.  They'd
still be working from IO back pressure.

So, the above are the reasons why I don't like the idea of splitting
IO control across multiple layers, well the ones that I can think of
right now anyway.  I'm currently feeling rather strongly about them in
the sense of "oh no, this is about to be messed up" but maybe I'm just
not seeing what Fengguang is seeing.  I'll keep discussing there.

> So in general throttling at block layer and building back pressure is
> fine. I am concerned about two cases.
> 
> - How to handle NFS.

As said above, maybe through network-based bdi pressure propagation,
maybe some other special-case mechanism.  Unsure, but I don't think
this concern should dictate the whole design.

> - Do filesystem developers agree with this approach and are they willing
>   to address any serialization issues arising due to this design.

Jan, can you please fill in?  Did I understand it correctly?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]     ` <CAH2r5mtwQa0Uu=_Yd2JywVJXA=OMGV43X_OUfziC-yeVy9BGtQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-04-04 18:56       ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-04 18:56 UTC (permalink / raw)
  To: Steve French
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

On Wed, Apr 04, 2012 at 10:36:04AM -0500, Steve French wrote:
> > > How do you take care of throttling IO in the NFS case in this model? Current
> > > throttling logic is tied to the block device and in the case of NFS, there is no
> > block device.
> 
> Similarly smb2 gets congestion info (number of "credits") returned from
> the server on every response - but not sure why congestion
> control is tied to the block device when this would create
> problems for network file systems

I hope the previous replies answered this.  It's about writeback
getting pressure from bdi and isn't restricted to block devices.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]       ` <20120404185605.GC29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
@ 2012-04-04 19:19         ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-04 19:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Steve French,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 04, 2012 at 11:56:05AM -0700, Tejun Heo wrote:
> On Wed, Apr 04, 2012 at 10:36:04AM -0500, Steve French wrote:
> > > How do you take care of throttling IO in the NFS case in this model? Current
> > > throttling logic is tied to the block device and in the case of NFS, there is no
> > > block device.
> > 
> > Similarly smb2 gets congestion info (number of "credits") returned from
> > the server on every response - but not sure why congestion
> > control is tied to the block device when this would create
> > problems for network file systems
> 
> I hope the previous replies answered this.  It's about writeback
> getting pressure from bdi and isn't restricted to block devices.

So the controlling knobs for network filesystems will be very different,
as the current throttling knobs are per device (and not per bdi). So
presumably there will be some throttling logic in the network layer
(network tc), and that should communicate the back pressure.

I have tried limiting network traffic on NFS using the network controller
and tc but that did not help, for a variety of reasons.

- We again have the problem of losing the submitter's context down the stack.

- We have interesting TCP/IP sequencing issues. I don't have the details,
  but throttling traffic from one group led to multiple re-transmissions
  from the server due to some sequence number issues. Sorry, I am short on
  details as it was long back, and the nfs guys told me that pNFS might
  help here.

  The basic problem seemed to be that if you multiplex traffic from
  all cgroups on a single tcp/ip session and then suddenly choke IO from
  one of them, that leads to sequence number issues and really sucky
  performance.

So that's something to keep in mind while coming up with ways to implement
throttling for network file systems.

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]     ` <20120404184909.GB29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
@ 2012-04-04 19:23       ` Steve French
  2012-04-04 20:32       ` Vivek Goyal
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 262+ messages in thread
From: Steve French @ 2012-04-04 19:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

On Wed, Apr 4, 2012 at 1:49 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hey, Vivek.
>
> On Wed, Apr 04, 2012 at 10:51:34AM -0400, Vivek Goyal wrote:
>> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
>> > IIUC, without cgroup, the current writeback code works more or less
>> > like this.  Throwing in cgroup doesn't really change the fundamental
>> > design.  Instead of a single pipe going down, we just have multiple
>> > pipes to the same device, each of which should be treated separately.
>> > Of course, a spinning disk can't be divided that easily and their
>> > performance characteristics will be inter-dependent, but the place to
>> > solve that problem is where the problem is, the block layer.
>>
>> How do you take care of throttling IO in the NFS case in this model? Current
>> throttling logic is tied to the block device and in the case of NFS, there is no
>> block device.
>
> On principle, I don't think it has to be any different.  A filesystem's
> interface to the underlying device is through bdi.  If a fs is block
> backed, block pressure should be propagated through bdi, which should
> be mostly trivial.  If a fs is network backed, we can implement a
> mechanism for network backed bdis, so that they can relay the pressure
> from the server side to the local fs users.
>
> That said, network filesystems often show different behaviors and use
> different mechanisms for various reasons and it wouldn't be too
> surprising if something different would fit them better here or we
> might need something supplemental to the usual mechanism.

For the network file system clients, we may be close already,
but I don't know how to allow servers like Samba or Apache
to query btrfs, xfs etc. for this information.

superblock -> struct backing_dev_info is probably fine as long
as we aren't making that structure more block device specific.
Current use of bdi is a little hard to understand since
there are 25+ fields in the structure.  Is their use/purpose written
up anywhere?  I have a feeling we are under-utilizing what
is already there.  In any case bdi is "backing" info not "block"
specific info.  Since bdi can be assigned to a superblock
and an inode, it seems reasonable for either network or local.
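
For the in-kernel half this is already sort of answerable with the
existing helpers - a minimal sketch, assuming nothing beyond
bdi_read_congested()/bdi_write_congested() and sb->s_bdi (the
cgroup-aware and user-space-visible variants being discussed in this
thread don't exist):

#include <linux/backing-dev.h>
#include <linux/fs.h>

/* works for any bdi, block backed or network backed */
static bool fs_backing_congested(struct super_block *sb, bool for_write)
{
        struct backing_dev_info *bdi = sb->s_bdi;

        return for_write ? bdi_write_congested(bdi)
                         : bdi_read_congested(bdi);
}

That only helps in-kernel callers though, which is the Samba problem
I get to below.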

Note that it isn't just traditional network file systems (nfs and cifs and smb2)
but also virtualization (virtfs) and some special purpose file systems
for which block device specific interfaces to higher layers (above the fs)
are an awkward way to think about congestion.   What
about the case of a file system like btrfs that can back a
volume with a pool of devices and distribute hot/cold data
across multiple physical or logical devices?

By the way, there may be less of a problem with current
network file system clients due to small limits on simultaneous i/o.
Until recently the NFS client had a low default slot count of 16 IIRC, and
it was not much better for cifs.   The typical cifs server defaulted
to allowing a client to only send 50 simultaneous requests to that
server at one time ...
The cifs protocol allows more (up to 64K) and in 3.4 the client now
can send more requests (up to 32K) if the server is so configured.

With SMB2 since "credits" are returned on every response, fast
servers (e.g. Samba running on a good clustered file system,
or a good NAS box) may end up allowing thousands of simultaneous
requests if they have the resources to handle this.   Unfortunately,
the Samba server developers do not know how to query the
superblock->bdi congestion information from user space.
I vaguely remember bdi debugging info being available in sysfs, but how
would an application find out how congested the underlying volume
it is exporting is?

-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-04 17:51     ` Fengguang Wu
  (?)
@ 2012-04-04 19:33       ` Tejun Heo
  -1 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-04 19:33 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Hey, Fengguang.

On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> Yeah it should be trivial to apply the balance_dirty_pages()
> throttling algorithm to the read/direct IOs. However up to now I don't
> see much added value to *duplicate* the current block IO controller
> functionalities, assuming the current users and developers are happy
> with it.

Heh, trust me.  It's half broken and people ain't happy.  I get that
your algorithm can be updated to consider all IOs and I believe that
but what I don't get is how would such information get to writeback
and in turn how writeback would enforce the result on reads and direct
IOs.  Through what path?  Will all reads and direct IOs travel through
balance_dirty_pages() even direct IOs on raw block devices?  Or would
the writeback algorithm take the configuration from cfq, apply the
algorithm and give back the limits to enforce to cfq?  If the latter,
isn't that at least somewhat messed up?

> I did the buffered write IO controller mainly to fill the gap.  If I
> happen to stand in your way, sorry that's not my initial intention.

No, no, it's not about standing in my way.  As Vivek said in the other
reply, it's that the "gap" that you filled was created *because*
writeback wasn't cgroup aware and now you're in turn filling that gap
by making writeback work around that "gap".  I mean, my mind boggles.
Doesn't yours?  I strongly believe everyone's should.

> It's a pity and surprise that Google as a big user does not buy in
> this simple solution. You may prefer more comprehensive controls which
> may not be easily achievable with the simple scheme. However the
> complexities and overheads involved in throttling the flusher IOs
> really upsets me. 

Heh, believe it or not, I'm not really wearing google hat on this
subject and google's writeback people may have completely different
opinions on the subject than mine.  In fact, I'm not even sure how
much "work" time I'll be able to assign to this.  :(

> The sweet split point would be for balance_dirty_pages() to do cgroup
> aware buffered write throttling and leave other IOs to the current
> blkcg. For this to work well as a total solution for end users, I hope
> we can cooperate and figure out ways for the two throttling entities
> to work well with each other.

There's where I'm confused.  How is the said split supposed to work?
They aren't independent.  I mean, who gets to decide what and where
are those decisions enforced?

> What I'm interested is, what's Google and other users' use schemes in
> practice. What's their desired interfaces. Whether and how the
> combined bdp+blkcg throttling can fulfill the goals.

I'm not too privy to mm and writeback in google and even if I were I
probably shouldn't talk too much about it.  Confidentiality and all.
That said, I have the general feeling that goog already figured out
how to at least work around the existing implementation and would be
able to continue no matter how upstream development pans out.

That said, wearing the cgroup maintainer and general kernel
contributor hat, I'd really like to avoid another design mess up.

> > Let's please keep the layering clear.  IO limitations will be applied
> > at the block layer and pressure will be formed there and then
> > propagated upwards eventually to the originator.  Sure, exposing the
> > whole information might result in better behavior for certain
> > workloads, but down the road, say, in three or five years, devices
> > which can be shared without worrying too much about seeks might be
> > commonplace and we could be swearing at a disgusting structural mess,
> > and sadly various cgroup support seems to be a prominent source of
> > such design failures.
> 
> Super fast storages are coming which will make us regret to make the
> IO path over complex.  Spinning disks are not going away anytime soon.
> I doubt Google is willing to afford the disk seek costs on its
> millions of disks and has the patience to wait until switching all of
> the spin disks to SSD years later (if it will ever happen).

This is new.  Let's keep the damn employer out of the discussion.
While the area I work on is affected by my employment (writeback isn't
even my area BTW), I'm not gonna do something adverse to upstream even
if it's beneficial to google and I'm much more likely to do something
which may hurt google a bit if it's gonna benefit upstream.

As for the faster / newer storage argument, that is *exactly* why we
want to keep the layering proper.  Writeback works from the pressure
from the IO stack.  If IO technology changes, we update the IO stack
and writeback still works from the pressure.  It may need to be
adjusted but the principles don't change.

> It's obvious that your below proposal involves a lot of complexities,
> overheads, and will hurt performance. It basically involves

Hmmm... that's not the impression I got from the discussion.
According to Jan, applying the current writeback logic to cgroup'fied
bdi shouldn't be too complex, no?

> - running concurrent flusher threads for cgroups, which adds back the
>   disk seeks and lock contentions. And still has problems with sync
>   and shared inodes.

I agree this is an actual concern but if the user wants to split one
spindle into multiple resource domains, there's gonna be a considerable
amount of overhead no matter what.  If you want to improve how the block
layer handles the split, you're welcome to dive into the block layer,
where the split is made, and improve it.

> - splitting device queue for cgroups, possibly scaling up the pool of
>   writeback pages (and locked pages in the case of stable pages) which
>   could stall random processes in the system

Sure, it'll take up more buffering and memory but that's the overhead
of the cgroup business.  I want it to be less intrusive at the cost of
somewhat more resource consumption.  ie. I don't want writeback logic
itself deeply involved in block IO cgroup enforcement even if that
means somewhat less efficient resource usage.

> - the mess of metadata handling

Does throttling from writeback actually solve this problem?  What
about fsync()?  Does that already go through balance_dirty_pages()?

> - unnecessarily coupled with memcg, in order to take advantage of the
>   per-memcg dirty limits for balance_dirty_pages() to actually convert
>   the "pushed back" dirty pages pressure into lowered dirty rate. Why
>   the hell the users *have to* setup memcg (suffering from all the
>   inconvenience and overheads) in order to do IO throttling?  Please,
>   this is really ugly! And the "back pressure" may constantly push the
>   memcg dirty pages to the limits. I'm not going to support *misuse*
>   of per-memcg dirty limits like this!

Writeback sits between blkcg and memcg and it indeed can be hairy to
consider both sides especially given the current sorry complex state
of cgroup and I can see why it would seem tempting to add a separate
controller or at least knobs to support that.  That said, I *think*
given that memcg controls all other memory parameters it probably
would make most sense giving that parameter to memcg too.  I don't
think this is really relevant to this discussion tho.  Who owns
dirty_limits is a separate issue.

> I cannot believe you would keep overlooking all the problems without
> good reasons. Please do tell us the reasons that matter.

Well, I tried and I hope some of it got through.  I also wrote a lot
of questions, mainly regarding how what you have in mind is supposed
to work through what path.  Maybe I'm just not seeing what you're
seeing but I just can't see where all the IOs would go through and
come together.  Can you please elaborate more on that?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-04 19:33       ` Tejun Heo
  (?)
@ 2012-04-04 20:18           ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-04 20:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Wed, Apr 04, 2012 at 12:33:55PM -0700, Tejun Heo wrote:
> Hey, Fengguang.
> 
> On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > Yeah it should be trivial to apply the balance_dirty_pages()
> > throttling algorithm to the read/direct IOs. However up to now I don't
> > see much added value to *duplicate* the current block IO controller
> > functionalities, assuming the current users and developers are happy
> > with it.
> 
> Heh, trust me.  It's half broken and people ain't happy.  I get that
> > your algorithm can be updated to consider all IOs and I believe that
> but what I don't get is how would such information get to writeback
> and in turn how writeback would enforce the result on reads and direct
> IOs.  Through what path?  Will all reads and direct IOs travel through
> balance_dirty_pages() even direct IOs on raw block devices?  Or would
> the writeback algorithm take the configuration from cfq, apply the
> algorithm and give back the limits to enforce to cfq?  If the latter,
> isn't that at least somewhat messed up?

I think he wanted to get the configuration with the help of blkcg
interface and just implement those policies up there without any
further interaction with CFQ or lower layers.

[..]
> > The sweet split point would be for balance_dirty_pages() to do cgroup
> > aware buffered write throttling and leave other IOs to the current
> > blkcg. For this to work well as a total solution for end users, I hope
> > we can cooperate and figure out ways for the two throttling entities
> > to work well with each other.
> 
> There's where I'm confused.  How is the said split supposed to work?
> They aren't independent.  I mean, who gets to decide what and where
> are those decisions enforced?

As you said, the split is just temporary gap filling in the absence of a
good solution for throttling buffered writes (which are often a source
of problems for sync IO latencies). So with this solution one could
independently control the buffered write rate of a cgroup. Lower layers
will not throttle that traffic again as it would show up in the root
cgroup. Hence blkcg and writeback need not communicate much, except for
configuration knobs and possibly for some stats.

[..]
> > - running concurrent flusher threads for cgroups, which adds back the
> >   disk seeks and lock contentions. And still has problems with sync
> >   and shared inodes.
> 

Or, export the notion of per-group per-bdi congestion so that the flusher
does not try to submit IO from an inode if that group's device queue is
congested. That way the flusher will not get blocked, we don't have to
create one flusher thread per cgroup, and we can stay happy with one
flusher per bdi.

And with the compromise of one inode belonging to one cgroup, we will
still dispatch a bunch of IO from one inode and then move to the next.
Depending on the size of the chunk we can reduce seeking a bit. The size
of the quantum will decide the tradeoff between seeks and fairness of
writes across inodes.
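
(A rough, non-compilable sketch of where such a check could sit in the
flusher's inode walk; inode_blkcg(), bdi_blkcg_congested(),
writeback_chunk() and quantum_pages are hypothetical names, not existing
kernel interfaces:)

	/* inside a writeback_sb_inodes()-style loop over wb->b_io */
	while (!list_empty(&wb->b_io)) {
		struct inode *inode = wb_inode(wb->b_io.prev);
		struct blkio_cgroup *blkcg = inode_blkcg(inode);	/* hypothetical */

		if (bdi_blkcg_congested(wb->bdi, blkcg)) {		/* hypothetical */
			/* don't block the one flusher; revisit this inode later */
			requeue_io(inode, wb);
			continue;
		}
		/* dispatch one quantum (a few MB) from this inode, then move on,
		 * trading some extra seeking for fairness across inodes/groups */
		writeback_chunk(inode, wb, quantum_pages);		/* hypothetical */
	}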

[..]
> > - the mess of metadata handling
> 
> Does throttling from writeback actually solve this problem?  What
> about fsync()?  Does that already go through balance_dirty_pages()?

By throttling the process at the time it dirties memory, you have already
admitted only as much IO from the process as the limits allow. Now fsync()
has to send only those pages to the disk and does not have to be throttled
again.

So throttling the process while you are admitting IO avoids these issues
with filesystem metadata.

But at the same time it does not feel right to throttle reads and AIO
synchronously. The current behavior of the kernel queuing up the bio and
throttling it asynchronously is desirable. Only buffered writes are a
special case, as we actively throttle them anyway based on the amount of
dirty memory.

[..]
> 
> > - unnecessarily coupled with memcg, in order to take advantage of the
> >   per-memcg dirty limits for balance_dirty_pages() to actually convert
> >   the "pushed back" dirty pages pressure into lowered dirty rate. Why
> >   the hell the users *have to* setup memcg (suffering from all the
> >   inconvenience and overheads) in order to do IO throttling?  Please,
> >   this is really ugly! And the "back pressure" may constantly push the
> >   memcg dirty pages to the limits. I'm not going to support *misuse*
> >   of per-memcg dirty limits like this!
> 
> Writeback sits between blkcg and memcg and it indeed can be hairy to
> consider both sides especially given the current sorry complex state
> of cgroup and I can see why it would seem tempting to add a separate
> controller or at least knobs to support that.  That said, I *think*
> given that memcg controls all other memory parameters it probably
> would make most sense giving that parameter to memcg too.  I don't
> think this is really relevant to this discussion tho.  Who owns
> dirty_limits is a separate issue.

I agree that dirty_limit control resembles memcg more closely than
blkcg, as it is all about writing to memory and that's the resource
controlled by memcg.

I think Fengguang wanted to keep those knobs in blkcg as he thinks that
in the writeback logic he can actively throttle readers and direct IO too.
But that sounds a little messy to me too.

Hey, how about reconsidering my other proposal, for which I had posted
patches: keep throttling at the device level. Reads and direct IO get
throttled asynchronously, but buffered writes get throttled
synchronously.

Advantages of this scheme.

- There are no separate knobs.

- All the IO (read, direct IO and buffered write) is controlled using
  same set of knobs and goes in queue of same cgroup.

- Writeback logic has no knowledge of throttling. It just invokes a
  hook into the throttling logic of the device queue.

I guess this is a hybrid of active writeback throttling and back pressure
mechanism.

But it still does not solve the NFS issue, and for direct IO filesystems
can still get serialized, so the metadata issue still needs to be
resolved. So one can argue: why not go for the full "back pressure"
method, despite it being more complex?

Here is the link, just to refresh the memory. Something to keep in mind
while assessing alternatives.

https://lkml.org/lkml/2011/6/28/243

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]     ` <20120404184909.GB29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
  2012-04-04 19:23       ` Steve French
@ 2012-04-04 20:32       ` Vivek Goyal
  2012-04-05 16:38       ` Tejun Heo
  2012-04-14 11:53       ` [Lsf] " Peter Zijlstra
  3 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-04 20:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Wed, Apr 04, 2012 at 11:49:09AM -0700, Tejun Heo wrote:

[..]

> Thirdly, I don't see how writeback can control all the IOs.  I mean,
> what about reads or direct IOs?  It's not like IO devices have
> separate channels for those different types of IOs.  They interact
> heavily.

> Let's say we have iops/bps limitation applied on top of proportional IO
> distribution

We already do that. First the IO is subjected to the throttling limit, and
only then is it passed to the elevator to do proportional IO. So throttling
is already stacked on top of proportional IO. The only question is
whether it should be pushed to even higher layers or not.


> or a device holds two partitions and one
> of them is being used for direct IO w/o filesystems.  How would that
> work?  I think the question goes even deeper, what do the separate
> limits even mean?

Separate limits for buffered writes are just filling the gap. Agreed it
is not a very neat solution.

>  Does the IO sched have to calculate allocation of
> IO resource to different types of IOs and then give a "number" to
> writeback which in turn enforces that limit?  How does the elevator
> know what number to give?  Is the number iops or bps or weight?

If we push all the throttling up into some higher layer, say some kind
of per-bdi throttling interface, then the elevator just has to worry
about doing proportional IO. No interaction with higher layers
regarding iops/bps etc. (not that the elevator worries about it today).

> If
> the iosched doesn't know how much write workload exists, how does it
> distribute the surplus buffered writeback resource across different
> cgroups?  If so, what makes the limit actualy enforceable (due to
> inaccuracies in estimation, fluctuation in workload, delay in
> enforcement in different layers and whatnot) except for block layer
> applying the limit *again* on the resulting stream of combined IOs?

So the split model is definitely confusing. Anyway, the block layer will
not apply the limits again, as flusher IO will go in the root cgroup,
which is generally unthrottled. Or the flusher could mark the bios with
a flag saying "do not throttle this bio again", as these have been
throttled already. So throttling again is probably not an issue.
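
(For reference, blk-throttle already keeps an internal "already throttled"
marker, REQ_THROTTLED, so a bio it resubmits is not charged twice; the
flusher-marks-its-bios idea would be in the same spirit.  A very rough
sketch, with the flusher-side tagging purely illustrative:)

	/* at the top of blk_throtl_bio(): a bio already charged once (here,
	 * also one marked by the flusher because its pages were throttled in
	 * balance_dirty_pages()) just passes through to the elevator */
	if (bio->bi_rw & REQ_THROTTLED)
		goto out;	/* no second round of bps/iops accounting */

	/* flusher side (illustrative only): tag writeback bios as pre-throttled */
	bio->bi_rw |= REQ_THROTTLED;
	submit_bio(WRITE, bio);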

In summary, agreed that split is confusing and it fills a gap existing
today.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-04 18:35       ` Vivek Goyal
  (?)
@ 2012-04-04 21:42           ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-04 21:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote:
> On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> 
> [..]
> > The sweet split point would be for balance_dirty_pages() to do cgroup
> > aware buffered write throttling and leave other IOs to the current
> > blkcg. For this to work well as a total solution for end users, I hope
> > we can cooperate and figure out ways for the two throttling entities
> > to work well with each other.
> 
> Throttling read + direct IO, higher up has few issues too. Users will

Yeah I have a bit worry about high layer throttling, too.
Anyway here are the ideas.

> not like that a task got blocked as it tried to submit a read from a
> throttled group.

That's not the same issue I worried about :) Throttling is about
inserting small sleep/waits into selected points. For reads, the ideal
sleep point is immediately after readahead IO is summited, at the end
of __do_page_cache_readahead(). The same should be applicable to
direct IO.

> Current async behavior works well where we queue up the
> bio from the task in throttled group and let task do other things. Same
> is true for AIO where we would not like to block in bio submission.

For AIO, we'll need to delay the IO completion notification or status
update, which may involve computing some delay time and delay the
calls to io_complete() with the help of some delayed work queue. There
may be more issues to deal with as I didn't look into aio.c carefully.

The thing worried me is that in the proportional throttling case, the
high level throttling works on the *estimated* task_ratelimit =
disk_bandwidth / N, where N is the number of read IO tasks. When N
suddenly changes from 2 to 1, it may take 1 second for the estimated
task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth,
during which time the disk won't get 100% utilized because of the
temporally over-throttling of the remaining IO task.

This is not a problem when throttling at the block/cfq layer, since it
has the full information of pending requests and should not depend on
such estimations.

The workaround I can think of, is to put the throttled task into a wait
queue, and let block layer wake up the waiters when the IO queue runs
empty. This should be able to avoid most disk idle time.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-04 21:42           ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-04 21:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote:
> On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> 
> [..]
> > The sweet split point would be for balance_dirty_pages() to do cgroup
> > aware buffered write throttling and leave other IOs to the current
> > blkcg. For this to work well as a total solution for end users, I hope
> > we can cooperate and figure out ways for the two throttling entities
> > to work well with each other.
> 
> Throttling read + direct IO, higher up has few issues too. Users will

Yeah, I have a bit of worry about high-layer throttling, too.
Anyway, here are the ideas.

> not like that a task got blocked as it tried to submit a read from a
> throttled group.

That's not the same issue I was worried about :) Throttling is about
inserting small sleeps/waits at selected points. For reads, the ideal
sleep point is immediately after the readahead IO is submitted, at the
end of __do_page_cache_readahead(). The same should be applicable to
direct IO.
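
Roughly the sort of thing I have in mind, purely as an illustration (this
is not the actual readahead code; the struct and helper names are
invented):

#include <time.h>

struct read_throttle {
        double bytes_per_sec;           /* per-cgroup read rate limit */
};

/* Called right after the readahead window has been submitted. */
static void sleep_after_readahead(const struct read_throttle *rt,
                                  long bytes_submitted)
{
        double secs = (double)bytes_submitted / rt->bytes_per_sec;
        struct timespec ts = {
                .tv_sec  = (time_t)secs,
                .tv_nsec = (long)((secs - (double)(time_t)secs) * 1e9),
        };

        nanosleep(&ts, NULL);           /* the small, invisible wait */
}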

> Current async behavior works well where we queue up the
> bio from the task in throttled group and let task do other things. Same
> is true for AIO where we would not like to block in bio submission.

For AIO, we'll need to delay the IO completion notification or status
update, which may involve computing some delay time and delaying the
calls to io_complete() with the help of some delayed work queue. There
may be more issues to deal with, as I didn't look into aio.c carefully.
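
Purely as an illustration of the idea (none of these names exist in
aio.c), the completion path could stamp each event with a due time and
only report it once that time has passed:

#include <stdbool.h>
#include <time.h>

struct throttled_event {
        struct timespec due;            /* earliest time we may report it */
        long bytes;                     /* size of the completed IO */
};

/* On completion: push the notification into the future per the rate limit. */
static void delay_completion(struct throttled_event *ev, double bytes_per_sec)
{
        double wait = (double)ev->bytes / bytes_per_sec;

        clock_gettime(CLOCK_MONOTONIC, &ev->due);
        ev->due.tv_sec  += (time_t)wait;
        ev->due.tv_nsec += (long)((wait - (double)(time_t)wait) * 1e9);
        if (ev->due.tv_nsec >= 1000000000L) {
                ev->due.tv_sec++;
                ev->due.tv_nsec -= 1000000000L;
        }
}

/* Delayed-work style check: may this event be handed to io_getevents()? */
static bool completion_due(const struct throttled_event *ev)
{
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        return now.tv_sec > ev->due.tv_sec ||
               (now.tv_sec == ev->due.tv_sec && now.tv_nsec >= ev->due.tv_nsec);
}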

The thing that worried me is that in the proportional throttling case,
the high-level throttling works on the *estimated* task_ratelimit =
disk_bandwidth / N, where N is the number of read IO tasks. When N
suddenly changes from 2 to 1, it may take 1 second for the estimated
task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth,
during which time the disk won't get 100% utilized because of the
temporary over-throttling of the remaining IO task.
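
A toy calculation of that adaptation cost (all numbers are made up, just
to show the shape of the problem):

#include <stdio.h>

int main(void)
{
        double bw = 100.0;              /* MB/s, full disk bandwidth */
        double estimate = bw / 2;       /* limit while two readers ran */
        double adapt_time = 1.0;        /* seconds until the estimate
                                           catches up after one reader exits */

        /* until then the surviving reader is still capped at bw/2 */
        printf("bandwidth left unused: ~%.0f MB over %.1f s\n",
               (bw - estimate) * adapt_time, adapt_time);
        return 0;
}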

This is not a problem when throttling at the block/cfq layer, since it
has the full information of pending requests and should not depend on
such estimations.

The workaround I can think of is to put the throttled task into a wait
queue, and let the block layer wake up the waiters when the IO queue runs
empty. This should be able to avoid most of the disk idle time.
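
In pseudo-C the workaround would look something like this (pthread
primitives standing in for the kernel wait queue; the names are
invented):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t waiters = PTHREAD_COND_INITIALIZER;
static int queued_ios;                  /* incremented on each submission */

/* Over-throttled task: park here instead of sleeping for a fixed time. */
static void throttle_wait(void)
{
        pthread_mutex_lock(&lock);
        while (queued_ios > 0)
                pthread_cond_wait(&waiters, &lock);
        pthread_mutex_unlock(&lock);
}

/* Block layer: once the last request completes, the queue ran empty. */
static void io_completed(void)
{
        pthread_mutex_lock(&lock);
        if (--queued_ios == 0)
                pthread_cond_broadcast(&waiters);
        pthread_mutex_unlock(&lock);
}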

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-04 23:02         ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-04 23:02 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hello, Vivek.

On Wed, Apr 04, 2012 at 04:32:39PM -0400, Vivek Goyal wrote:
> > Let's say we have iops/bps limitation applied on top of proportional IO
> > distribution
> 
> We already do that. First IO is subjected to throttling limit and only 
> then it is passed to the elevator to do the proportional IO. So throttling
> is already stacked on top of proportional IO. The only question is 
> should it be pushed to even higher layers or not.

Yeah, I know we already can do that.  I was trying to give an example
of non-trivial IO limit configuration.

> So the split model is definitely confusing. Anyway, the block layer will
> not apply the limits again, as flusher IO goes into the root cgroup, which
> is generally unthrottled. Or the flusher could mark the bios with a flag
> saying "do not throttle this bio again", since they have already been
> throttled. So throttling twice is probably not an issue.
> 
> In summary, agreed that the split is confusing, but it fills a gap that
> exists today.

It's not only confusing.  It's broken.  So, what you're saying is that
there's no provision to orchestrate between buffered writes and other
types of IOs.  It would essentially work as if there were two separate
controls, each governing one of two heavily interacting parts, with no
designed coordination between them.  What the....

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-05 15:10             ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-05 15:10 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 04, 2012 at 02:42:28PM -0700, Fengguang Wu wrote:
> On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote:
> > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > 
> > [..]
> > > The sweet split point would be for balance_dirty_pages() to do cgroup
> > > aware buffered write throttling and leave other IOs to the current
> > > blkcg. For this to work well as a total solution for end users, I hope
> > > we can cooperate and figure out ways for the two throttling entities
> > > to work well with each other.
> > 
> > Throttling read + direct IO, higher up has few issues too. Users will
> 
> Yeah, I have a bit of worry about high-layer throttling, too.
> Anyway, here are the ideas.
> 
> > not like that a task got blocked as it tried to submit a read from a
> > throttled group.
> 
> That's not the same issue I was worried about :) Throttling is about
> inserting small sleeps/waits at selected points. For reads, the ideal
> sleep point is immediately after the readahead IO is submitted, at the
> end of __do_page_cache_readahead(). The same should be applicable to
> direct IO.

But after a read, the process might want to process the read data and
do something else altogether, so throttling the process after the read
completes is not the best thing.

> 
> > Current async behavior works well where we queue up the
> > bio from the task in throttled group and let task do other things. Same
> > is true for AIO where we would not like to block in bio submission.
> 
> For AIO, we'll need to delay the IO completion notification or status
> update, which may involve computing some delay time and delaying the
> calls to io_complete() with the help of some delayed work queue. There
> may be more issues to deal with, as I didn't look into aio.c carefully.

I don't know, but delaying completion notifications sounds odd to me. So
you don't throttle while submitting requests? That does not help with
pressure on the request queue, as the process can dump a whole bunch of
IO without waiting for completion.

What I like better is that AIO is allowed to submit a bunch of IO till
it hits the nr_requests limit on the request queue and is then blocked
because the request queue is too busy and not enough request descriptors
are free.
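
That back pressure is essentially a counting limit on in-flight request
descriptors, something like this (illustrative only, not the real
request allocation code):

#include <semaphore.h>

#define NR_REQUESTS 128                 /* request descriptors per queue */

static sem_t free_requests;

static void request_queue_init(void)
{
        sem_init(&free_requests, 0, NR_REQUESTS);
}

/* Submission path: blocks here once every descriptor is in flight. */
static void get_request(void)
{
        sem_wait(&free_requests);
}

/* Completion path: frees a descriptor and unblocks one submitter. */
static void put_request(void)
{
        sem_post(&free_requests);
}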

> 
> The thing that worried me is that in the proportional throttling case,
> the high-level throttling works on the *estimated* task_ratelimit =
> disk_bandwidth / N, where N is the number of read IO tasks. When N
> suddenly changes from 2 to 1, it may take 1 second for the estimated
> task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth,
> during which time the disk won't get 100% utilized because of the
> temporary over-throttling of the remaining IO task.

I thought we were only considering the case of absolute throttling in
higher layers. Proportional IO will continue to be in CFQ. I don't think
we need to push proportional IO into higher layers.

> 
> This is not a problem when throttling at the block/cfq layer, since it
> has the full information of pending requests and should not depend on
> such estimations.

CFQ does not even look at pending requests. It just maintains a bunch
of IO queues and selects one queue to dispatch IO from based on its
weight. So proportional IO comes very naturally to CFQ.
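
As a toy model of why weights fall out naturally from queue selection
(nothing below is actual CFQ code, just the general idea):

#include <stddef.h>

struct io_queue {
        unsigned int weight;            /* cgroup weight */
        unsigned long long vtime;       /* weighted service received */
        int has_io;                     /* queue is backlogged */
};

/* Pick the backlogged queue that is furthest behind in weighted service. */
static struct io_queue *select_queue(struct io_queue *q, int nr)
{
        struct io_queue *best = NULL;
        int i;

        for (i = 0; i < nr; i++) {
                if (!q[i].has_io)
                        continue;
                if (!best || q[i].vtime < best->vtime)
                        best = &q[i];
        }
        return best;
}

/* After dispatching, charge service inversely proportional to weight. */
static void charge_queue(struct io_queue *q, unsigned long long service)
{
        q->vtime += service * 1000ULL / q->weight;
}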

> 
> The workaround I can think of is to put the throttled task into a wait
> queue, and let the block layer wake up the waiters when the IO queue runs
> empty. This should be able to avoid most of the disk idle time.

Again, I am not convinced that proportional IO should go in higher layers.

For fast devices we are already suffering from queue locking overhead,
and Jens seems to have patches for multi-queue. By trying to implement
something at a higher layer, that locking overhead will show up there
too, and we will end up doing something similar to multi-queue there,
which is not desirable.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-05 16:31             ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-05 16:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hey, Vivek.

On Wed, Apr 04, 2012 at 04:18:16PM -0400, Vivek Goyal wrote:
> Hey how about reconsidering my other proposal for which I had posted
> the patches. And that is keep throttling still at device level. Reads
> and direct IO get throttled asynchronously but buffered writes get
> throttled synchronously.
> 
> Advantages of this scheme.
> 
> - There are no separate knobs.
> 
> - All the IO (read, direct IO and buffered write) is controlled using
>   same set of knobs and goes in queue of same cgroup.
> 
> - Writeback logic has no knowledge of throttling. It just invokes a 
>   hook into throttling logic of device queue.
> 
> I guess this is a hybrid of active writeback throttling and back pressure
> mechanism.
> 
> But it still does not solve the NFS issue as well as for direct IO,
> filesystems still can get serialized, so metadata issue still needs to 
> be resolved. So one can argue that why not go for full "back pressure"
> method, despite it being more complex.
> 
> Here is the link, just to refresh the memory. Something to keep in mind
> while assessing alternatives.
> 
> https://lkml.org/lkml/2011/6/28/243

Hmmm... so, this only works for blk-throttle and not with the weight.
How do you manage interaction between buffered writes and direct
writes for the same cgroup?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-05 16:38       ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-05 16:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hey, Vivek.

On Wed, Apr 04, 2012 at 11:49:09AM -0700, Tejun Heo wrote:
> > I am not sure what are you trying to say here. But primarily blk-throttle
> > will throttle read and direct IO. Buffered writes will go to root cgroup
> > which is typically unthrottled.
> 
> Ooh, my bad then.  Anyways, then the same applies to blk-throttle.
> > Our current implementation essentially collapses in the face of
> write-heavy workload.

I went through the code and couldn't find where blk-throttle is
discriminating async IOs.  Were you saying that blk-throttle currently
doesn't throttle because those IOs aren't associated with the dirtying
task?  If so, note that it's different from cfq which explicitly
assigns all async IOs when choosing cfqq even if we fix tagging.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-05 17:09               ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-05 17:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Thu, Apr 05, 2012 at 09:31:13AM -0700, Tejun Heo wrote:
> Hey, Vivek.
> 
> On Wed, Apr 04, 2012 at 04:18:16PM -0400, Vivek Goyal wrote:
> > Hey how about reconsidering my other proposal for which I had posted
> > the patches. And that is keep throttling still at device level. Reads
> > and direct IO get throttled asynchronously but buffered writes get
> > throttled synchronously.
> > 
> > Advantages of this scheme.
> > 
> > - There are no separate knobs.
> > 
> > - All the IO (read, direct IO and buffered write) is controlled using
> >   same set of knobs and goes in queue of same cgroup.
> > 
> > - Writeback logic has no knowledge of throttling. It just invokes a 
> >   hook into throttling logic of device queue.
> > 
> > I guess this is a hybrid of active writeback throttling and back pressure
> > mechanism.
> > 
> > But it still does not solve the NFS issue as well as for direct IO,
> > filesystems still can get serialized, so metadata issue still needs to 
> > be resolved. So one can argue that why not go for full "back pressure"
> > method, despite it being more complex.
> > 
> > Here is the link, just to refresh the memory. Something to keep in mind
> > while assessing alternatives.
> > 
> > https://lkml.org/lkml/2011/6/28/243
> 
> Hmmm... so, this only works for blk-throttle and not with the weight.
> How do you manage interaction between buffered writes and direct
> writes for the same cgroup?
> 

Yes, it is only for blk-throttle. We just account for buffered writes
in balance_dirty_pages() instead of when they are actually submitted to
the device by the flusher thread.

IIRC, I just had two queues. In one queue I had bios and in the other
queue I had tasks with information about how much memory they are
dirtying. So I did round-robin dispatch between the two queues depending
on the throttling rate: allow a bio to be dispatched from the direct IO
queue, then look at the other queue, see how much IO the other task
wanted to do, and once sufficient time had passed based on the
throttling rate, remove that task from my wait queue and wake it up.

That way it becomes equivalent to two IO paths (direct IO + buffered
write) doing IO into a single pipe which has the throttling limit. Both
IOs are subjected to the same common limit (and there is no split). We
just round-robin between the two types of IO and try to divide the
available bandwidth equally (this of course could be made tunable).
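
In very rough pseudo-C, the dispatch logic was conceptually like this
(all names are from memory and purely illustrative, this is not the
posted patch itself):

struct bio_item {
        long bytes;                     /* size of a queued direct IO bio */
};

struct dirty_waiter {
        long bytes;                     /* how much the task wants to dirty */
};

static long budget;                     /* bytes the group may do right now */

/* Refilled periodically from the group's configured bps limit. */
static void refill_budget(long bytes_per_sec, double elapsed_sec)
{
        budget += (long)(bytes_per_sec * elapsed_sec);
}

/* One round-robin pass: give each of the two queues a turn at the budget. */
static void dispatch_round(struct bio_item *bio, struct dirty_waiter *task)
{
        if (bio && budget >= bio->bytes) {
                budget -= bio->bytes;
                /* submit the direct IO bio to the device here */
        }
        if (task && budget >= task->bytes) {
                budget -= task->bytes;
                /* wake the task blocked in balance_dirty_pages() here */
        }
}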

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-05 17:13           ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-05 17:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Thu, Apr 05, 2012 at 09:38:54AM -0700, Tejun Heo wrote:
> Hey, Vivek.
> 
> On Wed, Apr 04, 2012 at 11:49:09AM -0700, Tejun Heo wrote:
> > > I am not sure what are you trying to say here. But primarily blk-throttle
> > > will throttle read and direct IO. Buffered writes will go to root cgroup
> > > which is typically unthrottled.
> > 
> > Ooh, my bad then.  Anyways, then the same applies to blk-throttle.
> > Our current implementation essentially collapses in the face of
> > write-heavy workload.
> 
> I went through the code and couldn't find where blk-throttle is
> discriminating async IOs.  Were you saying that blk-throttle currently
> doesn't throttle because those IOs aren't associated with the dirtying
> task?

Yes, that's what I meant. Currently most of the async IO will come from
the flusher thread, which is in the root cgroup. So all the async IO will
be in the root group, and we typically keep the root group unthrottled.
Sorry for the confusion here.

> If so, note that it's different from cfq which explicitly
> assigns all async IOs when choosing cfqq even if we fix tagging.

Yes. So if we can properly account for the submitter, then for
blk-throttle the async IO will go into the right cgroup. Unlike CFQ,
there is no hard-coded logic to keep async IO in a particular group; it
is just a matter of getting the right cgroup information.
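
Conceptually it is just this (the structures and helpers are invented
for illustration, not the actual mm/block interfaces):

struct page_meta {
        int dirtier_cgroup;             /* recorded when the page is dirtied */
};

struct bio_meta {
        int charge_cgroup;              /* the group blk-throttle will charge */
};

/* Write path, running in the context of the dirtying task. */
static void note_page_dirtier(struct page_meta *pg, int current_cgroup)
{
        pg->dirtier_cgroup = current_cgroup;
}

/* Flusher, later, when it builds a bio out of dirty pages. */
static void attribute_bio(struct bio_meta *bio, const struct page_meta *pg)
{
        /* charge the original dirtier, not the flusher's root cgroup */
        bio->charge_cgroup = pg->dirtier_cgroup;
}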

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]             ` <20120405151026.GB23999-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-06  0:32               ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-06  0:32 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Vivek,

I totally agree that direct IOs can be best handled in block/cfq layers.

On Thu, Apr 05, 2012 at 11:10:26AM -0400, Vivek Goyal wrote:
> On Wed, Apr 04, 2012 at 02:42:28PM -0700, Fengguang Wu wrote:
> > On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote:
> > > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > > 
> > > [..]
> > > > The sweet split point would be for balance_dirty_pages() to do cgroup
> > > > aware buffered write throttling and leave other IOs to the current
> > > > blkcg. For this to work well as a total solution for end users, I hope
> > > > we can cooperate and figure out ways for the two throttling entities
> > > > to work well with each other.
> > > 
> > > Throttling read + direct IO, higher up has few issues too. Users will
> > 
> > Yeah I have a bit worry about high layer throttling, too.
> > Anyway here are the ideas.
> > 
> > > not like that a task got blocked as it tried to submit a read from a
> > > throttled group.
> > 
> > That's not the same issue I worried about :) Throttling is about
> > inserting small sleep/waits into selected points. For reads, the ideal
> > sleep point is immediately after readahead IO is summited, at the end
> > of __do_page_cache_readahead(). The same should be applicable to
> > direct IO.
> 
> But after a read the process might want to process the read data and
> do something else altogether. So throttling the process after completing
> the read is not the best thing.

__do_page_cache_readahead() returns immediately after queuing the read
IOs. It may block occasionally on metadata IO but not data IO.

> > > Current async behavior works well where we queue up the
> > > bio from the task in throttled group and let task do other things. Same
> > > is true for AIO where we would not like to block in bio submission.
> > 
> > For AIO, we'll need to delay the IO completion notification or status
> > update, which may involve computing some delay time and delay the
> > calls to io_complete() with the help of some delayed work queue. There
> > may be more issues to deal with as I didn't look into aio.c carefully.
> 
> I don't know but delaying compltion notifications sounds odd to me. So
> you don't throttle while submitting requests. That does not help with
> pressure on request queue as process can dump whole bunch of IO without
> waiting for completion?
> 
> What I like better that AIO is allowed to submit bunch of IO till it
> hits the nr_requests limit on request queue and then it is blocked as
> request queue is too busy and not enough request descriptors are free.

You are right. Throttling direct IO and AIO in high layer has the
problem of added delays and less queue fullness. I suspect it may also
lead to extra cfq anticipatory idling and disk idles. And it won't be
able to deal with ioprio. All in all there are lots of problems actually.

> > The thing worried me is that in the proportional throttling case, the
> > high level throttling works on the *estimated* task_ratelimit =
> > disk_bandwidth / N, where N is the number of read IO tasks. When N
> > suddenly changes from 2 to 1, it may take 1 second for the estimated
> > task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth,
> > during which time the disk won't get 100% utilized because of the
> > temporally over-throttling of the remaining IO task.
> 
> I thought we were only considering the case of absolute throttling in
> higher layers. Proportional IO will continue to be in CFQ. I don't think
> we need to push proportional IO in higher layers.

Agreed for direct IO.

As for buffered writes, I'm seriously considering the possibility of
doing proportional IO control in balance_dirty_pages().

I'd take this as the central problem of this thread. If the CFQ
proportional IO controller can do its work well for direct IOs and
leave the buffered writes to the balance_dirty_pages() proportional IO
controller, it would result in a simple and efficient "feedback" system
(comparing to the "push back" idea).

I don't really know about any real use cases. However it seems to me
(and perhaps Jan Kara) the most user friendly and manageable IO
controller interfaces would allow the user to divide disk time (no
matter it's used for reads or writes, direct or buffered IOs) among
the cgroups. Then allow each cgroup to further split up disk time (or
bps/iops) to different types of IO.

For simplicity, let's assume only direct/buffered writes are happening
and the user configures 3 blkio cgroups A, B, C with equal split of
disk time and equal direct:buffered splits inside each cgroup.

In the case of

        A:      1 direct write dd + 1 buffered write dd
        B:      1 direct write dd
        C:      1 buffered write dd

The dd tasks should ideally be throttled to

        A.direct:       1/6 disk time
        A.buffered:     1/6 disk time
        B.direct:       1/3 disk time
        C.buffered:     1/3 disk time

So is it possible for the proportional block IO controller to throttle
direct IOs to

        A.direct:       1/6 disk time
        B.direct:       1/3 disk time

and leave the remaining 1/2 disk time to buffered writes from the
flusher thread?

Then I promise that balance_dirty_pages() will be able to throttle the
buffered writes to:

        A.buffered:     1/6 disk time
        C.buffered:     1/3 disk time

thanks to the fact that the balance_dirty_pages() throttling algorithm
is pretty adaptive. It will be able to work well with the blkio
throttling to achieve the throttling goals.

In the above case,

        equal split of disk time == equal split of write bandwidth

since all cgroups run the same type of workload.
balance_dirty_pages() will be able to work in that
cooperative way after adding some direct IO rate accounting.

In order to deal with mixed random/sequential workloads,
balance_dirty_pages() will also need some disk time stats feedback.
It will then throttle the dirtiers so that the disk time goals are
matched in long run.

> > This is not a problem when throttling at the block/cfq layer, since it
> > has the full information of pending requests and should not depend on
> > such estimations.
> 
> CFQ does not even look at pending requests. It just maintains bunch
> of IO queues and selects one queue to dispatch IO from based on its
> weight. So proportional IO comes very naturally to CFQ.

Sure. Nice work!

> > 
> > The workaround I can think of, is to put the throttled task into a wait
> > queue, and let block layer wake up the waiters when the IO queue runs
> > empty. This should be able to avoid most disk idle time.
> 
> Again, I am not convinced that proportional IO should go in higher layers.
> 
> For fast devices we are already suffering from queue locking overhead and
> Jens seems to have patches for multi queue. Now by trying to implement
> something at higher layer, that locking overhead will show up there too
> and we will end up doing something similar to multi queue there and it
> is not desirable.

Sure, yeah it's a hack. I was not really happy with it.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-05 15:10             ` Vivek Goyal
@ 2012-04-06  0:32               ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-06  0:32 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Vivek,

I totally agree that direct IOs are best handled in the block/cfq layers.

On Thu, Apr 05, 2012 at 11:10:26AM -0400, Vivek Goyal wrote:
> On Wed, Apr 04, 2012 at 02:42:28PM -0700, Fengguang Wu wrote:
> > On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote:
> > > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > > 
> > > [..]
> > > > The sweet split point would be for balance_dirty_pages() to do cgroup
> > > > aware buffered write throttling and leave other IOs to the current
> > > > blkcg. For this to work well as a total solution for end users, I hope
> > > > we can cooperate and figure out ways for the two throttling entities
> > > > to work well with each other.
> > > 
> > > Throttling read + direct IO higher up has a few issues too. Users will
> > 
> > Yeah, I have a bit of worry about high-layer throttling, too.
> > Anyway, here are the ideas.
> > 
> > > not like that a task got blocked as it tried to submit a read from a
> > > throttled group.
> > 
> > That's not the same issue I worried about :) Throttling is about
> > inserting small sleep/waits into selected points. For reads, the ideal
> > sleep point is immediately after the readahead IO is submitted, at the end
> > of __do_page_cache_readahead(). The same should be applicable to
> > direct IO.
> 
> But after a read the process might want to process the read data and
> do something else altogether. So throttling the process after completing
> the read is not the best thing.

__do_page_cache_readahead() returns immediately after queuing the read
IOs. It may block occasionally on metadata IO, but not on data IO.
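
For what it's worth, the sleep point could look roughly like the sketch
below, inside mm/readahead.c. throttle_read_pages() is a made-up hook
(analogous in spirit to balance_dirty_pages_ratelimited());
__do_page_cache_readahead() is the existing static helper there.

/*
 * Rough sketch only: charge the batch of readahead pages against the
 * task's rate limit right after the read IO has been queued, so the
 * task sleeps while the device already has its requests.
 */
static int do_readahead_throttled(struct address_space *mapping,
				  struct file *filp, pgoff_t offset,
				  unsigned long nr_to_read)
{
	int nr_queued;

	/* existing code path: allocate pages and queue the read IO */
	nr_queued = __do_page_cache_readahead(mapping, filp, offset,
					      nr_to_read, 0);

	/*
	 * New, assumed hook: sleep long enough for this task's read
	 * rate to converge on its cgroup's limit, the same way
	 * balance_dirty_pages() paces dirtiers.
	 */
	if (nr_queued > 0)
		throttle_read_pages(current, nr_queued);

	return nr_queued;
}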

> > > The current async behavior works well where we queue up the
> > > bio from the task in the throttled group and let the task do other things.
> > > The same is true for AIO, where we would not like to block in bio submission.
> > 
> > For AIO, we'll need to delay the IO completion notification or status
> > update, which may involve computing some delay time and delay the
> > calls to io_complete() with the help of some delayed work queue. There
> > may be more issues to deal with as I didn't look into aio.c carefully.
> 
> I don't know, but delaying completion notifications sounds odd to me. So
> you don't throttle while submitting requests. That does not help with
> pressure on the request queue, as a process can dump a whole bunch of IO
> without waiting for completion?
> 
> What I like better is that AIO is allowed to submit a bunch of IO until it
> hits the nr_requests limit on the request queue and is then blocked because
> the request queue is too busy and not enough request descriptors are free.

You are right. Throttling direct IO and AIO in a higher layer has the
problem of added delays and reduced queue fullness. I suspect it may also
lead to extra cfq anticipatory idling and disk idle time. And it won't be
able to deal with ioprio. All in all, there are lots of problems actually.

> > The thing that worried me is that in the proportional throttling case, the
> > high level throttling works on the *estimated* task_ratelimit =
> > disk_bandwidth / N, where N is the number of read IO tasks. When N
> > suddenly changes from 2 to 1, it may take 1 second for the estimated
> > task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth,
> > during which time the disk won't get 100% utilized because of the
> > temporary over-throttling of the remaining IO task.
> 
> I thought we were only considering the case of absolute throttling in
> higher layers. Proportional IO will continue to be in CFQ. I don't think
> we need to push proportional IO in higher layers.

Agreed for direct IO.

As for buffered writes, I'm seriously considering the possibility of
doing proportional IO control in balance_dirty_pages().

I'd take this as the central problem of this thread. If the CFQ
proportional IO controller can do its work well for direct IOs and
leave the buffered writes to the balance_dirty_pages() proportional IO
controller, it would result in a simple and efficient "feedback" system
(compared to the "push back" idea).

I don't really know about any real use cases. However, it seems to me
(and perhaps to Jan Kara) that the most user-friendly and manageable IO
controller interface would allow the user to divide disk time (no
matter whether it's used for reads or writes, direct or buffered IOs)
among the cgroups, and then allow each cgroup to further split its disk
time (or bps/iops) among the different types of IO.

For simplicity, let's assume only direct/buffered writes are happening
and the user configures 3 blkio cgroups A, B, C with an equal split of
disk time and an equal direct:buffered split inside each cgroup.

In the case of

        A:      1 direct write dd + 1 buffered write dd
        B:      1 direct write dd
        C:      1 buffered write dd

The dd tasks should ideally be throttled to

        A.direct:       1/6 disk time
        A.buffered:     1/6 disk time
        B.direct:       1/3 disk time
        C.buffered:     1/3 disk time

So is it possible for the proportional block IO controller to throttle
direct IOs to

        A.direct:       1/6 disk time
        B.direct:       1/3 disk time

and leave the remaining 1/2 of the disk time (1 - 1/6 - 1/3) to
buffered writes from the flusher thread?

Then I promise that balance_dirty_pages() will be able to throttle the
buffered writes to:

        A.buffered:     1/6 disk time
        C.buffered:     1/3 disk time

thanks to the fact that the balance_dirty_pages() throttling algorithm
is pretty adaptive. It will be able to work well with the blkio
throttling to achieve the throttling goals.

In the above case,

        equal split of disk time == equal split of write bandwidth

since all cgroups run the same type of workload.
balance_dirty_pages() will be able to work in that
cooperative way after adding some direct IO rate accounting.

In order to deal with mixed random/sequential workloads,
balance_dirty_pages() will also need some disk time stats feedback.
It will then throttle the dirtiers so that the disk time goals are
matched in the long run.
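
That feedback could be as simple as the sketch below. All of the names
are invented; the real thing would reuse the existing dirty ratelimit
estimation and merely add one more scaling factor, and the existing
balance_dirty_pages() smoothing would keep it from oscillating.

/*
 * Invented sketch: scale the per-task ratelimit by the ratio of the
 * cgroup's disk-time target to its measured disk-time usage, as
 * reported by the block layer.  Over-use shrinks the ratelimit,
 * under-use grows it.
 */
struct wb_disk_time_feedback {
	unsigned long target_pct;	/* configured share of disk time, 0..100 */
	unsigned long used_pct;		/* measured share of disk time, 0..100  */
};

static unsigned long scale_task_ratelimit(unsigned long base_ratelimit,
					   const struct wb_disk_time_feedback *fb)
{
	if (!fb->used_pct)		/* no stats yet, keep current estimate */
		return base_ratelimit;

	return base_ratelimit * fb->target_pct / fb->used_pct;
}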

> > This is not a problem when throttling at the block/cfq layer, since it
> > has the full information of pending requests and should not depend on
> > such estimations.
> 
> CFQ does not even look at pending requests. It just maintains a bunch
> of IO queues and selects one queue to dispatch IO from based on its
> weight. So proportional IO comes very naturally to CFQ.

Sure. Nice work!

> > 
> > The workaround I can think of is to put the throttled task into a wait
> > queue, and let the block layer wake up the waiters when the IO queue runs
> > empty. This should be able to avoid most disk idle time.
> 
> Again, I am not convinced that proportional IO should go in higher layers.
> 
> For fast devices we are already suffering from queue locking overhead, and
> Jens seems to have patches for multi-queue. Now, by trying to implement
> something at a higher layer, that locking overhead will show up there too
> and we will end up doing something similar to multi-queue there, which
> is not desirable.

Sure, yeah it's a hack. I was not really happy with it.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-06  9:59         ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-06  9:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hi Tejun,

On Wed, Apr 04, 2012 at 12:33:55PM -0700, Tejun Heo wrote:
> Hey, Fengguang.
> 
> On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > Yeah it should be trivial to apply the balance_dirty_pages()
> > throttling algorithm to the read/direct IOs. However up to now I don't
> > see much added value to *duplicate* the current block IO controller
> > functionalities, assuming the current users and developers are happy
> > with it.
> 
> Heh, trust me.  It's half broken and people ain't happy.  I get that

Yeah, although the balance_dirty_pages() IO controller for buffered
writes looks perfect in itself, it's not enough to meet user demands.

The user expectation should be: hey, please throttle *all* IOs from
this cgroup to this amount, either as absolute bps/iops limits or as
some proportional weight value (or both, with whichever is lower taking
effect).  And if necessary, the user may request further limits/weights
for each type of IO inside the cgroup.

Now the blkio cgroup supports direct IO and the balance_dirty_pages()
IO controller supports buffered writes. They provide limits/weights
for either direct IO or buffered writes, which is fine for pure direct
IO or pure buffered write workloads. For the common mixed IO
workloads, it's obviously not enough.

Fortunately, the above gap can be easily filled judging from the
block/cfq IO controller code. By adding some direct IO accounting
and changing several lines of my patches to make use of the collected
stats, the semantics of the blkio.throttle.write_bps interfaces can be
changed from "limit for direct IO" to "limit for direct+buffered IOs".
Ditto for blkio.weight and blkio.write_iops, as long as some
iops/device time stats are made available to balance_dirty_pages().

It would be a fairly *easy* change. :-) It's merely adding some
accounting code; there is no need to change the block IO control
algorithm at all. I'll do the accounting work (which is basically
independent of the IO control) and use the new stats in
balance_dirty_pages().
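
The accounting half could be as small as the sketch below. All of the
identifiers are invented (kernel-style sketch, assuming the usual
linux/atomic.h and linux/math64.h); the only requirement is a per-blkcg
byte counter that the direct IO submission path bumps and that the
writeback bandwidth estimation can sample.

/*
 * Invented sketch of the direct IO accounting: a per-blkcg counter
 * bumped at direct IO submission time, plus a sampling helper that
 * turns it into a rate for balance_dirty_pages() to consume.
 */
struct blkcg_dio_account {
	atomic64_t bytes;		/* direct IO bytes submitted so far */
};

/* assumed hook in the direct IO submission path */
static void blkcg_dio_charge(struct blkcg_dio_account *acct, size_t bytes)
{
	atomic64_add(bytes, &acct->bytes);
}

/* assumed: sampled from the existing writeback bandwidth update path */
static u64 blkcg_dio_rate(struct blkcg_dio_account *acct,
			  u64 *last_bytes, unsigned long elapsed_ms)
{
	u64 cur = atomic64_read(&acct->bytes);
	u64 delta = cur - *last_bytes;

	*last_bytes = cur;
	return elapsed_ms ? div64_u64(delta * MSEC_PER_SEC, elapsed_ms) : 0;
}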

The only problem I can see now is that balance_dirty_pages() works
per-bdi and blkcg works per-device. So the two ends may not match
nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc, where
sdb is shared by lv0 and lv1. However, such situations should be rare
and much more acceptable than the problems arising from the "push
back" approach, which impacts everyone.

> your algorithm can be updated to consider all IOs and I believe that
> but what I don't get is how would such information get to writeback
> and in turn how writeback would enforce the result on reads and direct
> IOs.  Through what path?  Will all reads and direct IOs travel through
> balance_dirty_pages() even direct IOs on raw block devices?  Or would
> the writeback algorithm take the configuration from cfq, apply the
> algorithm and give back the limits to enforce to cfq?  If the latter,
> isn't that at least somewhat messed up?

cfq is working well and don't need any modifications. Let's just make
balance_dirty_pages() cgroup aware and fill the gap of the current
block IO controller.

If the balance_dirty_pages() throttling algorithms will ever be
applied to read and direct IOs, it would be for NFS, CIFS etc. Even
for them, there may be better throttling choices. For example, Trond
mentioned the RPC layer to me during the summit.

> > I did the buffered write IO controller mainly to fill the gap.  If I
> > happen to stand in your way, sorry that's not my initial intention.
> 
> No, no, it's not about standing in my way.  As Vivek said in the other
> reply, it's that the "gap" that you filled was created *because*
> writeback wasn't cgroup aware and now you're in turn filling that gap
> by making writeback work around that "gap".  I mean, my mind boggles.
> Doesn't yours?  I strongly believe everyone's should.

Heh. It's a hard problem indeed. I felt great pains in the IO-less
dirty throttling work. I did a lot reasoning about it, and have in
fact kept cgroup IO controller in mind since its early days. Now I'd
say it's hands down for it to adapt to the gap between the total IO
limit and what's carried out by the block IO controller.

> > It's a pity and surprise that Google as a big user does not buy in
> > this simple solution. You may prefer more comprehensive controls which
> > may not be easily achievable with the simple scheme. However the
> > complexities and overheads involved in throttling the flusher IOs
> > really upsets me. 
> 
> Heh, believe it or not, I'm not really wearing google hat on this
> subject and google's writeback people may have completely different
> opinions on the subject than mine.  In fact, I'm not even sure how
> much "work" time I'll be able to assign to this.  :(

OK, understand.

> > The sweet split point would be for balance_dirty_pages() to do cgroup
> > aware buffered write throttling and leave other IOs to the current
> > blkcg. For this to work well as a total solution for end users, I hope
> > we can cooperate and figure out ways for the two throttling entities
> > to work well with each other.
> 
> There's where I'm confused.  How is the said split supposed to work?
> They aren't independent.  I mean, who gets to decide what and where
> are those decisions enforced?

Yeah it's not independent. It's about

- keep block IO cgroup untouched (in its current algorithm, for
  throttling direct IO)

- let balance_dirty_pages() adapt to the throttling target
  
        buffered_write_limit = total_limit - direct_IOs

> > What I'm interested is, what's Google and other users' use schemes in
> > practice. What's their desired interfaces. Whether and how the
> > combined bdp+blkcg throttling can fulfill the goals.
> 
> I'm not too privy of mm and writeback in google and even if so I
> probably shouldn't talk too much about it.  Confidentiality and all.
> That said, I have the general feeling that goog already figured out
> how to at least work around the existing implementation and would be
> able to continue no matter how upstream development fans out.
> 
> That said, wearing the cgroup maintainer and general kernel
> contributor hat, I'd really like to avoid another design mess up.

To me it looks a pretty clean split and find it to be an easy
solution (after sorting it out the hard way). I'll show the code and
test results after some time.

> > > Let's please keep the layering clear.  IO limitations will be applied
> > > at the block layer and pressure will be formed there and then
> > > propagated upwards eventually to the originator.  Sure, exposing the
> > > whole information might result in better behavior for certain
> > > workloads, but down the road, say, in three or five years, devices
> > > which can be shared without worrying too much about seeks might be
> > > commonplace and we could be swearing at a disgusting structural mess,
> > > and sadly various cgroup support seems to be a prominent source of
> > > such design failures.
> > 
> > Super fast storages are coming which will make us regret to make the
> > IO path over complex.  Spinning disks are not going away anytime soon.
> > I doubt Google is willing to afford the disk seek costs on its
> > millions of disks and has the patience to wait until switching all of
> > the spin disks to SSD years later (if it will ever happen).
> 
> This is new.  Let's keep the damn employer out of the discussion.
> While the area I work on is affected by my employment (writeback isn't
> even my area BTW), I'm not gonna do something adverse to upstream even
> if it's beneficial to google and I'm much more likely to do something
> which may hurt google a bit if it's gonna benefit upstream.
> 
> As for the faster / newer storage argument, that is *exactly* why we
> want to keep the layering proper.  Writeback works from the pressure
> from the IO stack.  If IO technology changes, we update the IO stack
> and writeback still works from the pressure.  It may need to be
> adjusted but the principles don't change.

To me, balance_dirty_pages() is *the* proper layer for buffered writes.
It's always there doing 1:1 proportional throttling. Then you try to
kick in to add *double* throttling in block/cfq layer. Now the low
layer may enforce 10:1 throttling and push balance_dirty_pages() away
from its balanced state, leading to large fluctuations and program
stalls.  This can be avoided by telling balance_dirty_pages(): "your
balance goal is no longer 1:1, but 10:1". With this information
balance_dirty_pages() will behave right. Then there is the question:
if balance_dirty_pages() will work just well provided the information,
why bother doing the throttling at low layer and "push back" the
pressure all the way up?

> > It's obvious that your below proposal involves a lot of complexities,
> > overheads, and will hurt performance. It basically involves
> 
> Hmmm... that's not the impression I got from the discussion.
> According to Jan, applying the current writeback logic to cgroup'fied
> bdi shouldn't be too complex, no?

In the sense of "avoidable" complexity :-)

> > - running concurrent flusher threads for cgroups, which adds back the
> >   disk seeks and lock contentions. And still has problems with sync
> >   and shared inodes.
> 
> I agree this is an actual concern but if the user wants to split one
> spindle to multiple resource domains, there's gonna be considerable
> amount of overhead no matter what.  If you want to improve how block
> layer handles the split, you're welcome to dive into the block layer,
> where the split is made, and improve it.
> 
> > - splitting device queue for cgroups, possibly scaling up the pool of
> >   writeback pages (and locked pages in the case of stable pages) which
> >   could stall random processes in the system
> 
> Sure, it'll take up more buffering and memory but that's the overhead
> of the cgroup business.  I want it to be less intrusive at the cost of
> somewhat more resource consumption.  ie. I don't want writeback logic
> itself deeply involved in block IO cgroup enforcement even if that
> means somewhat less efficient resource usage.

The balance_dirty_pages() is already deeply involved in dirty throttling.
As you can see from this patchset, the same algorithms can be extended
trivially to work with cgroup IO limits.

buffered write IO controller in balance_dirty_pages()
https://lkml.org/lkml/2012/3/28/275

It does not require forking off the flusher threads and splitting up
the IO queue at all.

> > - the mess of metadata handling
> 
> Does throttling from writeback actually solve this problem?  What
> about fsync()?  Does that already go through balance_dirty_pages()?

balance_dirty_pages() does throttling at safe points outside of fs
transactions/locks.

fsync() only submits IO for already dirtied pages and won't be
throttled by balance_dirty_pages(). Throttling happens at earlier
times when the task is dirtying the pages.

> > - unnecessarily coupled with memcg, in order to take advantage of the
> >   per-memcg dirty limits for balance_dirty_pages() to actually convert
> >   the "pushed back" dirty pages pressure into lowered dirty rate. Why
> >   the hell the users *have to* setup memcg (suffering from all the
> >   inconvenience and overheads) in order to do IO throttling?  Please,
> >   this is really ugly! And the "back pressure" may constantly push the
> >   memcg dirty pages to the limits. I'm not going to support *miss use*
> >   of per-memcg dirty limits like this!
> 
> Writeback sits between blkcg and memcg and it indeed can be hairy to
> consider both sides especially given the current sorry complex state
> of cgroup and I can see why it would seem tempting to add a separate
> controller or at least knobs to support that.  That said, I *think*
> given that memcg controls all other memory parameters it probably
> would make most sense giving that parameter to memcg too.  I don't
> think this is really relevant to this discussion tho.  Who owns
> dirty_limits is a separate issue.

In the "back pressure" scheme, memcg is a must because only it has all
the infrastructure to track dirty pages upon which you can apply some
dirty_limits. Don't tell me you want to account dirty pages in blkcg...

> > I cannot believe you would keep overlooking all the problems without
> > good reasons. Please do tell us the reasons that matter.
> 
> Well, I tried and I hope some of it got through.  I also wrote a lot
> of questions, mainly regarding how what you have in mind is supposed
> to work through what path.  Maybe I'm just not seeing what you're
> seeing but I just can't see where all the IOs would go through and
> come together.  Can you please elaborate more on that?

What I can see is, it looks pretty simple and nature to let
balance_dirty_pages() fill the gap towards a total solution :-)

- add direct IO accounting in some convenient point of the IO path
  IO submission or completion point, either is fine.

- change several lines of the buffered write IO controller to
  integrate the direct IO rate into the formula to fit the "total
  IO" limit

- in future, add more accounting as well as feedback control to make
  balance_dirty_pages() work with IOPS and disk time

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-06  9:59         ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-06  9:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hi Tejun,

On Wed, Apr 04, 2012 at 12:33:55PM -0700, Tejun Heo wrote:
> Hey, Fengguang.
> 
> On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:
> > Yeah it should be trivial to apply the balance_dirty_pages()
> > throttling algorithm to the read/direct IOs. However up to now I don't
> > see much added value to *duplicate* the current block IO controller
> > functionalities, assuming the current users and developers are happy
> > with it.
> 
> Heh, trust me.  It's half broken and people ain't happy.  I get that

Yeah, although the balance_dirty_pages() IO controller for buffered
writes looks perfect in itself, it's not enough to meet user demands.

The user expectation should be: hey, please throttle *all* IOs from
this cgroup to this amount, either as absolute bps/iops limits or as
some proportional weight value (or both, with whichever is lower
taking effect).  And if necessary, the user may request further
limits/weights for each type of IO inside the cgroup.

Right now the blkio cgroup supports direct IO and the
balance_dirty_pages() IO controller supports buffered writes. Each
provides limits/weights for either direct IO or buffered writes, which
is fine for pure direct IO or pure buffered-write workloads. For the
common mixed IO workloads, it's obviously not enough.

Fortunately, judging from the block/cfq IO controller code, the above
gap can be easily filled. By adding some direct IO accounting and
changing several lines of my patches to make use of the collected
stats, the semantics of the blkio.throttle.write_bps interface can be
changed from "limit for direct IO" to "limit for direct+buffered IOs".
Ditto for blkio.weight and blkio.write_iops, as long as some
iops/device-time stats are made available to balance_dirty_pages().
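
To make the accounting half concrete, a rough standalone sketch could
look like this (all names here are invented for illustration; they are
not existing kernel interfaces):

#include <stdint.h>

/* per-cgroup direct IO accounting, sampled over a time window */
struct cgroup_io_stats {
        uint64_t direct_bytes;          /* bytes submitted as direct IO */
        uint64_t window_start_us;       /* start of the sample window */
};

/* called from the direct IO submission (or completion) path */
void account_direct_io(struct cgroup_io_stats *st, uint64_t bytes)
{
        st->direct_bytes += bytes;
}

/* direct IO rate in bytes/sec, for the dirty throttling side to read */
uint64_t direct_io_bps(const struct cgroup_io_stats *st, uint64_t now_us)
{
        uint64_t elapsed = now_us - st->window_start_us;

        return elapsed ? st->direct_bytes * 1000000 / elapsed : 0;
}

balance_dirty_pages() would then read the direct IO rate for the
dirtying task's cgroup and subtract it from the configured total limit.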

It would be a fairly *easy* change. :-) It's merely adding some
accounting code; there is no need to change the block IO control
algorithm at all. I'll do the accounting work (which is basically
independent of the IO control) and use the new stats in
balance_dirty_pages().

The only problem I can see now is that balance_dirty_pages() works
per-bdi while blkcg works per-device. So the two ends may not match up
nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc, where
sdb is shared by lv0 and lv1. However, such setups should be rare, and
the mismatch is much more acceptable than the problems arising from
the "push back" approach, which impacts everyone.

> your algorithm can be updated to consider all IOs and I believe that
> but what I don't get is how would such information get to writeback
> and in turn how writeback would enforce the result on reads and direct
> IOs.  Through what path?  Will all reads and direct IOs travel through
> balance_dirty_pages() even direct IOs on raw block devices?  Or would
> the writeback algorithm take the configuration from cfq, apply the
> algorithm and give back the limits to enforce to cfq?  If the latter,
> isn't that at least somewhat messed up?

cfq is working well and doesn't need any modifications. Let's just
make balance_dirty_pages() cgroup aware and fill the gap left by the
current block IO controller.

If the balance_dirty_pages() throttling algorithms are ever applied to
reads and direct IOs, it would be for NFS, CIFS, etc. Even for them,
there may be better throttling choices; for example, Trond mentioned
the RPC layer to me during the summit.

> > I did the buffered write IO controller mainly to fill the gap.  If I
> > happen to stand in your way, sorry that's not my initial intention.
> 
> No, no, it's not about standing in my way.  As Vivek said in the other
> reply, it's that the "gap" that you filled was created *because*
> writeback wasn't cgroup aware and now you're in turn filling that gap
> by making writeback work around that "gap".  I mean, my mind boggles.
> Doesn't yours?  I strongly believe everyone's should.

Heh. It's a hard problem indeed. I felt great pains in the IO-less
dirty throttling work. I did a lot of reasoning about it, and have in
fact kept the cgroup IO controller in mind since its early days. Now
I'd say the natural step is for it to adapt to the gap between the
total IO limit and what's carried out by the block IO controller.

> > It's a pity and surprise that Google as a big user does not buy into
> > this simple solution. You may prefer more comprehensive controls which
> > may not be easily achievable with the simple scheme. However the
> > complexities and overheads involved in throttling the flusher IOs
> > really upsets me. 
> 
> Heh, believe it or not, I'm not really wearing google hat on this
> subject and google's writeback people may have completely different
> opinions on the subject than mine.  In fact, I'm not even sure how
> much "work" time I'll be able to assign to this.  :(

OK, understood.

> > The sweet split point would be for balance_dirty_pages() to do cgroup
> > aware buffered write throttling and leave other IOs to the current
> > blkcg. For this to work well as a total solution for end users, I hope
> > we can cooperate and figure out ways for the two throttling entities
> > to work well with each other.
> 
> That's where I'm confused.  How is the said split supposed to work?
> They aren't independent.  I mean, who gets to decide what, and where
> are those decisions enforced?

Yeah, it's not independent. It's about:

- keep block IO cgroup untouched (in its current algorithm, for
  throttling direct IO)

- let balance_dirty_pages() adapt to the throttling target
  
        buffered_write_limit = total_limit - direct_IOs
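
As a minimal sketch of that adaptation (again with invented names; the
real thing would live inside balance_dirty_pages()'s existing rate
computation, so treat this only as an illustration of the split):

#include <stdint.h>

/* what balance_dirty_pages() would aim buffered writes at, once it
 * knows the cgroup's total limit and its recent direct IO rate */
uint64_t buffered_write_limit(uint64_t total_limit_bps,
                              uint64_t direct_io_bps)
{
        /* direct IO may momentarily exceed the total limit; don't
         * underflow, just leave nothing for buffered writes */
        if (direct_io_bps >= total_limit_bps)
                return 0;

        return total_limit_bps - direct_io_bps;
}

cfq keeps throttling direct IO exactly as it does today; only the
buffered write target moves.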

> > What I'm interested in is what Google's and other users' usage schemes
> > are in practice, what their desired interfaces are, and whether and how
> > the combined bdp+blkcg throttling can fulfill the goals.
> 
> I'm not too privy to mm and writeback in google and even if so I
> probably shouldn't talk too much about it.  Confidentiality and all.
> That said, I have the general feeling that goog already figured out
> how to at least work around the existing implementation and would be
> able to continue no matter how upstream development fans out.
> 
> That said, wearing the cgroup maintainer and general kernel
> contributor hat, I'd really like to avoid another design mess up.

To me it looks like a pretty clean split, and I find it to be an easy
solution (after sorting it out the hard way). I'll post the code and
test results in a while.

> > > Let's please keep the layering clear.  IO limitations will be applied
> > > at the block layer and pressure will be formed there and then
> > > propagated upwards eventually to the originator.  Sure, exposing the
> > > whole information might result in better behavior for certain
> > > workloads, but down the road, say, in three or five years, devices
> > > which can be shared without worrying too much about seeks might be
> > > commonplace and we could be swearing at a disgusting structural mess,
> > > and sadly various cgroup support seems to be a prominent source of
> > > such design failures.
> > 
> > Super fast storage is coming, which will make us regret making the
> > IO path overly complex.  Spinning disks are not going away anytime soon.
> > I doubt Google is willing to afford the disk seek costs on its
> > millions of disks and has the patience to wait until switching all of
> > the spin disks to SSD years later (if it will ever happen).
> 
> This is new.  Let's keep the damn employer out of the discussion.
> While the area I work on is affected by my employment (writeback isn't
> even my area BTW), I'm not gonna do something adverse to upstream even
> if it's beneficial to google and I'm much more likely to do something
> which may hurt google a bit if it's gonna benefit upstream.
> 
> As for the faster / newer storage argument, that is *exactly* why we
> want to keep the layering proper.  Writeback works from the pressure
> from the IO stack.  If IO technology changes, we update the IO stack
> and writeback still works from the pressure.  It may need to be
> adjusted but the principles don't change.

To me, balance_dirty_pages() is *the* proper layer for buffered writes.
It's always there doing 1:1 proportional throttling. Then you want to
add *double* throttling in the block/cfq layer. Now the lower layer
may enforce 10:1 throttling and push balance_dirty_pages() away from
its balanced state, leading to large fluctuations and program stalls.
This can be avoided by telling balance_dirty_pages(): "your balance
goal is no longer 1:1, but 10:1". With this information,
balance_dirty_pages() will behave correctly. Which raises the
question: if balance_dirty_pages() works just as well when given that
information, why bother doing the throttling at the lower layer and
"pushing back" the pressure all the way up?
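
As toy arithmetic only (invented names, nothing more than the point
above):

#include <stdint.h>

/* if the lower layer enforces, say, a 1/10 ratio for this cgroup,
 * balance_dirty_pages() should aim its tasks at that rate directly
 * instead of oscillating around a 1:1 goal it can never reach */
uint64_t cgroup_dirty_goal_bps(uint64_t writeout_bps,
                               unsigned int ratio_num,  /* e.g. 1  */
                               unsigned int ratio_den)  /* e.g. 10 */
{
        if (!ratio_den)
                return writeout_bps;    /* no ratio known: keep 1:1 */

        return writeout_bps * ratio_num / ratio_den;
}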

> > It's obvious that your below proposal involves a lot of complexities,
> > overheads, and will hurt performance. It basically involves
> 
> Hmmm... that's not the impression I got from the discussion.
> According to Jan, applying the current writeback logic to cgroup'fied
> bdi shouldn't be too complex, no?

In the sense of "avoidable" complexity :-)

> > - running concurrent flusher threads for cgroups, which adds back the
> >   disk seeks and lock contentions. And still has problems with sync
> >   and shared inodes.
> 
> I agree this is an actual concern but if the user wants to split one
> spindle to multiple resource domains, there's gonna be considerable
> amount of overhead no matter what.  If you want to improve how block
> layer handles the split, you're welcome to dive into the block layer,
> where the split is made, and improve it.
> 
> > - splitting device queue for cgroups, possibly scaling up the pool of
> >   writeback pages (and locked pages in the case of stable pages) which
> >   could stall random processes in the system
> 
> Sure, it'll take up more buffering and memory but that's the overhead
> of the cgroup business.  I want it to be less intrusive at the cost of
> somewhat more resource consumption.  ie. I don't want writeback logic
> itself deeply involved in block IO cgroup enforcement even if that
> means somewhat less efficient resource usage.

balance_dirty_pages() is already deeply involved in dirty throttling.
As you can see from this patchset, the same algorithms can be
trivially extended to work with cgroup IO limits.

buffered write IO controller in balance_dirty_pages()
https://lkml.org/lkml/2012/3/28/275

It does not require forking off the flusher threads and splitting up
the IO queue at all.

> > - the mess of metadata handling
> 
> Does throttling from writeback actually solve this problem?  What
> about fsync()?  Does that already go through balance_dirty_pages()?

balance_dirty_pages() does its throttling at safe points, outside of
fs transactions/locks.

fsync() only submits IO for already-dirtied pages and won't be
throttled by balance_dirty_pages(). Throttling happens earlier, when
the task is dirtying the pages.

> > - unnecessarily coupled with memcg, in order to take advantage of the
> >   per-memcg dirty limits for balance_dirty_pages() to actually convert
> >   the "pushed back" dirty page pressure into a lowered dirty rate. Why
> >   the hell do the users *have to* set up memcg (suffering from all the
> >   inconvenience and overheads) in order to do IO throttling?  Please,
> >   this is really ugly! And the "back pressure" may constantly push the
> >   memcg dirty pages to the limits. I'm not going to support *misuse*
> >   of per-memcg dirty limits like this!
> 
> Writeback sits between blkcg and memcg and it indeed can be hairy to
> consider both sides especially given the current sorry complex state
> of cgroup and I can see why it would seem tempting to add a separate
> controller or at least knobs to support that.  That said, I *think*
> given that memcg controls all other memory parameters it probably
> would make most sense giving that parameter to memcg too.  I don't
> think this is really relevant to this discussion tho.  Who owns
> dirty_limits is a separate issue.

In the "back pressure" scheme, memcg is a must because only it has all
the infrastructure to track dirty pages upon which you can apply some
dirty_limits. Don't tell me you want to account dirty pages in blkcg...

> > I cannot believe you would keep overlooking all the problems without
> > good reasons. Please do tell us the reasons that matter.
> 
> Well, I tried and I hope some of it got through.  I also wrote a lot
> of questions, mainly regarding how what you have in mind is supposed
> to work through what path.  Maybe I'm just not seeing what you're
> seeing but I just can't see where all the IOs would go through and
> come together.  Can you please elaborate more on that?

From what I can see, it looks pretty simple and natural to let
balance_dirty_pages() fill the gap towards a total solution :-)

- add direct IO accounting at some convenient point in the IO path;
  the submission or completion point, either is fine

- change several lines of the buffered write IO controller to
  integrate the direct IO rate into the formula to fit the "total
  IO" limit

- in future, add more accounting as well as feedback control to make
  balance_dirty_pages() work with IOPS and disk time

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-04 14:51   ` Vivek Goyal
@ 2012-04-07  8:00     ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-07  8:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, Fengguang Wu, Jan Kara, Jens Axboe, linux-mm,
	sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni,
	lsf

  Hi Vivek,

On Wed 04-04-12 10:51:34, Vivek Goyal wrote:
> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
> [..]
> > IIUC, without cgroup, the current writeback code works more or less
> > like this.  Throwing in cgroup doesn't really change the fundamental
> > design.  Instead of a single pipe going down, we just have multiple
> > pipes to the same device, each of which should be treated separately.
> > Of course, a spinning disk can't be divided that easily and their
> > performance characteristics will be inter-dependent, but the place to
> > solve that problem is where the problem is, the block layer.
> 
> How do you take care of throttling IO in the NFS case in this model?
> The current throttling logic is tied to the block device, and in the
> case of NFS there is no block device.
  Yeah, for throttling NFS or other network filesystems we'd have to come
up with some throttling mechanism at some other level. The problem with
throttling at higher levels is that you have to somehow extract information
from lower levels about the amount of work, so I'm not completely certain
where the right place would be. Possibly it also depends on the intended
usecase - so far I don't know about any real user for this functionality...

> [..]
> > In the discussion, for such implementation, the following obstacles
> > were identified.
> > 
> > * There are a lot of cases where IOs are issued by a task which isn't
> >   the originator.  ie. Writeback issues IOs for pages which are
> >   dirtied by some other tasks.  So, by the time an IO reaches the
> >   block layer, we don't know which cgroup the IO belongs to.
> > 
> >   Recently, block layer has grown support to attach a task to a bio
> >   which causes the bio to be handled as if it were issued by the
> >   associated task regardless of the actual issuing task.  It currently
> >   only allows attaching %current to a bio - bio_associate_current() -
> >   but changing it to support other tasks is trivial.
> > 
> >   We'll need to update the async issuers to tag the IOs they issue but
> >   the mechanism is already there.
> 
> Most likely this tagging will take place in "struct page" and I am not
> sure if we will be allowed to grow the size of "struct page" for this reason.
  We can tag inodes and then bios so this should be fine.
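
Just to sketch the idea with made-up types (this is not the actual
kernel code): remember which cgroup dirtied an inode, and copy that tag
onto the bios writeback later submits for it, so the block layer can
attribute the IO without growing struct page:

struct cgroup_tag { int css_id; };      /* stand-in for a css reference */

struct inode_wb_info {
        struct cgroup_tag dirtier;      /* cgroup that dirtied the inode */
};

struct wb_bio {
        struct cgroup_tag owner;        /* cgroup the IO is charged to */
};

/* at dirtying time, in the dirtying task's context */
void inode_attach_dirtier(struct inode_wb_info *wb, struct cgroup_tag tag)
{
        wb->dirtier = tag;
}

/* at writeback time, in the flusher's context, before bio submission */
void bio_attach_owner(struct wb_bio *bio, const struct inode_wb_info *wb)
{
        bio->owner = wb->dirtier;
}

The real mechanism would presumably be along the lines of the
bio_associate_current() interface Tejun mentions, extended to take the
inode's owner instead of %current.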

> > * Unlike dirty data pages, metadata tends to have strict ordering
> >   requirements and thus is susceptible to priority inversion.  Two
> >   solutions were suggested - 1. allow overdraw for metadata writes so
> >   that low prio metadata writes don't block the whole FS, 2. provide
> >   an interface to query and wait for bdi-cgroup congestion which can
> >   be called from FS metadata paths to throttle metadata operations
> >   before they enter the stream of ordered operations.
> 
> So that probably will mean changing the order of operations too. IIUC,
> in the case of fsync (ordered mode), we open a metadata transaction
> first, then flush all the cached data, and then flush the metadata. So
> if fsync is throttled, all the metadata operations behind it will get
> serialized for ext3/ext4.
> 
> So you seem to be suggesting that we change the design so that the
> metadata operation is not thrown into the ordered stream until we have
> finished writing all the data back to disk? I am not a filesystem
> developer, so I don't know how feasible this change is.
> 
> This is just one of the points. In the past, while talking to Dave
> Chinner, he mentioned that in XFS, if two cgroups fall into the same
> allocation group, there were cases where the IO of one cgroup can get
> serialized behind the other.
> 
> In general, the core of the issue is that filesystems are not cgroup
> aware, and if you do throttling below filesystems, then invariably one
> or another serialization issue will come up, and I am concerned that we
> will be constantly fixing those serialization issues. Or the design
> point could be so central to filesystem design that it can't be changed.
  We talked about this at LSF and Dave Chinner had the idea that we could
make processes wait at the time a transaction is started. At that time
we don't hold any global locks, so a process can be throttled without
serializing other processes. This effectively builds some cgroup awareness
into filesystems, but a pretty simple one, so it should be doable.
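
In rough pseudo-C (invented names, only meant to show where the wait
would sit), the idea is simply:

#include <stdbool.h>

struct bdi_cgroup;      /* a (bdi, cgroup) pair, opaque here */

/* minimal interface the IO layer would export */
extern bool bdi_cgroup_congested(struct bdi_cgroup *bc);
extern void bdi_cgroup_wait(struct bdi_cgroup *bc);

void fs_start_transaction_throttled(struct bdi_cgroup *bc)
{
        /* safe point: no journal locks or ordered state held yet,
         * so only this task waits, nobody is serialized behind it */
        while (bdi_cgroup_congested(bc))
                bdi_cgroup_wait(bc);

        /* ... open the transaction and run it unthrottled ... */
}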

> In general, if you do throttling deeper in the stack and build back
> pressure, then all the layers sitting above should be cgroup aware
> to avoid problems. Two layers identified so far are writeback and
> filesystems. Is it really worth the complexity? How about doing
> throttling in higher layers where IO is entering the kernel, and
> keeping the proportional IO logic at the lowest level so the current
> mechanism of building pressure continues to work?
  I would like to keep a single throttling mechanism for different limiting
methods - i.e. handle proportional IO the same way as IO hard limits. So we
cannot really rely on throttling being work preserving.

The advantage of throttling at the IO layer is that we can keep all the
details inside it and only export pretty minimal information (like whether
a bdi is congested for a given cgroup) to upper layers. If we wanted to do
throttling at upper layers (such as Fengguang's buffered write throttling),
we would need to export the internal details to allow effective
throttling...
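
To illustrate the difference in interface size, in hypothetical
declarations only (none of these names exist today): keeping the
throttling in the IO layer needs roughly one query, while throttling in
upper layers would need the IO layer's internals exported and kept in
sync:

#include <stdbool.h>

struct bdi_cgroup;      /* a (bdi, cgroup) pair, opaque here */

/* throttling kept inside the IO layer: one congestion query suffices */
extern bool bdi_cgroup_congested(struct bdi_cgroup *bc);

/* throttling moved to upper layers: limits, weights and observed
 * rates would all have to be exported */
struct blkcg_export {
        unsigned long long bps_limit;
        unsigned long long iops_limit;
        unsigned int       weight;
        unsigned long long observed_bps;
        unsigned long long observed_iops;
};

extern void blkcg_get_export(struct bdi_cgroup *bc, struct blkcg_export *out);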

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]     ` <20120407080027.GA2584-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
  2012-04-10 16:23       ` [Lsf] " Steve French
@ 2012-04-10 16:23       ` Steve French
  1 sibling, 0 replies; 262+ messages in thread
From: Steve French @ 2012-04-10 16:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vivek Goyal, ctalbott, rni, andrea, containers, linux-kernel,
	lsf, linux-mm, jmoyer, lizefan, cgroups, linux-fsdevel

On Sat, Apr 7, 2012 at 3:00 AM, Jan Kara <jack@suse.cz> wrote:
>  Hi Vivek,
>
> On Wed 04-04-12 10:51:34, Vivek Goyal wrote:
>> On Tue, Apr 03, 2012 at 11:36:55AM -0700, Tejun Heo wrote:
>> [..]
>> > IIUC, without cgroup, the current writeback code works more or less
>> > like this.  Throwing in cgroup doesn't really change the fundamental
>> > design.  Instead of a single pipe going down, we just have multiple
>> > pipes to the same device, each of which should be treated separately.
>> > Of course, a spinning disk can't be divided that easily and their
>> > performance characteristics will be inter-dependent, but the place to
>> > solve that problem is where the problem is, the block layer.
>>
>> How do you take care of throttling IO to NFS case in this model? Current
>> throttling logic is tied to block device and in case of NFS, there is no
>> block device.
>  Yeah, for throttling NFS or other network filesystems we'd have to come
> up with some throttling mechanism at some other level. The problem with
> throttling at higher levels is that you have to somehow extract information
> from lower levels about amount of work so I'm not completely certain now,
> where would be the right place. Possibly it also depends on the intended
> usecase - so far I don't know about any real user for this functionality...

Remember to distinguish between the two ends of the network file system;
they have slightly different problems.  The client has to be able to
expose the number of requests (and size of writes, or equivalently the
number of pages it can write at one time) so that writeback is not done
too aggressively.  File servers have to be able to dynamically discover
the i/o limits of the underlying volume (not the block device, but
potentially a pool of devices) so they can tell the client how much i/o
it can send.  For an SMB2 server (Samba), and eventually for NFS,
knowing how many simultaneous requests it can support will allow it to
sanely set the number of "credits" on each response - ie tell the
client how many requests are allowed in flight to a particular export.
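
As a toy model of that credit scheme (made-up names, not the actual
cifs/SMB2 code): each response carries a credit grant, and the client
only keeps as many requests in flight as it holds credits, so writeback
can never be more aggressive than the server advertised:

/* per-connection credit state */
struct smb_credits {
        unsigned int available;         /* granted by the server, unused */
        unsigned int in_flight;         /* requests currently outstanding */
};

/* returns 1 if another write request may be sent right now */
int credit_try_consume(struct smb_credits *c)
{
        if (c->available == 0)
                return 0;
        c->available--;
        c->in_flight++;
        return 1;
}

/* called when a response arrives carrying a fresh credit grant */
void credit_complete(struct smb_credits *c, unsigned int granted)
{
        c->in_flight--;
        c->available += granted;
}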

As for block device throttling - other than the file system internally
using such APIs, who would use block-device-specific throttling?  Only
the file system knows where it wants to put hot data, and in the case
of btrfs, doesn't the file system manage the storage pool?  The block
device should be transparent to the user in the long run, with only
the volume visible.


-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]     ` <20120407080027.GA2584-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
  2012-04-10 16:23       ` [Lsf] " Steve French
@ 2012-04-10 18:06       ` Vivek Goyal
  1 sibling, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-10 18:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Sat, Apr 07, 2012 at 10:00:27AM +0200, Jan Kara wrote:
Hi Jan,

[..]
> > In general, the core of the issue is that filesystems are not cgroup aware
> > and if you do throttling below filesystems, then invariably one or other
> > serialization issue will come up and I am concerned that we will be constantly
> > fixing those serialization issues. Or the design point could be so central
> > to filesystem design that it can't be changed.
>   We talked about this at LSF and Dave Chinner had the idea that we could
> make processes wait at the time when a transaction is started. At that time
> we don't hold any global locks so process can be throttled without
> serializing other processes. This effectively builds some cgroup awareness
> into filesystems but pretty simple one so it should be doable.

Ok. So what is the meaning of "make process wait" here? What it will be
dependent on? I am thinking of a case where a process has 100MB of dirty
data, has 10MB/s write limit and it issues fsync. So before that process
is able to open a transaction, one needs to wait atleast 10seconds
(assuming other processes are not doing IO in same cgroup). 

If this wait is based on making sure all dirty data has been written back
before opening transaction, then it will work without any interaction with
block layer and sounds more feasible.

> 
> > In general, if you do throttling deeper in the stakc and build back
> > pressure, then all the layers sitting above should be cgroup aware
> > to avoid problems. Two layers identified so far are writeback and
> > filesystems. Is it really worth the complexity. How about doing 
> > throttling in higher layers when IO is entering the kernel and
> > keep proportional IO logic at the lowest level and current mechanism
> > of building pressure continues to work?
>   I would like to keep single throttling mechanism for different limitting
> methods - i.e. handle proportional IO the same way as IO hard limits. So we
> cannot really rely on the fact that throttling is work preserving.
> 
> The advantage of throttling at IO layer is that we can keep all the details
> inside it and only export pretty minimal information (like is bdi congested
> for given cgroup) to upper layers. If we wanted to do throttling at upper
> layers (such as Fengguang's buffered write throttling), we need to export
> the internal details to allow effective throttling...

For absolute throttling we really don't have to expose any details. In
fact in my implementation of throttling buffered writes, I just had exported
a single function to be called in bdi dirty rate limit. The caller will
simply sleep long enough depending on the size of IO it is doing and
how many other processes are doing IO in same cgroup.

So implementation was still in block layer and only a single function
was exposed to higher layers.

One more factor makes absolute throttling interesting and that is global
throttling and not per device throttling. For example in case of btrfs,
there is no single stacked device on which to put total throttling
limits.

So if filesystems can handle serialization issue, then back pressure
method looks more clean (thought complex).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-07  8:00     ` Jan Kara
@ 2012-04-10 18:06       ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-10 18:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman,
	andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu,
	lizefan, containers, cgroups, ctalbott, rni, lsf

On Sat, Apr 07, 2012 at 10:00:27AM +0200, Jan Kara wrote:
Hi Jan,

[..]
> > In general, the core of the issue is that filesystems are not cgroup aware
> > and if you do throttling below filesystems, then invariably one or other
> > serialization issue will come up and I am concerned that we will be constantly
> > fixing those serialization issues. Or the desgin point could be so central
> > to filesystem design that it can't be changed.
>   We talked about this at LSF and Dave Chinner had the idea that we could
> make processes wait at the time when a transaction is started. At that time
> we don't hold any global locks so process can be throttled without
> serializing other processes. This effectively builds some cgroup awareness
> into filesystems but pretty simple one so it should be doable.

Ok. So what is the meaning of "make process wait" here? What will it be
dependent on? I am thinking of a case where a process has 100MB of dirty
data, has a 10MB/s write limit and issues fsync. Before that process is
able to open a transaction, it needs to wait at least 10 seconds
(assuming other processes are not doing IO in the same cgroup).

If this wait is based on making sure all dirty data has been written back
before opening the transaction, then it will work without any interaction
with the block layer and sounds more feasible.

> 
> > In general, if you do throttling deeper in the stakc and build back
> > pressure, then all the layers sitting above should be cgroup aware
> > to avoid problems. Two layers identified so far are writeback and
> > filesystems. Is it really worth the complexity. How about doing 
> > throttling in higher layers when IO is entering the kernel and
> > keep proportional IO logic at the lowest level and current mechanism
> > of building pressure continues to work?
>   I would like to keep single throttling mechanism for different limitting
> methods - i.e. handle proportional IO the same way as IO hard limits. So we
> cannot really rely on the fact that throttling is work preserving.
> 
> The advantage of throttling at IO layer is that we can keep all the details
> inside it and only export pretty minimal information (like is bdi congested
> for given cgroup) to upper layers. If we wanted to do throttling at upper
> layers (such as Fengguang's buffered write throttling), we need to export
> the internal details to allow effective throttling...

For absolute throttling we really don't have to expose any details. In
fact, in my implementation of throttling buffered writes I had just
exported a single function to be called from the bdi dirty rate limiting
code. The caller simply sleeps long enough, depending on the size of the
IO it is doing and how many other processes are doing IO in the same
cgroup.

So the implementation was still in the block layer and only a single
function was exposed to higher layers.
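
Just to make it concrete, here is a toy userspace model of that hook (all
names are hypothetical, this is not the actual patch).  It also shows the
fsync example from above: 100MB of dirty data against a 10MB/s limit works
out to a 10 second delay:

#include <stdio.h>

struct cgroup_io_limit {
	unsigned long long bytes_per_sec;	/* absolute write limit */
	unsigned int nr_writers;		/* tasks writing in this cgroup */
};

/* Seconds this writer should be delayed for an IO of io_bytes. */
static double throttle_delay(const struct cgroup_io_limit *lim,
			     unsigned long long io_bytes)
{
	/* The cgroup's budget is shared by all of its writers. */
	double per_writer_rate = (double)lim->bytes_per_sec / lim->nr_writers;

	return (double)io_bytes / per_writer_rate;
}

int main(void)
{
	struct cgroup_io_limit lim = { .bytes_per_sec = 10ULL << 20,
				       .nr_writers = 1 };

	printf("delay = %.1f seconds\n", throttle_delay(&lim, 100ULL << 20));
	return 0;
}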

One more factor makes absolute throttling interesting, and that is global
throttling as opposed to per-device throttling. For example, in the case
of btrfs there is no single stacked device on which to put total
throttling limits.

So if filesystems can handle the serialization issue, then the back
pressure method looks cleaner (though complex).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
  2012-04-10 16:23       ` Steve French
@ 2012-04-10 18:16         ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-10 18:16 UTC (permalink / raw)
  To: Steve French
  Cc: Jan Kara, ctalbott, rni, andrea, containers, linux-kernel, lsf,
	linux-mm, jmoyer, lizefan, cgroups, linux-fsdevel

On Tue, Apr 10, 2012 at 11:23:16AM -0500, Steve French wrote:

[..]
> In the case of block device throttling - other than the file system
> internally using such APIs who would use block device specific
> throttling - only the file system knows where it wants to put hot data,
> and in the case of btrfs, doesn't the file system manage the
> storage pool.   The block device should be transparent to the
> user in the long run, and only the volume visible.

This is a good point. I guess this goes back to Jan's question of what's
the intended use case of absolute throttling. Having a dependency on
per-device limits has the drawback that the user has to know the exact
details of the storage stack, and it assumes that there is one single
aggregation point of block devices (which is not true in the case of
btrfs).

If the user is simply looking for something like "I don't want a backup
process to be writing at more than 50MB/s" (so that other processes doing
IO to the same filesystem are affected less), then it is a case of global
throttling, and per-device throttling really does not gel well with that.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-10 21:05           ` Jan Kara
  0 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-10 21:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm,
	sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni,
	lsf

  Hi Vivek,

On Tue 10-04-12 14:06:53, Vivek Goyal wrote:
> On Sat, Apr 07, 2012 at 10:00:27AM +0200, Jan Kara wrote:
> > > In general, the core of the issue is that filesystems are not cgroup aware
> > > and if you do throttling below filesystems, then invariably one or other
> > > serialization issue will come up and I am concerned that we will be constantly
> > > fixing those serialization issues. Or the desgin point could be so central
> > > to filesystem design that it can't be changed.
> >   We talked about this at LSF and Dave Chinner had the idea that we could
> > make processes wait at the time when a transaction is started. At that time
> > we don't hold any global locks so process can be throttled without
> > serializing other processes. This effectively builds some cgroup awareness
> > into filesystems but pretty simple one so it should be doable.
> 
> Ok. So what is the meaning of "make process wait" here? What it will be
> dependent on? I am thinking of a case where a process has 100MB of dirty
> data, has 10MB/s write limit and it issues fsync. So before that process
> is able to open a transaction, one needs to wait atleast 10seconds
> (assuming other processes are not doing IO in same cgroup). 
  The original idea was that we'd have a "bdi-congested-for-cgroup" flag
and a process starting a transaction would wait for this flag to be
cleared before opening a new transaction. This would be easy to implement
in filesystems and won't have serialization issues. But my knowledge of
blk-throttle is lacking, so there might be some problems with this approach.
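
  A minimal sketch of what I mean, with stub functions standing in for the
hypothetical per-bdi, per-cgroup congestion check and for the filesystem's
transaction start - this is not real kernel code:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Stub: the real check would ask blk-throttle about this bdi + cgroup. */
static bool bdi_cgroup_congested(void)
{
	return false;
}

/* Stub: e.g. a jbd2-style transaction open in the real filesystem. */
static void journal_start(void)
{
	printf("transaction started\n");
}

static void throttled_journal_start(void)
{
	/*
	 * Wait outside of any filesystem locks, so a throttled cgroup only
	 * delays itself and does not serialize other groups.
	 */
	while (bdi_cgroup_congested())
		usleep(10 * 1000);	/* back off and re-check */
	journal_start();
}

int main(void)
{
	throttled_journal_start();
	return 0;
}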

> If this wait is based on making sure all dirty data has been written back
> before opening transaction, then it will work without any interaction with
> block layer and sounds more feasible.
> 
> > 
> > > In general, if you do throttling deeper in the stakc and build back
> > > pressure, then all the layers sitting above should be cgroup aware
> > > to avoid problems. Two layers identified so far are writeback and
> > > filesystems. Is it really worth the complexity. How about doing 
> > > throttling in higher layers when IO is entering the kernel and
> > > keep proportional IO logic at the lowest level and current mechanism
> > > of building pressure continues to work?
> >   I would like to keep single throttling mechanism for different limitting
> > methods - i.e. handle proportional IO the same way as IO hard limits. So we
> > cannot really rely on the fact that throttling is work preserving.
> > 
> > The advantage of throttling at IO layer is that we can keep all the details
> > inside it and only export pretty minimal information (like is bdi congested
> > for given cgroup) to upper layers. If we wanted to do throttling at upper
> > layers (such as Fengguang's buffered write throttling), we need to export
> > the internal details to allow effective throttling...
> 
> For absolute throttling we really don't have to expose any details. In
> fact in my implementation of throttling buffered writes, I just had exported
> a single function to be called in bdi dirty rate limit. The caller will
> simply sleep long enough depending on the size of IO it is doing and
> how many other processes are doing IO in same cgroup.
>
> So implementation was still in block layer and only a single function
> was exposed to higher layers.
  OK, I see.
 
> One more factor makes absolute throttling interesting and that is global
> throttling and not per device throttling. For example in case of btrfs,
> there is no single stacked device on which to put total throttling
> limits.
  Yes. My intended interface for the throttling is the bdi. But you are
right that it does not exactly match the fact that the throttling happens
per device, so it might get tricky. Which brings up a question - shouldn't
the throttling blk-throttle does rather happen at the bdi layer? The uses
of the functionality I have in mind would match that better.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-10 21:05           ` Jan Kara
@ 2012-04-10 21:20             ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-10 21:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman,
	andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu,
	lizefan, containers, cgroups, ctalbott, rni, lsf

On Tue, Apr 10, 2012 at 11:05:05PM +0200, Jan Kara wrote:

[..]
> > Ok. So what is the meaning of "make process wait" here? What it will be
> > dependent on? I am thinking of a case where a process has 100MB of dirty
> > data, has 10MB/s write limit and it issues fsync. So before that process
> > is able to open a transaction, one needs to wait atleast 10seconds
> > (assuming other processes are not doing IO in same cgroup). 
>   The original idea was that we'd have "bdi-congested-for-cgroup" flag
> and the process starting a transaction will wait for this flag to get
> cleared before starting a new transaction. This will be easy to implement
> in filesystems and won't have serialization issues. But my knowledge of
> blk-throttle is lacking so there might be some problems with this approach.

I have implemented and posted patches for a per-bdi, per-cgroup congestion
flag. The only problem I see with that is that a group might be congested
for a long time because of lots of other IO happening (say direct IO); if
you keep backing off and never submit the metadata IO (the transaction),
you get starved. And if you go ahead and submit IO in a congested group,
we are back to the serialization issue.

[..]
> > One more factor makes absolute throttling interesting and that is global
> > throttling and not per device throttling. For example in case of btrfs,
> > there is no single stacked device on which to put total throttling
> > limits.
>   Yes. My intended interface for the throttling is bdi. But you are right
> it does not exactly match the fact that the throttling happens per device
> so it might get tricky. Which brings up a question - shouldn't the
> throttling blk-throttle does rather happen at bdi layer? Because the
> uses of the functionality I have in mind would match that better.

I guess throttling at the bdi layer will take care of the network
filesystem case too?  But isn't the notion of a "bdi" internal to the
kernel? The user does not really configure things in terms of bdis.

Also, a per-bdi limit mechanism will not solve the issue of global
throttling where, in the case of btrfs, an IO might go to multiple bdis.
So the throttling limits are not total but per bdi.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-10 21:20             ` Vivek Goyal
@ 2012-04-10 22:24               ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-10 22:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm,
	sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni,
	lsf

On Tue 10-04-12 17:20:41, Vivek Goyal wrote:
> On Tue, Apr 10, 2012 at 11:05:05PM +0200, Jan Kara wrote:
> 
> [..]
> > > Ok. So what is the meaning of "make process wait" here? What it will be
> > > dependent on? I am thinking of a case where a process has 100MB of dirty
> > > data, has 10MB/s write limit and it issues fsync. So before that process
> > > is able to open a transaction, one needs to wait atleast 10seconds
> > > (assuming other processes are not doing IO in same cgroup). 
> >   The original idea was that we'd have "bdi-congested-for-cgroup" flag
> > and the process starting a transaction will wait for this flag to get
> > cleared before starting a new transaction. This will be easy to implement
> > in filesystems and won't have serialization issues. But my knowledge of
> > blk-throttle is lacking so there might be some problems with this approach.
> 
> I have implemented and posted patches for per bdi per cgroup congestion
> flag. The only problem I see with that is that a group might be congested
> for a long time because of lots of other IO happening (say direct IO) and
> if you keep on backing off and never submit the metadata IO (transaction),
> you get starved. And if you go ahead and submit IO in a congested group,
> we are back to serialization issue.
  Clearly, we mustn't throttle metadata IO once it gets to the block layer.
That's why we discuss throttling of processes at transaction start after
all. But I agree starvation is an issue - I originally thought blk-throttle
throttles synchronously, which wouldn't have starvation issues. But when
that's not the case, things are a bit more tricky. We could treat the
transaction start as an IO of some size (since we already have an estimate
of how large a transaction will be when we start it) and let the
transaction start only when our "virtual" IO would have been submitted,
but I feel that maybe gets too complicated... Maybe we could just delay
the transaction start by the amount reported from the blk-throttle layer?
Something along the lines of the throttling callback you implemented?
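
  In userspace pseudo-code (names made up), the "virtual IO" variant would
amount to charging the estimated transaction size against the cgroup's
limit and delaying the start by the time that IO would have taken:

#include <stdio.h>

/* Seconds to delay the transaction start for this cgroup. */
static double tx_start_delay(unsigned long long estimated_tx_bytes,
			     unsigned long long cgroup_bytes_per_sec)
{
	if (cgroup_bytes_per_sec == 0)
		return 0.0;	/* unlimited cgroup: start immediately */
	return (double)estimated_tx_bytes / (double)cgroup_bytes_per_sec;
}

int main(void)
{
	/* e.g. a transaction estimated at 4MB under a 1MB/s limit waits ~4s */
	printf("delay = %.1f s\n", tx_start_delay(4ULL << 20, 1ULL << 20));
	return 0;
}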

> [..]
> > > One more factor makes absolute throttling interesting and that is global
> > > throttling and not per device throttling. For example in case of btrfs,
> > > there is no single stacked device on which to put total throttling
> > > limits.
> >   Yes. My intended interface for the throttling is bdi. But you are right
> > it does not exactly match the fact that the throttling happens per device
> > so it might get tricky. Which brings up a question - shouldn't the
> > throttling blk-throttle does rather happen at bdi layer? Because the
> > uses of the functionality I have in mind would match that better.
> 
> I guess throttling at bdi layer will take care of network filesystem
> case too?
  Yes, at least for the client side. On the server side, Steve wants the
server to have insight into how much IO we could push in the future so
that it can limit the number of outstanding requests, if I understand him
right. I'm not sure we really want to / are able to provide this amount
of knowledge to filesystems, even less to userspace...

> But isn't the notion of "bdi" internal to kernel and user does
> not really program thing in terms of bdi.
  Well, it is. But we already have per-bdi tunables (e.g. readahead) that
are exported in /sys/block/<device>/queue/, so we have some precedent.
 
> Also per bdi limit mechanism will not solve the issue of global throttling
> where in case of btrfs an IO might go to multiple bdi's. So throttling limits
> are not total but per bdi.
  Well, btrfs plays tricks with bdi's but there is a special bdi called
"btrfs" which backs the whole filesystem and that is what's put in
sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a
global bdi to work with.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-11 15:40                   ` Vivek Goyal
@ 2012-04-11 15:45                     ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-11 15:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman,
	andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu,
	lizefan, containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote:
> On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
> 
> [..]
> > > I have implemented and posted patches for per bdi per cgroup congestion
> > > flag. The only problem I see with that is that a group might be congested
> > > for a long time because of lots of other IO happening (say direct IO) and
> > > if you keep on backing off and never submit the metadata IO (transaction),
> > > you get starved. And if you go ahead and submit IO in a congested group,
> > > we are back to serialization issue.
> >   Clearly, we mustn't throttle metadata IO once it gets to the block layer.
> > That's why we discuss throttling of processes at transaction start after
> > all. But I agree starvation is an issue - I originally thought blk-throttle
> > throttles synchronously which wouldn't have starvation issues.

Current bio throttling is asynchronous. A process can submit the bio and
go back and wait for the bio to finish. That bio will be queued in a
per-cgroup queue at the device and will be dispatched to the device
according to the configured IO rate for the cgroup.

The additional feature for buffered throttling (which never went upstream)
was synchronous in nature. That is, we were actively putting the writer to
sleep on a per-cgroup wait queue in the request queue and waking it up when
it could do further IO based on the cgroup limits.
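
To give an idea, the synchronous side looked roughly like the sketch below
(heavily simplified; the budget helpers and the wait queue name are made up,
the real patches were more involved):

/* Stripped-down sketch of the synchronous buffered-write throttling. */
static void blk_throtl_buffered_write(struct request_queue *q,
				      struct throtl_grp *tg,
				      unsigned int nr_bytes)
{
	DEFINE_WAIT(wait);

	/* Sleep on the per-cgroup wait queue in the request queue until
	 * the group's configured rate allows this much more write IO. */
	while (!tg_may_dispatch_bytes(tg, nr_bytes)) {		/* hypothetical */
		prepare_to_wait(&tg->write_waitq, &wait,	/* hypothetical waitq */
				TASK_UNINTERRUPTIBLE);
		io_schedule();
		finish_wait(&tg->write_waitq, &wait);
	}
	tg_charge_bytes(tg, nr_bytes);				/* hypothetical */
}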

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-11 15:45                     ` Vivek Goyal
@ 2012-04-11 17:05                       ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-11 17:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm,
	sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni,
	lsf

On Wed 11-04-12 11:45:31, Vivek Goyal wrote:
> On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote:
> > On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
> > 
> > [..]
> > > > I have implemented and posted patches for per bdi per cgroup congestion
> > > > flag. The only problem I see with that is that a group might be congested
> > > > for a long time because of lots of other IO happening (say direct IO) and
> > > > if you keep on backing off and never submit the metadata IO (transaction),
> > > > you get starved. And if you go ahead and submit IO in a congested group,
> > > > we are back to serialization issue.
> > >   Clearly, we mustn't throttle metadata IO once it gets to the block layer.
> > > That's why we discuss throttling of processes at transaction start after
> > > all. But I agree starvation is an issue - I originally thought blk-throttle
> > > throttles synchronously which wouldn't have starvation issues.
> 
> Current bio throttling is asynchrounous. Process can submit the bio
> and go back and wait for bio to finish. That bio will be queued at device
> queue in a per cgroup queue and will be dispatched to device according
> to configured IO rate for cgroup.
> 
> The additional feature for buffered throttle (which never went upstream),
> was synchronous in nature. That is we were actively putting writer to
> sleep on a per cgroup wait queue in the request queue and wake it up when
> it can do further IO based on cgroup limits.
  Hmm, but then there would be similar starvation issues as with my simple
scheme because async IO could always use the whole available bandwidth.
Mixing sync & async throttling is really problematic... I'm wondering how
useful the async throttling is, because we will block on request allocation
once there are more than nr_requests pending requests, so at that point
throttling becomes sync anyway.
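
I mean something like the following effect (just illustrating the idea; this
is not the real request allocation code and the helpers are made up):

/* Once nr_requests descriptors are in flight, the submitter sleeps until
 * one is freed, so throttling is effectively synchronous from then on. */
static void wait_for_request_slot(struct request_queue *q)
{
	wait_event(q->request_free_wq,				/* hypothetical */
		   queue_in_flight_requests(q) < q->nr_requests); /* hypothetical */
}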

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-11 17:05                       ` Jan Kara
@ 2012-04-11 17:23                         ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-11 17:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman,
	andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu,
	lizefan, containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote:
> On Wed 11-04-12 11:45:31, Vivek Goyal wrote:
> > On Wed, Apr 11, 2012 at 11:40:05AM -0400, Vivek Goyal wrote:
> > > On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
> > > 
> > > [..]
> > > > > I have implemented and posted patches for per bdi per cgroup congestion
> > > > > flag. The only problem I see with that is that a group might be congested
> > > > > for a long time because of lots of other IO happening (say direct IO) and
> > > > > if you keep on backing off and never submit the metadata IO (transaction),
> > > > > you get starved. And if you go ahead and submit IO in a congested group,
> > > > > we are back to serialization issue.
> > > >   Clearly, we mustn't throttle metadata IO once it gets to the block layer.
> > > > That's why we discuss throttling of processes at transaction start after
> > > > all. But I agree starvation is an issue - I originally thought blk-throttle
> > > > throttles synchronously which wouldn't have starvation issues.
> > 
> > Current bio throttling is asynchrounous. Process can submit the bio
> > and go back and wait for bio to finish. That bio will be queued at device
> > queue in a per cgroup queue and will be dispatched to device according
> > to configured IO rate for cgroup.
> > 
> > The additional feature for buffered throttle (which never went upstream),
> > was synchronous in nature. That is we were actively putting writer to
> > sleep on a per cgroup wait queue in the request queue and wake it up when
> > it can do further IO based on cgroup limits.
>   Hmm, but then there would be similar starvation issues as with my simple
> scheme because async IO could always use the whole available bandwidth.

It depends on how the throttling logic decides to divide bandwidth between
sync and async. I had chosen a round-robin policy of dispatching some of the
queued bios and then allowing some async IO, etc. So async IO was not
consuming the whole available bandwidth. We could easily tilt it in favor of
sync IO with a tunable knob.
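
Roughly, the dispatch loop did something like this (heavily simplified; the
helpers and the knob name here are made up):

/* Heavily simplified sketch of the round-robin dispatch idea. */
static void throtl_dispatch_one_round(struct throtl_grp *tg)
{
	/* Tunable knob: share of each round given to queued (sync) bios. */
	unsigned int sync_share = tg->sync_share;	/* hypothetical, e.g. 75 */

	/* First dispatch queued bios up to their share of the group's
	 * budget for this round ... */
	dispatch_queued_bios(tg, sync_share);		/* hypothetical */

	/* ... then let some buffered (async) writers proceed, so async IO
	 * cannot eat the whole configured bandwidth. */
	wake_buffered_writers(tg, 100 - sync_share);	/* hypothetical */
}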

> Mixing of sync & async throttling is really problematic... I'm wondering
> how useful the async throttling is.

If sync throttling is useful, then async throttling has to be useful too,
doesn't it? Especially given the fact that async IO often consumes all the
bandwidth, impacting sync latencies.

> Because we will block on request
> allocation once there are more than nr_requests pending requests so at that
> point throttling becomes sync anyway.

First of all, flushers will block on nr_requests, not the actual writers.
And secondly, we thought of having per-group request descriptors so that
writes of one group don't impact others. So once the writes of a group
are backlogged, the flusher can query the congestion status of the group
and not submit any more writes to that group. As some writes are already
queued in that group, writes will not be starved. Well, in the case of
deadline, even direct writes go into the write queue, so theoretically we
can hit the starvation issue (flush not being able to submit writes without
risking blocking) there too.

To avoid this starvation, ideally we need a per-bdi per-cgroup flusher, so
that the flusher can simply block if there are not enough request
descriptors left in the cgroup.

So trying to throttle buffered writes synchronously in balance_dirty_pages()
at least simplifies the implementation.  I like my implementation better
than Fengguang's approach to throttling for the simple reason that buffered
writes and direct writes can be subjected to the same throttling limits
instead of separate limits for buffered writes.
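
On the flusher side that would look something along these lines (sketch only;
all the helper names here are made up):

/* Sketch of the flusher-side logic with per-group request descriptors and
 * a per-bdi per-cgroup congestion flag. */
static long flush_inodes_of_group(struct backing_dev_info *bdi,
				  struct cgroup *cgrp, long nr_pages)
{
	/* Group backlogged (out of its request descriptors)? Skip it for
	 * now; its already-queued writes keep it from being starved. */
	if (bdi_cgroup_congested(bdi, cgrp))		/* hypothetical */
		return 0;

	return writeback_group_inodes(bdi, cgrp, nr_pages);	/* hypothetical */
}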

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-11 15:40                   ` Vivek Goyal
@ 2012-04-11 19:22                     ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-11 19:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Tejun Heo, Fengguang Wu, Jens Axboe, linux-mm,
	sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni,
	lsf

On Wed 11-04-12 11:40:05, Vivek Goyal wrote:
> On Wed, Apr 11, 2012 at 12:24:25AM +0200, Jan Kara wrote:
> 
> [..]
> > > I have implemented and posted patches for per bdi per cgroup congestion
> > > flag. The only problem I see with that is that a group might be congested
> > > for a long time because of lots of other IO happening (say direct IO) and
> > > if you keep on backing off and never submit the metadata IO (transaction),
> > > you get starved. And if you go ahead and submit IO in a congested group,
> > > we are back to serialization issue.
> >   Clearly, we mustn't throttle metadata IO once it gets to the block layer.
> > That's why we discuss throttling of processes at transaction start after
> > all. But I agree starvation is an issue - I originally thought blk-throttle
> > throttles synchronously which wouldn't have starvation issues. But when
> > that's not the case things are a bit more tricky. We could treat
> > transaction start as an IO of some size (since we already have some
> > estimation how large a transaction will be when we are starting it) and let
> > the transaction start only when our "virtual" IO would be submitted but
> > I feel that gets maybe too complicated... Maybe we could just delay the
> > transaction start by the amount reported from blk-throttle layer? Something
> > along your callback for throttling you implemented?
> 
> I think now I have lost you. It probably stems from the fact that I don't
> know much about transactions and filesystem.
>  
> So all the metadata IO will happen thorough journaling thread and that
> will be in root group which should remain unthrottled. So any journal
> IO going to disk should remain unthrottled.
  Yes, that is true at least for ext3/ext4 or btrfs. In principle we don't
have to have the journal thread (as is the case with reiserfs, where a
random writer may end up doing the commit), but let's not complicate things
unnecessarily.

> Now, IIRC, fsync problem with throttling was that we had opened a
> transaction but could not write it back to disk because we had to
> wait for all the cached data to go to disk (which is throttled). So
> my question is, can't we first wait for all the data to be flushed
> to disk and then open a transaction for metadata. metadata will be
> unthrottled so filesystem will not have to do any tricks like bdi is
> congested or not.
  Actually that's what's happening. We first do filemap_write_and_wait(),
which syncs all the data, and then we go and force a transaction commit to
make sure all the metadata got to stable storage. The problem is that
writeout of the data may need to allocate new blocks, and that starts a
transaction; while the transaction is started we may need to do some reads
(e.g. of bitmaps etc.) which may be throttled, and at that moment the whole
filesystem is blocked. I don't remember the stack traces you showed me, so
I'm not sure whether this is what you observed, but it's certainly one
possible scenario. The reason why fsync triggers problems is simply that
it's the only place where a process normally does a significant amount of
writing. In most cases the flusher thread / journal thread does it, so this
effect is not visible. And to preempt your question, it would be rather
hard to avoid IO while the transaction is started, due to locking.
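
To make the problematic nesting explicit, inside the data writeout something
like the following can happen (very much simplified; the helpers below are
only stand-ins and the details differ per filesystem):

/* Simplified: block allocation during data writeout from fsync. */
static int writepage_allocating_sketch(struct page *page)
{
	struct inode *inode = page->mapping->host;
	handle_t *handle;
	int ret;

	/* Block allocation needs a transaction ... */
	handle = journal_start_sketch(inode);		/* stand-in */

	/* ... and while it is open we may have to read block bitmaps.
	 * If that read is throttled in the submitter's cgroup, we sleep
	 * here with the transaction open and every other transaction
	 * starter on the filesystem waits behind us. */
	ret = read_block_bitmap_sketch(inode);		/* stand-in */

	journal_stop_sketch(handle);			/* stand-in */
	return ret;
}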

> [..]
> > > I guess throttling at bdi layer will take care of network filesystem
> > > case too?
> >   Yes. At least for client side. On sever side Steve wants server to have
> > insight into how much IO we could push in future so that it can limit
> > number of outstanding requests if I understand him right. I'm not sure we
> > really want / are able to provide this amount of knowledge to filesystems
> > even less userspace...
> 
> I am not sure what does it mean but server could simply query the bdi
> and read configured rate and then it knows at what rate IO will go to
> disk and make predictions about future?
  Yeah, that would work if we had the current bandwidth for the current
cgroup exposed in the bdi.
 
> > > Also per bdi limit mechanism will not solve the issue of global throttling
> > > where in case of btrfs an IO might go to multiple bdi's. So throttling limits
> > > are not total but per bdi.
> >   Well, btrfs plays tricks with bdi's but there is a special bdi called
> > "btrfs" which backs the whole filesystem and that is what's put in
> > sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a
> > global bdi to work with.
> 
> Ok, that's good to know. How would we configure this special bdi? I am
> assuming there is no backing device visible in /sys/block/<device>/queue/?
> Same is true for network file systems.
  Where should the backing device be visible? Now it's me who is lost :)

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-11 19:22                     ` Jan Kara
  (?)
@ 2012-04-12 20:37                         ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-12 20:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Wed, Apr 11, 2012 at 09:22:31PM +0200, Jan Kara wrote:

[..]
> > >   Well, btrfs plays tricks with bdi's but there is a special bdi called
> > > "btrfs" which backs the whole filesystem and that is what's put in
> > > sb->s_bdi or in each inode's i_mapping->backing_dev_info. So we have a
> > > global bdi to work with.
> > 
> > Ok, that's good to know. How would we configure this special bdi? I am
> > assuming there is no backing device visible in /sys/block/<device>/queue/?
> > Same is true for network file systems.
>   Where should be the backing device visible? Now it's me who is lost :)

I mean, how are we supposed to put cgroup throttling rules using the cgroup
interface for network filesystems and for the btrfs global bdi? Using the
"dev_t" associated with the bdi? I see that all the bdi's are showing up in
/sys/class/bdi, but how do I know which one I am interested in, or which
one belongs to the filesystem I am interested in putting a throttling rule on?

For block devices, we simply use "major:min limit" format to write to
a cgroup file and this configuration will sit in one of the per queue
per cgroup data structure.
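
(For reference, the per-block-device interface can be driven from a tiny
userspace program like the one below; the 8:16 device numbers and the
/sys/fs/cgroup/blkio mount point are assumptions for the example:

  #include <stdio.h>

  /* limit writes of the root blkio cgroup to 1MB/s on device 8:16;
   * a child group would use its own directory under blkio/ */
  int main(void)
  {
          FILE *f = fopen("/sys/fs/cgroup/blkio/"
                          "blkio.throttle.write_bps_device", "w");

          if (!f)
                  return 1;
          fprintf(f, "8:16 1048576\n");   /* "major:minor bytes_per_sec" */
          return fclose(f) ? 1 : 0;
  }

The open question above is what the analogous handle would be for a bdi
that is not backed by a block device.)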

I am assuming that when you say throttling should happen at bdi, you
are thinking of maintaining per cgroup per bdi data structures and user
is somehow supposed to pass "bdi_maj:bdi_min  limit" through cgroup files?
If yes, how does one map a filesystem's bdi we want to put rules on?

Also, at the request-queue level we have bios, and we throttle bios. At the
bdi level, I think there are no bios yet, so somehow we have got to deal with
pages. I am not sure how exactly throttling would happen.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                         ` <20120412203719.GL2207-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-12 20:51                           ` Tejun Heo
  2012-04-15 11:37                           ` [Lsf] " Peter Zijlstra
  1 sibling, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-12 20:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

Hello, Vivek.

On Thu, Apr 12, 2012 at 04:37:19PM -0400, Vivek Goyal wrote:
> I mean, how are we supposed to put cgroup throttling rules using the cgroup
> interface for network filesystems and for the btrfs global bdi? Using the
> "dev_t" associated with the bdi? I see that all the bdi's are showing up in
> /sys/class/bdi, but how do I know which one I am interested in, or which
> one belongs to the filesystem I am interested in putting a throttling rule on?
> 
> For block devices, we simply use "major:min limit" format to write to
> a cgroup file and this configuration will sit in one of the per queue
> per cgroup data structure.
> 
> I am assuming that when you say throttling should happen at bdi, you
> are thinking of maintaining per cgroup per bdi data structures and user
> is somehow supposed to pass "bdi_maj:bdi_min  limit" through cgroup files?
> If yes, how does one map a filesystem's bdi we want to put rules on?

I think you're worrying way too much.  One of the biggest reasons we
have layers and abstractions is to avoid worrying about everything
from everywhere.  Let block device implement per-device limits.  Let
writeback work from the backpressure it gets from the relevant IO
channel, bdi-cgroup combination in this case.

For stacked or combined devices, let the combining layer deal with
piping the congestion information.  If it's a per-file split, the
combined bdi can simply forward information from the matching
underlying device.  If the file is striped / duplicated somehow, the
*only* layer which knows what to do is, and should be, the layer
performing the striping and duplication.  There's no need to worry
about it from blkcg, and if you get the layering correct it isn't
difficult to slice such logic in between.  In fact, most of it
(backpressure propagation) would just happen as part of the usual
buffering between layers.
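
A hedged sketch of the "simply forward" case for a 1:1 stacking driver
(the "my_" names are placeholders, not an existing driver; md/dm wire up
congested_fn in a similar spirit for their striping/duplication cases):

  /* answer congestion queries by asking the lower device's bdi */
  static int my_stack_congested(void *data, int bdi_bits)
  {
          struct block_device *lower = data;

          return bdi_congested(&bdev_get_queue(lower)->backing_dev_info,
                               bdi_bits);
  }

  /* at setup time */
  q->backing_dev_info.congested_fn = my_stack_congested;
  q->backing_dev_info.congested_data = lower_bdev;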

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]     ` <20120404184909.GB29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
                         ` (2 preceding siblings ...)
  2012-04-05 16:38       ` Tejun Heo
@ 2012-04-14 11:53       ` Peter Zijlstra
  3 siblings, 0 replies; 262+ messages in thread
From: Peter Zijlstra @ 2012-04-14 11:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

On Wed, 2012-04-04 at 11:49 -0700, Tejun Heo wrote:
> > - How to handle NFS.
> 
> As said above, maybe through network based bdi pressure propagation,
> Maybe some other special case mechanism.  Unsure but I don't think
> this concern should dictate the whole design. 

NFS has a custom bdi implementation and implements congestion control
based on the number of outstanding writeback pages.

See fs/nfs/write.c:nfs_{set,end}_page_writeback

All !block-based filesystems have their own BDI implementation; I'm not
sure about the congestion implementation of anything other than NFS, though.
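
The shape of that mechanism, paraphrased from memory rather than quoted
verbatim from fs/nfs/write.c: a per-server count of pages under writeback
is compared against on/off thresholds, and the bdi congestion flag is
flipped accordingly:

  /* when a page goes under writeback */
  if (atomic_long_inc_return(&nfss->writeback) > NFS_CONGESTION_ON_THRESH)
          set_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);

  /* when writeback of a page completes */
  if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
          clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);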

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]       ` <CAH2r5mvP56D0y4mk5wKrJcj+=OZ0e0Q5No_L+9a8a=GMcEhRew-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-04-14 12:15         ` Peter Zijlstra
  0 siblings, 0 replies; 262+ messages in thread
From: Peter Zijlstra @ 2012-04-14 12:15 UTC (permalink / raw)
  To: Steve French
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Wed, 2012-04-04 at 14:23 -0500, Steve French wrote:
> Current use of bdi is a little hard to understand since
> there are 25+ fields in the structure.  

Filesystems only need a small fraction of those.

In particular,

  backing_dev_info::name	-- string
  backing_dev_info::ra_pages	-- number of read-ahead-pages
  backing_dev_info::capability	-- see BDI_CAP_*
  
One should properly initialize/destroy the thing using:

  bdi_init()/bdi_destroy()


Furthermore, it has hooks into the regular page-writeback stuff:

  test_{set,clear}_page_writeback()/bdi_writeout_inc()
  set_page_dirty()/account_page_dirtied()
  
but also allows filesystems to do custom stuff, see FUSE for example.

The only other bit is the pressure valve, aka
{set,clear}_bdi_congested(), which really is rather broken and of
dubious value.
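
Putting those pieces together, a minimal sketch of a filesystem wiring up
its own bdi at mount time (the "myfs" names are placeholders; a real
filesystem would embed the bdi in its per-sb or per-server structure):

  static struct backing_dev_info myfs_bdi;

  static int myfs_setup_bdi(struct super_block *sb)
  {
          int err;

          myfs_bdi.name = "myfs";
          myfs_bdi.ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE;
          myfs_bdi.capabilities = 0;      /* dirty accounting + writeback on */

          err = bdi_init(&myfs_bdi);
          if (err)
                  return err;

          err = bdi_register(&myfs_bdi, NULL, "myfs-%s", sb->s_id);
          if (err) {
                  bdi_destroy(&myfs_bdi);
                  return err;
          }

          sb->s_bdi = &myfs_bdi;
          return 0;
  }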

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]                   ` <20120411154005.GD16692-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2012-04-11 15:45                     ` Vivek Goyal
  2012-04-11 19:22                     ` Jan Kara
@ 2012-04-14 12:25                     ` Peter Zijlstra
  2 siblings, 0 replies; 262+ messages in thread
From: Peter Zijlstra @ 2012-04-14 12:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote:
> 
> Ok, that's good to know. How would we configure this special bdi? I am
> assuming there is no backing device visible in /sys/block/<device>/queue/?
> Same is true for network file systems. 

root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done
ls: cannot access /sys/class/bdi/0:20/: No such file or directory
total 0
drwxr-xr-x  3 root root    0 2012-03-27 23:18 .
drwxr-xr-x 35 root root    0 2012-03-27 23:02 ..
-rw-r--r--  1 root root 4096 2012-04-14 14:22 max_ratio
-rw-r--r--  1 root root 4096 2012-04-14 14:22 min_ratio
drwxr-xr-x  2 root root    0 2012-04-14 14:22 power
-rw-r--r--  1 root root 4096 2012-04-14 14:22 read_ahead_kb
lrwxrwxrwx  1 root root    0 2012-03-27 23:18 subsystem -> ../../../../class/bdi
-rw-r--r--  1 root root 4096 2012-03-27 23:18 uevent

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-12 20:51                           ` Tejun Heo
@ 2012-04-14 14:36                               ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-14 14:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

[-- Attachment #1: Type: text/plain, Size: 4887 bytes --]

On Thu, Apr 12, 2012 at 01:51:48PM -0700, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Thu, Apr 12, 2012 at 04:37:19PM -0400, Vivek Goyal wrote:
> > I mean, how are we supposed to put cgroup throttling rules using the cgroup
> > interface for network filesystems and for the btrfs global bdi? Using the
> > "dev_t" associated with the bdi? I see that all the bdi's are showing up in
> > /sys/class/bdi, but how do I know which one I am interested in, or which
> > one belongs to the filesystem I am interested in putting a throttling rule on?
> > 
> > For block devices, we simply use "major:min limit" format to write to
> > a cgroup file and this configuration will sit in one of the per queue
> > per cgroup data structure.
> > 
> > I am assuming that when you say throttling should happen at bdi, you
> > are thinking of maintaining per cgroup per bdi data structures and user
> > is somehow supposed to pass "bdi_maj:bdi_min  limit" through cgroup files?
> > If yes, how does one map a filesystem's bdi we want to put rules on?
> 
> I think you're worrying way too much.  One of the biggest reasons we
> have layers and abstractions is to avoid worrying about everything
> from everywhere.  Let block device implement per-device limits.  Let
> writeback work from the backpressure it gets from the relevant IO
> channel, bdi-cgroup combination in this case.
> 
> For stacked or combined devices, let the combining layer deal with
> piping the congestion information.  If it's a per-file split, the
> combined bdi can simply forward information from the matching
> underlying device.  If the file is striped / duplicated somehow, the
> *only* layer which knows what to do is, and should be, the layer
> performing the striping and duplication.  There's no need to worry
> about it from blkcg, and if you get the layering correct it isn't
> difficult to slice such logic in between.  In fact, most of it
> (backpressure propagation) would just happen as part of the usual
> buffering between layers.

Yeah, the backpressure idea would work nicely with all possible
intermediate stacking between the bdi and the leaf devices. In my attempt
to do combined IO bandwidth control for

- buffered writes, in balance_dirty_pages()
- direct IO, in the cfq IO scheduler

I have had to look into the cfq code over the past days to get an idea of
how the two throttling layers can cooperate (and suffer from the pains
arising from the violations of layering). It's also rather tricky to get
two previously independent throttling mechanisms to work seamlessly
with each other to provide the desired _unified_ user interface. It
took a lot of reasoning and experiments to work the basic scheme out...

But here is the first result. The attached graph shows progress of 4
tasks:
- cgroup A: 1 direct dd + 1 buffered dd
- cgroup B: 1 direct dd + 1 buffered dd

The 4 tasks are mostly progressing at the same pace. The top 2
smoother lines are for the buffered dirtiers; the bottom 2 lines are
for the direct writers. As you may notice, the two direct writers
stalled once or twice, which increases the gaps between the
lines. Otherwise, the algorithm is working as expected to distribute
the bandwidth to each task.

The current code's target is to satisfy the more realistic user demand
of distributing bandwidth equally to each cgroup and, inside each
cgroup, distributing bandwidth equally to buffered/direct writes. On top
of that, weights can be specified to change the default distribution.

The implementation involves adding a "weight for direct IO" to the cfq
groups and a "weight for buffered writes" to the root cgroup. Note that
the current cfq proportional IO controller does not offer explicit control
over the direct:buffered ratio.

When there are both direct/buffered writers in the cgroup,
balance_dirty_pages() will kick in and adjust the weights for cfq to
execute. Note that cfq will continue to send all flusher IOs to the
root cgroup.  balance_dirty_pages() will compute the overall async
weight for it so that in the above test case, the computed weights
will be

- 1000 async weight for the root cgroup (2 buffered dds)
- 500 dio weight for cgroup A (1 direct dd)
- 500 dio weight for cgroup B (1 direct dd)
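
(A toy userspace illustration of how those numbers fall out, assuming each
cgroup's weight defaults to 1000 and is split evenly between its direct
and buffered writers, with all buffered shares summed into one async
weight for the root cgroup; this is a sketch of the scheme described
above, not the actual balance_dirty_pages() code:

  #include <stdio.h>

  struct cgrp { const char *name; int weight; int buffered; int direct; };

  int main(void)
  {
          struct cgrp g[] = { { "A", 1000, 1, 1 }, { "B", 1000, 1, 1 } };
          int i, async = 0;

          for (i = 0; i < 2; i++) {
                  int dio;

                  if (!g[i].direct)
                          dio = 0;                /* everything is buffered */
                  else if (!g[i].buffered)
                          dio = g[i].weight;      /* everything is direct */
                  else
                          dio = g[i].weight / 2;  /* even split */

                  async += g[i].weight - dio;
                  printf("%d dio weight for cgroup %s\n", dio, g[i].name);
          }
          printf("%d async weight for the root cgroup\n", async);
          return 0;
  }

which prints 500, 500 and 1000, matching the list above.)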

The second graph shows the result for another test case:
- cgroup A, weight 300: 1 buffered cp
- cgroup B, weight 600: 1 buffered dd + 1 direct dd
- cgroup C, weight 300: 1 direct dd
which is also working as expected.

Once cfq properly grants the total async IO share to the flusher,
balance_dirty_pages() will then do its original job of distributing
the buffered write bandwidth among the buffered dd tasks.

It will have to assume that the devices under the same bdi are
"symmetric". It also needs further stats feedback on IOPS or disk time
in order to do IOPS/time-based IO distribution. Anyway, it would be
interesting to see how far this scheme can go. I'll clean up the code
and post it soon.

Thanks,
Fengguang

[-- Attachment #2: balance_dirty_pages-task-bw.png --]
[-- Type: image/png, Size: 72619 bytes --]

[-- Attachment #3: balance_dirty_pages-task-bw.png --]
[-- Type: image/png, Size: 69646 bytes --]

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
       [not found]                         ` <20120412203719.GL2207-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2012-04-12 20:51                           ` Tejun Heo
@ 2012-04-15 11:37                           ` Peter Zijlstra
  1 sibling, 0 replies; 262+ messages in thread
From: Peter Zijlstra @ 2012-04-15 11:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, 2012-04-12 at 16:37 -0400, Vivek Goyal wrote:
> If yes, how does one map a filesystem's bdi we want to put rules on?
> 
/proc/self/mountinfo has the required bits
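
For example, an NFS mount shows up there as something like (an
illustrative line, fields abbreviated):

  36 25 0:20 / /mnt rw,relatime - nfs4 server:/export rw,addr=10.0.0.1

where the third field -- "0:20" here -- is the anonymous device number of
the mount, and is what the awk pipeline in the earlier message feeds to
/sys/class/bdi/.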

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
  2012-04-12 20:37                         ` Vivek Goyal
  (?)
@ 2012-04-15 11:37                           ` Peter Zijlstra
  -1 siblings, 0 replies; 262+ messages in thread
From: Peter Zijlstra @ 2012-04-15 11:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, ctalbott, rni, andrea, containers, linux-kernel, lsf,
	linux-mm, jmoyer, lizefan, cgroups, linux-fsdevel

On Thu, 2012-04-12 at 16:37 -0400, Vivek Goyal wrote:
> If yes, how does one map a filesystem's bdi we want to put rules on?
> 
/proc/self/mountinfo has the required bits

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
  2012-04-14 11:53       ` Peter Zijlstra
@ 2012-04-16  1:25       ` Steve French
  -1 siblings, 0 replies; 262+ messages in thread
From: Steve French @ 2012-04-16  1:25 UTC (permalink / raw)
  To: linux-cifs-u79uwXL29TY76Z2rM5mHXA

This long thread on linux-mm and linux-fsdevel has been discussing
writeback, throttling, cgroups etc.  This post reminded me that we
should look more carefully at the cifs bdi implementation, compare it
to nfs, and also check what needs to be improved in the bdi
implementation to handle smb2 credits.  It will be interesting to see
whether that helps writeback.

On Sat, Apr 14, 2012 at 6:53 AM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
>
> On Wed, 2012-04-04 at 11:49 -0700, Tejun Heo wrote:
> > > - How to handle NFS.
> >
> > As said above, maybe through network based bdi pressure propagation,
> > Maybe some other special case mechanism.  Unsure but I don't think
> > this concern should dictate the whole design.
>
> NFS has a custom bdi implementation and implements congestion control
> based on the number of outstanding writeback pages.
>
> See fs/nfs/write.c:nfs_{set,end}_page_writeback
>
> All !block based filesystems have their own BDI implementation, I'm not
> sure on the congestion implementation of anything other than NFS though.

--
Thanks,

Steve

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
@ 2012-04-16 12:54                       ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-16 12:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, ctalbott, rni, andrea, containers, linux-kernel, lsf,
	linux-mm, jmoyer, lizefan, cgroups, linux-fsdevel

On Sat, Apr 14, 2012 at 02:25:14PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote:
> > 
> > Ok, that's good to know. How would we configure this special bdi? I am
> > assuming there is no backing device visible in /sys/block/<device>/queue/?
> > Same is true for network file systems. 
> 
> root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done
> ls: cannot access /sys/class/bdi/0:20/: No such file or directory
> total 0
> drwxr-xr-x  3 root root    0 2012-03-27 23:18 .
> drwxr-xr-x 35 root root    0 2012-03-27 23:02 ..
> -rw-r--r--  1 root root 4096 2012-04-14 14:22 max_ratio
> -rw-r--r--  1 root root 4096 2012-04-14 14:22 min_ratio
> drwxr-xr-x  2 root root    0 2012-04-14 14:22 power
> -rw-r--r--  1 root root 4096 2012-04-14 14:22 read_ahead_kb
> lrwxrwxrwx  1 root root    0 2012-03-27 23:18 subsystem -> ../../../../class/bdi
> -rw-r--r--  1 root root 4096 2012-03-27 23:18 uevent

Ok, got it. So /proc/self/mountinfo has the information about st_dev and
one can use that to reach the associated bdi. Thanks Peter.
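
A hedged userspace sketch of that lookup (not from this thread; the
helper name is made up): stat() the mount point, take st_dev, and build
the /sys/class/bdi/<major:minor> path. As noted elsewhere in the
thread, this only works when the bdi is registered under its device
numbers, which is not the case for e.g. btrfs.

    /* hypothetical helper: map a mount point to its bdi sysfs directory */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>

    static int bdi_dir_for(const char *mntpoint, char *buf, size_t len)
    {
            struct stat st;

            if (stat(mntpoint, &st))
                    return -1;
            snprintf(buf, len, "/sys/class/bdi/%u:%u",
                     major(st.st_dev), minor(st.st_dev));
            return 0;
    }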

Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
  2012-04-16 12:54                       ` Vivek Goyal
@ 2012-04-16 13:07                         ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-16 13:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, ctalbott, rni, andrea, containers, linux-kernel,
	lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups

On Mon, Apr 16, 2012 at 08:54:32AM -0400, Vivek Goyal wrote:
> On Sat, Apr 14, 2012 at 02:25:14PM +0200, Peter Zijlstra wrote:
> > On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote:
> > > 
> > > Ok, that's good to know. How would we configure this special bdi? I am
> > > assuming there is no backing device visible in /sys/block/<device>/queue/?
> > > Same is true for network file systems. 
> > 
> > root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done
> > ls: cannot access /sys/class/bdi/0:20/: No such file or directory
> > total 0
> > drwxr-xr-x  3 root root    0 2012-03-27 23:18 .
> > drwxr-xr-x 35 root root    0 2012-03-27 23:02 ..
> > -rw-r--r--  1 root root 4096 2012-04-14 14:22 max_ratio
> > -rw-r--r--  1 root root 4096 2012-04-14 14:22 min_ratio
> > drwxr-xr-x  2 root root    0 2012-04-14 14:22 power
> > -rw-r--r--  1 root root 4096 2012-04-14 14:22 read_ahead_kb
> > lrwxrwxrwx  1 root root    0 2012-03-27 23:18 subsystem -> ../../../../class/bdi
> > -rw-r--r--  1 root root 4096 2012-03-27 23:18 uevent
> 
> Ok, got it. So /proc/self/mountinfo has the information about st_dev and
> one can use that to reach to associated bdi. Thanks Peter.

Vivek, I noticed this line in the cfq code:

                sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);

Why not use bdi->dev->devt?  The problem is that dev_name() will
return "btrfs-X" for btrfs rather than "major:minor".

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
  2012-04-16 13:07                         ` Fengguang Wu
@ 2012-04-16 14:19                           ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-16 14:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, ctalbott, rni, andrea, containers, linux-kernel,
	lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups

On Mon, Apr 16, 2012 at 09:07:07PM +0800, Fengguang Wu wrote:
> On Mon, Apr 16, 2012 at 08:54:32AM -0400, Vivek Goyal wrote:
> > On Sat, Apr 14, 2012 at 02:25:14PM +0200, Peter Zijlstra wrote:
> > > On Wed, 2012-04-11 at 11:40 -0400, Vivek Goyal wrote:
> > > > 
> > > > Ok, that's good to know. How would we configure this special bdi? I am
> > > > assuming there is no backing device visible in /sys/block/<device>/queue/?
> > > > Same is true for network file systems. 
> > > 
> > > root@twins:/usr/src/linux-2.6# awk '/nfs/ {print $3}' /proc/self/mountinfo | while read bdi ; do ls -la /sys/class/bdi/${bdi}/ ; done
> > > ls: cannot access /sys/class/bdi/0:20/: No such file or directory
> > > total 0
> > > drwxr-xr-x  3 root root    0 2012-03-27 23:18 .
> > > drwxr-xr-x 35 root root    0 2012-03-27 23:02 ..
> > > -rw-r--r--  1 root root 4096 2012-04-14 14:22 max_ratio
> > > -rw-r--r--  1 root root 4096 2012-04-14 14:22 min_ratio
> > > drwxr-xr-x  2 root root    0 2012-04-14 14:22 power
> > > -rw-r--r--  1 root root 4096 2012-04-14 14:22 read_ahead_kb
> > > lrwxrwxrwx  1 root root    0 2012-03-27 23:18 subsystem -> ../../../../class/bdi
> > > -rw-r--r--  1 root root 4096 2012-03-27 23:18 uevent
> > 
> > Ok, got it. So /proc/self/mountinfo has the information about st_dev and
> > one can use that to reach to associated bdi. Thanks Peter.
> 
> Vivek, I noticed these lines in cfq code
> 
>                 sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> 
> Why not use bdi->dev->devt?  The problem is that dev_name() will
> return "btrfs-X" for btrfs rather than "major:minor".

Sorry, it's not that simple. btrfs reports its fake btrfs_fs_info.bdi
to the upper layer, which is different from the bdis of
btrfs_fs_info.fs_devices.devices seen by cfq.

It's this fake btrfs bdi that is named "btrfs-X" by this function:

setup_bdi():
        bdi_setup_and_register(bdi, "btrfs", BDI_CAP_MAP_COPY);

This makes it difficult to interpret btrfs mountinfo, where you
cannot directly get the block device major/minor numbers:

35 16 0:26 / /fs/sda3 rw,relatime - btrfs /dev/sda3 rw,noacl
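
A hedged userspace sketch of one possible fallback (not from this
thread; the helper name is made up): when the mountinfo device field is
an anonymous "0:X", stat() the mount source column instead (here
/dev/sda3) to recover the real block device numbers. This assumes a
single-device filesystem; a multi-device btrfs would still need help
from the kernel.

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <sys/sysmacros.h>

    /* hypothetical helper: device numbers of the mount source, if it is a block device */
    static int source_devt(const char *mount_source, unsigned *maj, unsigned *min)
    {
            struct stat st;

            if (stat(mount_source, &st) || !S_ISBLK(st.st_mode))
                    return -1;      /* e.g. NFS or tmpfs: no block device */
            *maj = major(st.st_rdev);
            *min = minor(st.st_rdev);
            return 0;
    }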

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-14 14:36                               ` Fengguang Wu
@ 2012-04-16 14:57                                 ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-16 14:57 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Tejun Heo, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote:

[..]
> Yeah the backpressure idea would work nicely with all possible
> intermediate stacking between the bdi and leaf devices. In my attempt
> to do combined IO bandwidth control for
> 
> - buffered writes, in balance_dirty_pages()
> - direct IO, in the cfq IO scheduler
> 
> I have to look into the cfq code in the past days to get an idea how
> the two throttling layers can cooperate (and suffer from the pains
> arise from the violations of layers). It's also rather tricky to get
> two previously independent throttling mechanisms to work seamlessly
> with each other for providing the desired _unified_ user interface. It
> took a lot of reasoning and experiments to work the basic scheme out...
> 
> But here is the first result. The attached graph shows progress of 4
> tasks:
> - cgroup A: 1 direct dd + 1 buffered dd
> - cgroup B: 1 direct dd + 1 buffered dd
> 
> The 4 tasks are mostly progressing at the same pace. The top 2
> smoother lines are for the buffered dirtiers. The bottom 2 lines are
> for the direct writers. As you may notice, the two direct writers are
> somehow stalled for 1-2 times, which increases the gaps between the
> lines. Otherwise, the algorithm is working as expected to distribute
> the bandwidth to each task.
> 
> The current code's target is to satisfy the more realistic user demand
> of distributing bandwidth equally to each cgroup, and inside each
> cgroup, distribute bandwidth equally to buffered/direct writes. On top
> of which, weights can be specified to change the default distribution.
> 
> The implementation involves adding "weight for direct IO" to the cfq
> groups and "weight for buffered writes" to the root cgroup. Note that
> current cfq proportional IO conroller does not offer explicit control
> over the direct:buffered ratio.
> 
> When there are both direct/buffered writers in the cgroup,
> balance_dirty_pages() will kick in and adjust the weights for cfq to
> execute. Note that cfq will continue to send all flusher IOs to the
> root cgroup.  balance_dirty_pages() will compute the overall async
> weight for it so that in the above test case, the computed weights
> will be

I think having separate weights for sync IO groups and async IO is not
very appealing. There should be one notion of group weight, and bandwidth
should be distributed among groups according to their weight.

Now one can argue that within a group there might be one knob in CFQ
which allows changing the share of sync vs. async IO.

Also, Tejun and Jan have expressed the desire that once we have figured
out a way to communicate the submitter's context for async IO, we would
like to account that IO to the associated cgroup instead of the root
cgroup (as we do today).

> 
> - 1000 async weight for the root cgroup (2 buffered dds)
> - 500 dio weight for cgroup A (1 direct dd)
> - 500 dio weight for cgroup B (1 direct dd)
> 
> The second graph shows result for another test case:
> - cgroup A, weight 300: 1 buffered cp
> - cgroup B, weight 600: 1 buffered dd + 1 direct dd
> - cgroup C, weight 300: 1 direct dd
> which is also working as expected.
> 
> Once the cfq properly grants total async IO share to the flusher,
> balance_dirty_pages() will then do its original job of distributing
> the buffered write bandwidth among the buffered dd tasks.
> 
> It will have to assume that the devices under the same bdi are
> "symmetry". It also needs further stats feedback on IOPS or disk time
> in order to do IOPS/time based IO distribution. Anyway it would be
> interesting to see how far this scheme can go. I'll cleanup the code
> and post it soon.

Your proposal relies on a few things.

- Bandwidth needs to be divided equally among sync and async IO.
- Flusher thread async IO will always go to the root cgroup.
- I am not sure how this scheme is going to work when we introduce
  hierarchical blkio cgroups.
- cgroup weights for sync IO seem to be controlled by the user, while
  the root cgroup weight is silently controlled by this async IO
  logic.

Overall this sounds like a very odd design to me. I am not sure what we
are achieving by it. In the current scheme one should be able to just
adjust the weight of the root cgroup through the cgroup interface and
achieve the same results you are showing. So where is the need to change
it dynamically inside the kernel?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
  2012-04-16 13:07                         ` Fengguang Wu
@ 2012-04-16 15:52                           ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-16 15:52 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Peter Zijlstra, ctalbott, rni, andrea, containers, linux-kernel,
	lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups

On Mon, Apr 16, 2012 at 09:07:07PM +0800, Fengguang Wu wrote:

[..]
> Vivek, I noticed these lines in cfq code
> 
>                 sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> 
> Why not use bdi->dev->devt?  The problem is that dev_name() will
> return "btrfs-X" for btrfs rather than "major:minor".

Isn't bdi->dev->devt 0?  I see the following code:

add_disk()
   bdi_register_dev()
      bdi_register()
         device_create_vargs(MKDEV(0,0))
	      dev->devt = devt = MKDEV(0,0);

So for normal block devices, I think bdi->dev->devt will be zero; that's
probably why we don't use it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
@ 2012-04-17  2:14                               ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-17  2:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, ctalbott, rni, andrea, containers, linux-kernel,
	lsf, linux-mm, jmoyer, lizefan, linux-fsdevel, cgroups

On Mon, Apr 16, 2012 at 11:52:07AM -0400, Vivek Goyal wrote:
> On Mon, Apr 16, 2012 at 09:07:07PM +0800, Fengguang Wu wrote:
> 
> [..]
> > Vivek, I noticed these lines in cfq code
> > 
> >                 sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> > 
> > Why not use bdi->dev->devt?  The problem is that dev_name() will
> > return "btrfs-X" for btrfs rather than "major:minor".
> 
> Isn't bdi->dev->devt 0?  I see following code.
> 
> add_disk()
>    bdi_register_dev()
>       bdi_register()
>          device_create_vargs(MKDEV(0,0))
> 	      dev->devt = devt = MKDEV(0,0);
> 
> So for normal block devices, I think bdi->dev->devt will be zero, that's
> why probably we don't use it.

Yes indeed. I can confirm this with tracing. There are two main cases:

- some filesystems do not have a real device for the bdi.

- add_disk() calls bdi_register_dev() with the devt; however, this
  information is not passed down for some reason.
  device_create_vargs() will only try to create a sysfs dev file if
  the devt is not MKDEV(0,0).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                       ` <20120411170542.GB16008-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
  2012-04-11 17:23                         ` Vivek Goyal
@ 2012-04-17 21:48                         ` Tejun Heo
  1 sibling, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-17 21:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vivek Goyal, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman,
	andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu,
	lizefan, containers, cgroups, ctalbott, rni, lsf

Hello,

On Wed, Apr 11, 2012 at 07:05:42PM +0200, Jan Kara wrote:
> > The additional feature for buffered throttle (which never went upstream),
> > was synchronous in nature. That is we were actively putting writer to
> > sleep on a per cgroup wait queue in the request queue and wake it up when
> > it can do further IO based on cgroup limits.
>
>   Hmm, but then there would be similar starvation issues as with my simple
> scheme because async IO could always use the whole available bandwidth.
> Mixing of sync & async throttling is really problematic... I'm wondering
> how useful the async throttling is. Because we will block on request
> allocation once there are more than nr_requests pending requests so at that
> point throttling becomes sync anyway.

I haven't thought about the interface too much yet but, with the
synchronous wait at transaction start, we have information both ways -
i.e. the lower layer also knows that there are synchronous waiters.  At
the simplest, not allowing any more async IOs when sync writers exist
should solve the starvation issue.

As for priority inversion through the shared request pool, it is a
problem which needs to be solved regardless of how async IOs are
throttled.  I haven't decided to what extent yet, though.  Different
cgroups definitely need to be on separate pools, but do we also want to
distinguish sync and async, and what about ioprio?  Maybe we need a
hybrid approach with a larger common pool and reserved ones for each
class?
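
A purely illustrative sketch of that hybrid idea (all names are
hypothetical; this is not an existing kernel interface): each cgroup
gets a small per-class reserve on top of a larger shared pool, so one
group's async flood cannot starve another group's sync requests.

    #include <stdbool.h>

    enum rq_class { RQ_SYNC, RQ_ASYNC, RQ_NR_CLASS };

    struct blkcg_rq_pool {
            int reserved[RQ_NR_CLASS];      /* guaranteed per cgroup and class */
            int in_flight[RQ_NR_CLASS];
    };

    struct shared_rq_pool {
            int total;                      /* common pool shared by everyone */
            int in_flight;
    };

    /* allow allocation if it fits the group's reserve or the shared pool */
    static bool may_allocate(const struct blkcg_rq_pool *grp,
                             const struct shared_rq_pool *shared,
                             enum rq_class cls)
    {
            if (grp->in_flight[cls] < grp->reserved[cls])
                    return true;
            return shared->in_flight < shared->total;
    }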

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                     ` <20120411192231.GF16008-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
  2012-04-12 20:37                         ` Vivek Goyal
@ 2012-04-17 22:01                       ` Tejun Heo
  1 sibling, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-17 22:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vivek Goyal, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman,
	andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu,
	lizefan, containers, cgroups, ctalbott, rni, lsf

Hello,

On Wed, Apr 11, 2012 at 09:22:31PM +0200, Jan Kara wrote:
> > So all the metadata IO will happen through the journaling thread and
> > that will be in the root group which should remain unthrottled. So any
> > journal IO going to disk should remain unthrottled.
>
>   Yes, that is true at least for ext3/ext4 or btrfs. In principle we don't
> have to have the journal thread (as is the case of reiserfs where random
> writer may end up doing commit) but let's not complicate things
> unnecessarily.

Why can't journal entries keep track of the originator so that bios
can be attributed to the originator while committing?  That shouldn't
be too difficult to implement, no?

> > Now, IIRC, the fsync problem with throttling was that we had opened a
> > transaction but could not write it back to disk because we had to
> > wait for all the cached data to go to disk (which is throttled). So
> > my question is, can't we first wait for all the data to be flushed
> > to disk and then open a transaction for the metadata? Metadata will be
> > unthrottled, so the filesystem will not have to do any tricks like
> > checking whether the bdi is congested or not.
>
>   Actually that's what's happening. We first do filemap_write_and_wait()
> which syncs all the data and then we go and force a transaction commit to
> make sure all metadata got to stable storage. The problem is that writeout
> of data may need to allocate new blocks and that starts a transaction, and
> while the transaction is started we may need to do some reads (e.g. of
> bitmaps etc.) which may be throttled, and at that moment the whole
> filesystem is blocked. I don't remember the stack traces you showed me so
> I'm not sure if this is what you observed, but it's certainly one possible
> scenario. The reason why fsync triggers problems is simply that it's the
> only place where a process normally does a significant amount of writing.
> In most cases the flusher thread / journal thread do it, so this effect is
> not visible. And to preempt your question, it would be rather hard to
> avoid IO while the transaction is started, due to locking.

Probably we should mark all IOs issued inside a transaction as META (or
whatever tells blkcg to avoid throttling them).  We're gonna need
overcharging for metadata writes anyway, so I don't think this will
make too much of a difference.
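
As a rough sketch of what I mean (hypothetical names, not the actual
throttling code; the flag just stands in for something REQ_META-like):

  #include <stdbool.h>

  #define BIO_META (1u << 0)     /* stand-in for a REQ_META-style flag */

  struct bio_stub { unsigned int flags; };
  struct tg_stub  { long budget; };        /* per-cgroup throttle budget */

  /* decide whether a bio has to wait for throttle budget */
  static bool tg_should_delay(struct tg_stub *tg, struct bio_stub *bio,
                              long cost)
  {
          if (bio->flags & BIO_META) {
                  tg->budget -= cost;      /* overcharge, never block */
                  return false;
          }
          if (tg->budget < cost)
                  return true;             /* normal data IO waits */
          tg->budget -= cost;
          return false;
  }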

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-06  9:59         ` Fengguang Wu
  (?)
@ 2012-04-17 22:38           ` Tejun Heo
  -1 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-17 22:38 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hello, Fengguang.

On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote:
> Fortunately, the above gap can be easily filled judging from the
> block/cfq IO controller code. By adding some direct IO accounting
> and changing several lines of my patches to make use of the collected
> stats, the semantics of the blkio.throttle.write_bps interfaces can be
> changed from "limit for direct IO" to "limit for direct+buffered IOs".
> Ditto for blkio.weight and blkio.write_iops, as long as some
> iops/device time stats are made available to balance_dirty_pages().
> 
> It would be a fairly *easy* change. :-) It's merely adding some
> accounting code and there is no need to change the block IO
> controlling algorithm at all. I'll do the work of accounting (which
> is basically independent of the IO controlling) and use the new stats
> in balance_dirty_pages().

I don't really understand how this can work.  For hard limits, maybe,
but for proportional IO, you have to know which cgroups have IOs
pending before assigning the proportions, so blkcg assigning IO
bandwidth without knowing about async writes simply can't work.

For example, let's say cgroups A and B have a 2:8 split.  If A has IOs
on queue and B doesn't, blkcg will assign all IO bandwidth to A.  I
can't wrap my head around how writeback is gonna make use of the
resulting stats, but let's say it decides it needs to put out some IOs
for both cgroups.  What happens then?  Do all the async writes go
through the root cgroup, controlled by and affecting the ratio between
rootcg and cgroups A and B?  Or do they have to be accounted as part of
cgroups A and B?  If so, what if the added bandwidth goes over the
limit?  Let's say we implement overcharging; then, I suppose we'll
have to communicate that upwards too, right?
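
To make that 2:8 example concrete, here's a toy user-space calculation
(nothing to do with the real cfq code, names invented) showing why the
weights alone don't give you a number until you know who has IO pending:

  #include <stdio.h>

  struct grp { const char *name; int weight; int has_pending_io; };

  int main(void)
  {
          struct grp g[] = { { "A", 2, 1 }, { "B", 8, 0 } };  /* B is idle */
          int i, active_weight = 0;

          /* only groups with pending IO take part in the split */
          for (i = 0; i < 2; i++)
                  if (g[i].has_pending_io)
                          active_weight += g[i].weight;

          for (i = 0; i < 2; i++)
                  printf("%s: %.0f%% of the device\n", g[i].name,
                         g[i].has_pending_io && active_weight ?
                         100.0 * g[i].weight / active_weight : 0.0);
          return 0;   /* prints A: 100%, B: 0% despite the 2:8 weights */
  }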

This is still easy.  What about hierarchical propio (proportional IO)?
What happens then?  You can't do hierarchical proportional allocation
without knowing how much IO is pending for which group.  How is that
information gonna be communicated between blkcg and writeback?  Are we
gonna have two separate hierarchical proportional IO allocators?  How
is that gonna work at all?  If we're gonna have a single allocator in
the block layer, writeback would have to feed the amount of IO it may
generate into the allocator, get the resulting allocation, then issue
the IO, and then the block layer again will have to account these to
the originating cgroups.  It's just crazy.

> The only problem I can see now is that balance_dirty_pages() works
> per-bdi and blkcg works per-device. So the two ends may not match
> nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where
> sdb is shared by lv0 and lv1. However, these should be rare situations
> and much more acceptable than the problems arising from the "push back"
> approach, which impacts everyone.

I don't know.  What problems?  AFAICS, the biggest issue is writeback
of different inodes getting mixed, resulting in poor performance, but
if you think about it, that's about the frequency of switching cgroups
and a problem which can and should be dealt with from the block layer
(e.g. use a larger time slice if all the pending IOs are async).

Writeback's duty is generating a stream of async writes which can be
served efficiently for the *cgroup*, keeping the buffer filled as
necessary, and chaining the backpressure from there to the actual
dirtier.  That's what writeback does without cgroups.  Nothing
fundamental changes with cgroups.  It's just finer grained.

> > No, no, it's not about standing in my way.  As Vivek said in the other
> > reply, it's that the "gap" that you filled was created *because*
> > writeback wasn't cgroup aware and now you're in turn filling that gap
> > by making writeback work around that "gap".  I mean, my mind boggles.
> > Doesn't yours?  I strongly believe everyone's should.
> 
> Heh. It's a hard problem indeed. I felt great pains in the IO-less
> dirty throttling work. I did a lot of reasoning about it, and have in
> fact kept the cgroup IO controller in mind since its early days. Now
> I'd say it's hands down the right place to adapt to the gap between
> the total IO limit and what's carried out by the block IO controller.

You're not providing any valid counter arguments about the issues
being raised about the messed up design.  How is anything "hands down"
here?

> > There's where I'm confused.  How is the said split supposed to work?
> > They aren't independent.  I mean, who gets to decide what and where
> > are those decisions enforced?
> 
> Yeah it's not independent. It's about
> 
> - keep block IO cgroup untouched (in its current algorithm, for
>   throttling direct IO)
> 
> - let balance_dirty_pages() adapt to the throttling target
>   
>         buffered_write_limit = total_limit - direct_IOs

Think about proportional allocation.  You don't have a number until
you know who has pending IOs and how much.

> To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> It's always there doing 1:1 proportional throttling. Then you try to
> kick in and add *double* throttling in the block/cfq layer. Now the low
> layer may enforce 10:1 throttling and push balance_dirty_pages() away
> from its balanced state, leading to large fluctuations and program
> stalls.

Just do the same 1:1 inside each cgroup.
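
By which I mean something like the following - a hand-wavy sketch with
invented names, not a patch: carve each cgroup its own dirty threshold
out of the global one and run the existing balancing against that
instead of against the global number:

  struct wb_group {
          unsigned long dirty;    /* dirty pages attributed to this cgroup */
          unsigned long thresh;   /* its carved-out share of the global limit */
  };

  /* how hard to throttle a dirtier in this group; 0 means don't */
  static unsigned long group_pause(const struct wb_group *g,
                                   unsigned long base_pause)
  {
          if (!g->thresh || g->dirty <= g->thresh)
                  return 0;
          /* grows with how far the group is over its own threshold */
          return base_pause * (g->dirty - g->thresh) / g->thresh;
  }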

>  This can be avoided by telling balance_dirty_pages(): "your
> balance goal is no longer 1:1, but 10:1". With this information
> balance_dirty_pages() will behave right. Then there is the question:
> if balance_dirty_pages() will work just as well provided the
> information, why bother doing the throttling at the low layer and
> "push back" the pressure all the way up?

Because splitting a resource into two pieces arbitrarily, with
different amounts of consumption on each side, and then applying the
same proportion to both doesn't mean anything?

> The balance_dirty_pages() is already deeply involved in dirty throttling.
> As you can see from this patchset, the same algorithms can be extended
> trivially to work with cgroup IO limits.
> 
> buffered write IO controller in balance_dirty_pages()
> https://lkml.org/lkml/2012/3/28/275

It is a half-broken thing with fundamental design flaws which can't be
corrected without a complete reimplementation.  I don't know what to
say.

> In the "back pressure" scheme, memcg is a must because only it has all
> the infrastructure to track dirty pages upon which you can apply some
> dirty_limits. Don't tell me you want to account dirty pages in blkcg...

For now, per-inode tracking seems good enough.

> What I can see is, it looks pretty simple and natural to let
> balance_dirty_pages() fill the gap towards a total solution :-)
> 
> - add direct IO accounting at some convenient point of the IO path
>   (IO submission or completion point, either is fine)
> 
> - change several lines of the buffered write IO controller to
>   integrate the direct IO rate into the formula to fit the "total
>   IO" limit
> 
> - in future, add more accounting as well as feedback control to make
>   balance_dirty_pages() work with IOPS and disk time

To me, you don't seem to be addressing the issues I've been raising at
all, just repeating the same points again and again.  If I'm
misunderstanding something, please point it out.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-17 22:01                       ` Tejun Heo
@ 2012-04-18  6:30                         ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-18  6:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Vivek Goyal, Fengguang Wu, Jens Axboe, linux-mm,
	sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni,
	lsf

  Hello,

On Tue 17-04-12 15:01:06, Tejun Heo wrote:
> On Wed, Apr 11, 2012 at 09:22:31PM +0200, Jan Kara wrote:
> > > So all the metadata IO will happen through the journaling thread and
> > > that will be in the root group which should remain unthrottled. So any
> > > journal IO going to disk should remain unthrottled.
> >
> >   Yes, that is true at least for ext3/ext4 or btrfs. In principle we don't
> > have to have the journal thread (as is the case of reiserfs where random
> > writer may end up doing commit) but let's not complicate things
> > unnecessarily.
> 
> Why can't journal entries keep track of the originator so that bios
> can be attributed to the originator while committing?  That shouldn't
> be too difficult to implement, no?
  I think I was just describing the current state but yes, in the future
we can track which cgroup first attached a buffer to a transaction.
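
Something like the following, conceptually (made-up structures, not
jbd2 code): remember the first attacher and let the commit submit that
buffer's IO on its behalf:

  struct jh_stub {
          int owner_cgroup;        /* -1 while unattributed */
  };

  static void note_first_attacher(struct jh_stub *jh, int attaching_cgroup)
  {
          if (jh->owner_cgroup < 0)        /* first cgroup to dirty it wins */
                  jh->owner_cgroup = attaching_cgroup;
  }

  /* at commit time the journal would issue the write attributed to
   * owner_cgroup instead of to the committing thread's (root) group */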

> > > Now, IIRC, the fsync problem with throttling was that we had opened a
> > > transaction but could not write it back to disk because we had to
> > > wait for all the cached data to go to disk (which is throttled). So
> > > my question is, can't we first wait for all the data to be flushed
> > > to disk and then open a transaction for the metadata? Metadata will be
> > > unthrottled, so the filesystem will not have to do any tricks like
> > > checking whether the bdi is congested or not.
> >
> >   Actually that's what's happening. We first do filemap_write_and_wait()
> > which syncs all the data and then we go and force a transaction commit to
> > make sure all metadata got to stable storage. The problem is that writeout
> > of data may need to allocate new blocks and that starts a transaction, and
> > while the transaction is started we may need to do some reads (e.g. of
> > bitmaps etc.) which may be throttled, and at that moment the whole
> > filesystem is blocked. I don't remember the stack traces you showed me so
> > I'm not sure if this is what you observed, but it's certainly one possible
> > scenario. The reason why fsync triggers problems is simply that it's the
> > only place where a process normally does a significant amount of writing.
> > In most cases the flusher thread / journal thread do it, so this effect is
> > not visible. And to preempt your question, it would be rather hard to
> > avoid IO while the transaction is started, due to locking.
> 
> Probably we should mark all IOs issued inside a transaction as META (or
> whatever tells blkcg to avoid throttling them).  We're gonna need
> overcharging for metadata writes anyway, so I don't think this will
> make too much of a difference.
  Agreed.

								Honza

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-18  6:30                         ` Jan Kara
  0 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-18  6:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Vivek Goyal, Fengguang Wu, Jens Axboe, linux-mm,
	sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni,
	lsf

  Hello,

On Tue 17-04-12 15:01:06, Tejun Heo wrote:
> On Wed, Apr 11, 2012 at 09:22:31PM +0200, Jan Kara wrote:
> > > So all the metadata IO will happen thorough journaling thread and that
> > > will be in root group which should remain unthrottled. So any journal
> > > IO going to disk should remain unthrottled.
> >
> >   Yes, that is true at least for ext3/ext4 or btrfs. In principle we don't
> > have to have the journal thread (as is the case of reiserfs where random
> > writer may end up doing commit) but let's not complicate things
> > unnecessarily.
> 
> Why can't journal entries keep track of the originator so that bios
> can be attributed to the originator while committing?  That shouldn't
> be too difficult to implement, no?
  I think I was just describing the current state but yes, in future we
can track which cgroup first attached a buffer to a transaction.

> > > Now, IIRC, fsync problem with throttling was that we had opened a
> > > transaction but could not write it back to disk because we had to
> > > wait for all the cached data to go to disk (which is throttled). So
> > > my question is, can't we first wait for all the data to be flushed
> > > to disk and then open a transaction for metadata. metadata will be
> > > unthrottled so filesystem will not have to do any tricks like bdi is
> > > congested or not.
> >
> >   Actually that's what's happening. We first do filemap_write_and_wait()
> > which syncs all the data and then we go and force transaction commit to
> > make sure all metadata got to stable storage. The problem is that writeout
> > of data may need to allocate new blocks and that starts a transaction and
> > while the transaction is started we may need to do some reads (e.g. of
> > bitmaps etc.) which may be throttled and at that moment the whole
> > filesystem is blocked. I don't remember the stack traces you showed me so
> > I'm not sure it this is what your observed but it's certainly one possible
> > scenario. The reason why fsync triggers problems is simply that it's the
> > only place where process normally does significant amount of writing. In
> > most cases flusher thread / journal thread do it so this effect is not
> > visible. And to precede your question, it would be rather hard to avoid IO
> > while the transaction is started due to locking.
> 
> Probably we should mark all IOs issued inside transaction as META (or
> whatever which tells blkcg to avoid throttling it).  We're gonna need
> overcharging for metadata writes anyway, so I don't think this will
> make too much of a difference.
  Agreed.

								Honza

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-06  9:59         ` Fengguang Wu
                           ` (4 preceding siblings ...)
  (?)
@ 2012-04-18  6:57         ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-18  6:57 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

On Fri 06-04-12 02:59:34, Wu Fengguang wrote:
...
> > > > Let's please keep the layering clear.  IO limitations will be applied
> > > > at the block layer and pressure will be formed there and then
> > > > propagated upwards eventually to the originator.  Sure, exposing the
> > > > whole information might result in better behavior for certain
> > > > workloads, but down the road, say, in three or five years, devices
> > > > which can be shared without worrying too much about seeks might be
> > > > commonplace and we could be swearing at a disgusting structural mess,
> > > > and sadly various cgroup support seems to be a prominent source of
> > > > such design failures.
> > > 
> > > Super fast storages are coming which will make us regret to make the
> > > IO path over complex.  Spinning disks are not going away anytime soon.
> > > I doubt Google is willing to afford the disk seek costs on its
> > > millions of disks and has the patience to wait until switching all of
> > > the spin disks to SSD years later (if it will ever happen).
> > 
> > This is new.  Let's keep the damn employer out of the discussion.
> > While the area I work on is affected by my employment (writeback isn't
> > even my area BTW), I'm not gonna do something adverse to upstream even
> > if it's beneficial to google and I'm much more likely to do something
> > which may hurt google a bit if it's gonna benefit upstream.
> > 
> > As for the faster / newer storage argument, that is *exactly* why we
> > want to keep the layering proper.  Writeback works from the pressure
> > from the IO stack.  If IO technology changes, we update the IO stack
> > and writeback still works from the pressure.  It may need to be
> > adjusted but the principles don't change.
> 
> To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> It's always there doing 1:1 proportional throttling. Then you try to
> kick in to add *double* throttling in block/cfq layer. Now the low
> layer may enforce 10:1 throttling and push balance_dirty_pages() away
> from its balanced state, leading to large fluctuations and program
> stalls.  This can be avoided by telling balance_dirty_pages(): "your
> balance goal is no longer 1:1, but 10:1". With this information
> balance_dirty_pages() will behave right. Then there is the question:
> if balance_dirty_pages() will work just well provided the information,
> why bother doing the throttling at low layer and "push back" the
> pressure all the way up?
  Fengguang, maybe we should first agree on some basics:
  The two main goals of balance_dirty_pages() are (and always have been,
AFAIK) to limit the amount of dirty pages in memory and to keep enough
dirty pages in memory to allow for efficient writeback. A secondary goal
is to also keep the amount of dirty pages somewhat fair among bdis and
processes. Agreed?
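
To make those two goals concrete, here is a minimal sketch (hypothetical
names, grossly simplified, and not the actual mm/page-writeback.c code)
of the kind of decision balance_dirty_pages() has to make:

#include <stdbool.h>

struct dirty_limits {
        unsigned long background_thresh;  /* enough dirty pages queued for
                                             efficient, large writeback IO */
        unsigned long dirty_thresh;       /* hard cap on dirty memory      */
};

/* Called after a task dirties pages: does it have to be paused? */
static bool dirtier_must_throttle(unsigned long nr_dirty,
                                  const struct dirty_limits *dl)
{
        /* Below the background threshold: let dirty pages accumulate so
           the flusher can later issue efficient IO. */
        if (nr_dirty <= dl->background_thresh)
                return false;
        /* Above the hard threshold: the dirtier is paused, which is what
           bounds the amount of dirty page cache. */
        return nr_dirty > dl->dirty_thresh;
}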

Thus a shift to trying to control *IO throughput* (or even just buffered
write throughput) from balance_dirty_pages() is a fundamental shift in the
goals of balance_dirty_pages(), not just some tweak (although, technically,
it might be relatively easy to do for buffered writes given the current
implementation).

...
> > Well, I tried and I hope some of it got through.  I also wrote a lot
> > of questions, mainly regarding how what you have in mind is supposed
> > to work through what path.  Maybe I'm just not seeing what you're
> > seeing but I just can't see where all the IOs would go through and
> > come together.  Can you please elaborate more on that?
> 
> What I can see is, it looks pretty simple and nature to let
> balance_dirty_pages() fill the gap towards a total solution :-)
> 
> - add direct IO accounting in some convenient point of the IO path
>   IO submission or completion point, either is fine.
> 
> - change several lines of the buffered write IO controller to
>   integrate the direct IO rate into the formula to fit the "total
>   IO" limit
> 
> - in future, add more accounting as well as feedback control to make
>   balance_dirty_pages() work with IOPS and disk time
  Sorry Fengguang but I also think this is a wrong way to go.
balance_dirty_pages() must primarily control the amount of dirty pages.
Trying to bend it to control IO throughput by including direct IO and
reads in the accounting will just make the logic even more complex than it
already is.

								Honza

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]           ` <20120418065720.GA21485-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
@ 2012-04-18  7:58             ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-18  7:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Wed, Apr 18, 2012 at 08:57:20AM +0200, Jan Kara wrote:
> On Fri 06-04-12 02:59:34, Wu Fengguang wrote:
> ...
> > > > > Let's please keep the layering clear.  IO limitations will be applied
> > > > > at the block layer and pressure will be formed there and then
> > > > > propagated upwards eventually to the originator.  Sure, exposing the
> > > > > whole information might result in better behavior for certain
> > > > > workloads, but down the road, say, in three or five years, devices
> > > > > which can be shared without worrying too much about seeks might be
> > > > > commonplace and we could be swearing at a disgusting structural mess,
> > > > > and sadly various cgroup support seems to be a prominent source of
> > > > > such design failures.
> > > > 
> > > > Super fast storages are coming which will make us regret to make the
> > > > IO path over complex.  Spinning disks are not going away anytime soon.
> > > > I doubt Google is willing to afford the disk seek costs on its
> > > > millions of disks and has the patience to wait until switching all of
> > > > the spin disks to SSD years later (if it will ever happen).
> > > 
> > > This is new.  Let's keep the damn employer out of the discussion.
> > > While the area I work on is affected by my employment (writeback isn't
> > > even my area BTW), I'm not gonna do something adverse to upstream even
> > > if it's beneficial to google and I'm much more likely to do something
> > > which may hurt google a bit if it's gonna benefit upstream.
> > > 
> > > As for the faster / newer storage argument, that is *exactly* why we
> > > want to keep the layering proper.  Writeback works from the pressure
> > > from the IO stack.  If IO technology changes, we update the IO stack
> > > and writeback still works from the pressure.  It may need to be
> > > adjusted but the principles don't change.
> > 
> > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > It's always there doing 1:1 proportional throttling. Then you try to
> > kick in to add *double* throttling in block/cfq layer. Now the low
> > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > from its balanced state, leading to large fluctuations and program
> > stalls.  This can be avoided by telling balance_dirty_pages(): "your
> > balance goal is no longer 1:1, but 10:1". With this information
> > balance_dirty_pages() will behave right. Then there is the question:
> > if balance_dirty_pages() will work just well provided the information,
> > why bother doing the throttling at low layer and "push back" the
> > pressure all the way up?
>   Fengguang, maybe we should first agree on some basics:
>   The two main goals of balance_dirty_pages() are (and always have been,
> AFAIK) to limit the amount of dirty pages in memory and to keep enough
> dirty pages in memory to allow for efficient writeback. A secondary goal
> is to also keep the amount of dirty pages somewhat fair among bdis and
> processes. Agreed?

Agreed. In fact, before the IO-less change, balance_dirty_pages() had
not much explicit control over the dirty rate or fairness.

> Thus a shift to trying to control *IO throughput* (or even just buffered
> write throughput) from balance_dirty_pages() is a fundamental shift in the
> goals of balance_dirty_pages(), not just some tweak (although, technically,
> it might be relatively easy to do for buffered writes given the current
> implementation).

Yes, it has been a big shift to rate-based dirty control.

> ...
> > > Well, I tried and I hope some of it got through.  I also wrote a lot
> > > of questions, mainly regarding how what you have in mind is supposed
> > > to work through what path.  Maybe I'm just not seeing what you're
> > > seeing but I just can't see where all the IOs would go through and
> > > come together.  Can you please elaborate more on that?
> > 
> > What I can see is, it looks pretty simple and nature to let
> > balance_dirty_pages() fill the gap towards a total solution :-)
> > 
> > - add direct IO accounting in some convenient point of the IO path
> >   IO submission or completion point, either is fine.
> > 
> > - change several lines of the buffered write IO controller to
> >   integrate the direct IO rate into the formula to fit the "total
> >   IO" limit
> > 
> > - in future, add more accounting as well as feedback control to make
> >   balance_dirty_pages() work with IOPS and disk time
>   Sorry Fengguang but I also think this is a wrong way to go.
> balance_dirty_pages() must primarily control the amount of dirty pages.
> Trying to bend it to control IO throughput by including direct IO and
> reads in the accounting will just make the logic even more complex than it
> already is.

Right, I have been adding too much complexity to balance_dirty_pages().
The control algorithms are pretty hard to understand and get right for
all cases.

OK, I'll post results of my experiments up to now, answer some
questions and take a comfortable break. Phooo..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                         ` <20120417214831.GE19975-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-18 18:18                           ` Vivek Goyal
  0 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-18 18:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Fengguang Wu, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Tue, Apr 17, 2012 at 02:48:31PM -0700, Tejun Heo wrote:
[..]

> As for priority inversion through shared request pool, it is a problem
> which needs to be solved regardless of how async IOs are throttled.
> I'm not determined to which extent yet tho.  Different cgroups
> definitely need to be on separate pools but do we also want
> distinguish sync and async and what about ioprio?  Maybe we need a
> bybrid approach with larger common pool and reserved ones for each
> class?

Currently we have a global pool with separate limits for sync and async,
and there is no consideration of ioprio. To keep it simple, I think we
can just extend the same notion and keep a per-cgroup pool with internal
limits on sync/async requests, to make sure sync IO does not get
serialized behind async IO. Personally I am not too worried about
async IO prio. It has never worked.
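
As a rough sketch of that idea (hypothetical types and helpers, not the
existing request allocation code), it could look something like:

struct cgroup_request_pool {
        int nr_sync;    /* sync requests currently allocated by the cgroup  */
        int nr_async;   /* async requests currently allocated by the cgroup */
        int max_sync;   /* per-cgroup sync limit                            */
        int max_async;  /* per-cgroup async limit, kept small enough that
                           async writeback cannot starve sync IO            */
};

/* Gate request allocation per cgroup instead of against one global pool. */
static int may_alloc_request(const struct cgroup_request_pool *pool,
                             int is_sync)
{
        if (is_sync)
                return pool->nr_sync < pool->max_sync;
        return pool->nr_async < pool->max_async;
}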

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]           ` <20120417223854.GG19975-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-19 14:23             ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-19 14:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hi Tejun,

On Tue, Apr 17, 2012 at 03:38:54PM -0700, Tejun Heo wrote:
> Hello, Fengguang.
> 
> On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote:
> > Fortunately, the above gap can be easily filled judging from the
> > block/cfq IO controller code. By adding some direct IO accounting
> > and changing several lines of my patches to make use of the collected
> > stats, the semantics of the blkio.throttle.write_bps interfaces can be
> > changed from "limit for direct IO" to "limit for direct+buffered IOs".
> > Ditto for blkio.weight and blkio.write_iops, as long as some
> > iops/device time stats are made available to balance_dirty_pages().
> > 
> > It would be a fairly *easy* change. :-) It's merely adding some
> > accounting code and there is no need to change the block IO
> > controlling algorithm at all. I'll do the work of accounting (which
> > is basically independent of the IO controlling) and use the new stats
> > in balance_dirty_pages().
> 
> I don't really understand how this can work.  For hard limits, maybe,

Yeah, hard limits are the easiest.

> but for proportional IO, you have to know which cgroups have IOs
> before assigning the proportions, so blkcg assigning IO bandwidth
> without knowing async writes simply can't work.
> 
> For example, let's say cgroups A and B have 2:8 split.  If A has IOs
> on queue and B doesn't, blkcg will assign all IO bandwidth to A.  I
> can't wrap my head around how writeback is gonna make use of the
> resulting stats but let's say it decides it needs to put out some IOs
> out for both cgroups.  What happens then?  Do all the async writes go
> through the root cgroup controlled by and affecting the ratio between
> rootcg and cgroup A and B?  Or do they have to be accounted as part of
> cgroups A and B?  If so, what if the added bandwidth goes over the
> limit?  Let's say if we implement overcharge; then, I suppose we'll
> have to communicate that upwards too, right?

The trick is to do the throttling for buffered writes at page-dirty
time, when balance_dirty_pages() knows exactly which cgroup the dirtying
task belongs to, its dirty rate, and whether or not it's an aggressive
dirtier. The cgroup's direct IO rate can also be measured. The only
missing information is whether a direct writer is non-aggressive (only
cfq may know about that); for now I'm simply assuming all direct writers
are aggressive.

So if A and B have a 2:8 split, and A only submits async IO while B only
submits direct IO, no cfqg will exist for A at all; cfq will serve B and
the root cgroup in an interleaved fashion. In the patch I just posted,
blkcg_update_dirty_ratelimit() will transfer A's weight of 2 to the root
cgroup for use by the flusher. In the end the flusher gets weight 2
and B gets weight 8. Here we need to distinguish the weight assigned
by the user from the weight after the async/sync adjustment.
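
As a rough illustration of that adjustment (hypothetical helper and
types, far simpler than what blkcg_update_dirty_ratelimit() actually
does in the patch):

struct cgrp_weight {
        unsigned int weight;    /* weight configured by the user               */
        int async_only;         /* only buffered writes, so no cfqg of its own */
};

/* Weight of async-only cgroups is lent to the root cgroup (the flusher). */
static unsigned int flusher_weight(unsigned int root_weight,
                                   const struct cgrp_weight *g, int n)
{
        for (int i = 0; i < n; i++)
                if (g[i].async_only)
                        root_weight += g[i].weight;
        return root_weight;
}

/* With A = {2, async_only} and B = {8, direct}, the flusher ends up
   serving with weight 2 while B keeps its weight of 8, as above. */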

The other missing information is the real cost when the dirtied pages
eventually hit the disk, perhaps dozens of seconds later.  For that
part I'm assuming simple dd workloads for now, and balance_dirty_pages()
currently splits the flusher's overall writeout progress into the
dirtier tasks' dirty ratelimits based on bandwidth fairness.

> This is still easy.  What about hierarchical propio?  What happens
> then?  You can't do hierarchical proportional allocation without
> knowing how much IOs are pending for which group.  How is that
> information gonna be communicated between blkcg and writeback?  Are we
> gonna have two separate hierarchical proportional IO allocators?  How
> is that gonna work at all?  If we're gonna have single allocator in
> block layer, writeback would have to feed the amount of IOs it may
> generate into the allocator, get the resulting allocation and then
> issue IO and then block layer again will have to account these to the
> originating cgroups.  It's just crazy.

No, I don't yet have an idea of how to do a hierarchical proportional
IO controller without physically splitting up the async IO streams.
It's pretty hard and I'd better break out before it drives me crazy.

So in the following discussion, let's assume cfq will move async
requests from the current root cgroup to the individual IO issuers'
cfqgs and schedule service for the async streams there, hence the need
to create "backpressure" for balance_dirty_pages() to eventually
throttle the individual dirtier tasks.

That said, I still don't think we've come up with any satisfactory
solution. It's a hard problem after all.

> > The only problem I can see now, is that balance_dirty_pages() works
> > per-bdi and blkcg works per-device. So the two ends may not match
> > nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where
> > sdb is shared by lv0 and lv1. However it should be rare situations and
> > be much more acceptable than the problems arise from the "push back"
> > approach which impacts everyone.
> 
> I don't know.  What problems?  AFAICS, the biggest issue is writeback
> of different inodes getting mixed resulting in poor performance, but
> if you think about it, that's about the frequency of switching cgroups
> and a problem which can and should be dealt with from block layer
> (e.g. use larger time slice if all the pending IOs are async).

Yeah, increasing the time slice would help that case. In general it's
not merely about the frequency of switching cgroups, if we take the
hard disk's writeback cache into account.  Think about some inodes with
async IO: A1, A2, A3, ..., and inodes with sync IO: D1, D2, D3, ...,
all from different cgroups. When the root cgroup holds all the async
inodes, cfq may schedule IO interleaved like this:

        A1,    A1,    A1,    A2,    A1,    A2,    ...
           D1,    D2,    D3,    D4,    D5,    D6, ...

Now it becomes

        A1,    A2,    A3,    A4,    A5,    A6,    ...
           D1,    D2,    D3,    D4,    D5,    D6, ...

The difference is that it now switches async inodes each time. At the
cfq level the seek costs look the same; however, the disk's writeback
cache may help merge the data chunks from the same inode A1. Well, it
may cost some latency on spinning disks. But what about SSDs? They can
run deeper queues and benefit from large writes.

> Writeback's duty is generating stream of async writes which can be
> served efficiently for the *cgroup* and keeping the buffer filled as
> necessary and chaining the backpressure from there to the actual
> dirtier.  That's what writeback does without cgroup.  Nothing
> fundamental changes with cgroup.  It's just finer grained.

Believe me, physically partitioning the dirty pages and async IO
streams comes at a big cost. It won't scale well in many ways.

For one thing, splitting the request queues will increase the number of
PG_writeback pages.  Those pages have been the biggest source of
latency issues in various parts of the system.

It's not uncommon for me to see filesystems sleep on PG_writeback
pages during heavy writeback, inside some lock or transaction, which in
turn stalls many tasks that try to do IO or merely dirty some page in
memory. Random writes are especially susceptible to such stalls. The
stable pages feature also vastly increases the chances of stalls by
locking the writeback pages.

Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
the case of direct reclaim, it means blocking random tasks that are
allocating memory in the system.

PG_writeback pages are much worse than PG_dirty pages in that they are
not movable. This makes a big difference for high-order page allocations.
To make room for a 2MB huge page, vmscan has the option to migrate
PG_dirty pages, but for PG_writeback pages it has no better choice than
to wait for IO completion.

The difficulty of THP allocation goes up *exponentially* with the
number of PG_writeback pages. Assume PG_writeback pages are randomly
distributed in the physical memory space. Then we have the formula

        P(reclaimable for THP) = (1 - P(hit PG_writeback))^512

That's the probability for a contiguous range of 512 pages (one 2MB
huge page) to be free of PG_writeback, so that it's immediately
reclaimable for use as a transparent huge page. This ruby script shows
the concrete numbers.

irb> 1.upto(10) { |i| j=i/1000.0; printf "%.3f\t\t\t%.3f\n", j, (1-j)**512 }

        P(hit PG_writeback)     P(reclaimable for THP)
        0.001                   0.599
        0.002                   0.359
        0.003                   0.215
        0.004                   0.128
        0.005                   0.077
        0.006                   0.046
        0.007                   0.027
        0.008                   0.016
        0.009                   0.010
        0.010                   0.006

The numbers show that when the PG_writeback pages go up from 0.1% to
1% of system memory, the THP reclaim success ratio drops quickly from
60% to 0.6%. It indicates that in order to use THP without constantly
running into stalls, the reasonable PG_writeback ratio is <= 0.1%.
Going beyond that threshold, it quickly becomes intolerable.
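
The same table can be reproduced with a small C program as a cross
check of the formula (a 2MB huge page spans 512 4KB pages):

#include <stdio.h>
#include <math.h>

int main(void)
{
        for (int i = 1; i <= 10; i++) {
                double p_wb  = i / 1000.0;              /* P(hit PG_writeback)    */
                double p_thp = pow(1.0 - p_wb, 512);    /* P(reclaimable for THP) */
                printf("%.3f\t%.3f\n", p_wb, p_thp);
        }
        return 0;
}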

That makes a limit of 256MB writeback pages for a mem=256GB system.
Looking at the real vmstat:nr_writeback numbers in dd write tests:

JBOD-12SSD-thresh=8G/ext4-1dd-1-3.3.0/vmstat-end:nr_writeback 217009
JBOD-12SSD-thresh=8G/ext4-10dd-1-3.3.0/vmstat-end:nr_writeback 198335
JBOD-12SSD-thresh=8G/xfs-1dd-1-3.3.0/vmstat-end:nr_writeback 306026
JBOD-12SSD-thresh=8G/xfs-10dd-1-3.3.0/vmstat-end:nr_writeback 315099
JBOD-12SSD-thresh=8G/btrfs-1dd-1-3.3.0/vmstat-end:nr_writeback 1216058
JBOD-12SSD-thresh=8G/btrfs-10dd-1-3.3.0/vmstat-end:nr_writeback 895335

Oops, btrfs has 4GB of writeback pages -- which calls for some bug fixing.
Even ext4's ~800MB still looks way too high, but that's ~1s worth of
data per queue (or 130ms worth of data for the high performance Intel
SSD, which is perhaps in danger of queue underruns?). So this system
would require 512GB memory to comfortably run KVM instances with THP
support.

Judging from the above numbers, we can hardly afford to split up the
IO queues and proliferate writeback pages.

It's worth noting that running multiple flusher threads per bdi means
not only disk seeks on spinning disks and smaller IO sizes on SSDs, but
also lock contention and cacheline bouncing for metadata-heavy
workloads and fast storage.

To give some concrete examples of how much CPU overhead can be saved
by reducing the number of IO submitters, here are some summaries of the
IO-less dirty throttling gains. Tests show that it yields huge
benefits in reducing IO seeks as well as CPU overhead.

For example, the fs_mark benchmark on a 12-drive software RAID0 goes
from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock
contention". (by Dave Chinner)

- "CPU usage has dropped by ~55%", "it certainly appears that most of
  the CPU time saving comes from the removal of contention on the
  inode_wb_list_lock"
  (IMHO at least 10% comes from the reduction of cacheline bouncing,
  because the new code is able to call much less frequently into
  balance_dirty_pages() and hence access the _global_ page states)

- the user space "App overhead" is reduced by 20%, by avoiding the
  cacheline pollution by the complex writeback code path

- "for a ~5% throughput reduction", "the number of write IOs have
  dropped by ~25%", and the elapsed time reduced from 41:42.17 to
  40:53.23.

And for simple dd tests

- "throughput for a _single_ large dd (100GB) increase from ~650MB/s
  to 700MB/s"

- "On a simple test of 100 dd, it reduces the CPU %system time from
  30% to 3%, and improves IO throughput from 38MB/s to 42MB/s."

> > > No, no, it's not about standing in my way.  As Vivek said in the other
> > > reply, it's that the "gap" that you filled was created *because*
> > > writeback wasn't cgroup aware and now you're in turn filling that gap
> > > by making writeback work around that "gap".  I mean, my mind boggles.
> > > Doesn't yours?  I strongly believe everyone's should.
> > 
> > Heh. It's a hard problem indeed. I felt great pains in the IO-less
> > dirty throttling work. I did a lot reasoning about it, and have in
> > fact kept cgroup IO controller in mind since its early days. Now I'd
> > say it's hands down for it to adapt to the gap between the total IO
> > limit and what's carried out by the block IO controller.
> 
> You're not providing any valid counter arguments about the issues
> being raised about the messed up design.  How is anything "hands down"
> here?

Yeah, sadly it turns out not to be "hands down" when it comes to the
proportional async/sync splits, and it's even prohibitive when it comes
to hierarchical support...

> > > There's where I'm confused.  How is the said split supposed to work?
> > > They aren't independent.  I mean, who gets to decide what and where
> > > are those decisions enforced?
> > 
> > Yeah it's not independent. It's about
> > 
> > - keep block IO cgroup untouched (in its current algorithm, for
> >   throttling direct IO)
> > 
> > - let balance_dirty_pages() adapt to the throttling target
> >   
> >         buffered_write_limit = total_limit - direct_IOs
> 
> Think about proportional allocation.  You don't have a number until
> you know who have pending IOs and how much.

We have the IO rate. The above formula actually works on "rates", and
that's good enough for calculating the ratelimit for buffered writes.
We don't have to know every transient state of the pending IOs, because
the direct IOs are handled by cfq based on cfqg weight, and for async
IOs there are plenty of dirty pages to buffer/tolerate small errors in
the dirty rate control.
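
A tiny worked example of that rate-based view (made-up numbers; only
the subtraction matters):

#include <stdio.h>

int main(void)
{
        double total_limit    = 100.0;  /* MB/s, the cgroup's total IO limit */
        double direct_io_rate =  30.0;  /* MB/s, measured at the block layer */
        double buffered_limit = total_limit - direct_io_rate;

        /* This is the budget balance_dirty_pages() would split among the
           cgroup's dirtier tasks as their dirty ratelimit. */
        printf("buffered write budget: %.1f MB/s\n", buffered_limit);
        return 0;
}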

> > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > It's always there doing 1:1 proportional throttling. Then you try to
> > kick in to add *double* throttling in block/cfq layer. Now the low
> > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > from its balanced state, leading to large fluctuations and program
> > stalls.
> 
> Just do the same 1:1 inside each cgroup.

Sure. But the ratio mismatch I'm talking about is inter-cgroup.
For example, suppose there are only 2 dd tasks doing buffered writes in
the system. Now consider the mismatch where cfq is dispatching their IO
requests at 10:1 weights, while balance_dirty_pages() is throttling the
dd tasks at a 1:1 equal split because it's not aware of the cgroup
weights.

What will happen in the end? The 1:1 ratio imposed by
balance_dirty_pages() will take effect and the dd tasks will progress
at the same pace. The cfq weights will be defeated because the async
queue for the second dd (and cgroup) constantly runs empty.

> >  This can be avoided by telling balance_dirty_pages(): "your
> > balance goal is no longer 1:1, but 10:1". With this information
> > balance_dirty_pages() will behave right. Then there is the question:
> > if balance_dirty_pages() will work just well provided the information,
> > why bother doing the throttling at low layer and "push back" the
> > pressure all the way up?
> 
> Because splitting a resource into two pieces arbitrarily with
> different amount of consumptions on each side and then applying the
> same proportion on both doesn't mean anything?

Sorry, I don't quite catch your words here.

> > The balance_dirty_pages() is already deeply involved in dirty throttling.
> > As you can see from this patchset, the same algorithms can be extended
> > trivially to work with cgroup IO limits.
> > 
> > buffered write IO controller in balance_dirty_pages()
> > https://lkml.org/lkml/2012/3/28/275
> 
> It is half broken thing with fundamental design flaws which can't be
> corrected without complete reimplementation.  I don't know what to
> say.

I'm fully aware of that, and so have been exploring new ways out :)

> > In the "back pressure" scheme, memcg is a must because only it has all
> > the infrastructure to track dirty pages upon which you can apply some
> > dirty_limits. Don't tell me you want to account dirty pages in blkcg...
> 
> For now, per-inode tracking seems good enough.

There are actually two directions of information passing.

1) pass the dirtier ownership down to the bio. For this part, it's
   mostly enough to do the lightweight per-inode tracking.

2) pass the backpressure up, from cfq (IO dispatch) to the flusher (IO
submit) as well as to balance_dirty_pages() (to actually throttle the
dirty tasks). The flusher naturally works at inode granularity.
However, balance_dirty_pages() is about limiting dirty pages. For this
part, it needs to know the total number of dirty pages and the writeout
bandwidth for each cgroup in order to do proper dirty throttling, and
to maintain a proper number of dirty pages to avoid the queue underrun
issue explained in the above 2-dd example.
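
As a rough illustration of direction 2, the per-cgroup state that
balance_dirty_pages() would need could look roughly like this; the
structure and field names are mine, purely illustrative, not actual
kernel code.

/*
 * Hypothetical per-cgroup writeback state for the "backpressure up"
 * direction.  Not actual kernel structures; it only illustrates what
 * balance_dirty_pages() would have to know per cgroup.
 */
struct cgroup_wb_state {
	unsigned long nr_dirty;		/* dirty pages owned by this cgroup */
	unsigned long dirty_limit;	/* this cgroup's share of the dirty threshold */
	unsigned long write_bandwidth;	/* measured writeout bandwidth, bytes/s */
	unsigned long dirty_ratelimit;	/* rate handed out to the dirtier tasks */
};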

> > What I can see is, it looks pretty simple and nature to let
> > balance_dirty_pages() fill the gap towards a total solution :-)
> > 
> > - add direct IO accounting in some convenient point of the IO path
> >   IO submission or completion point, either is fine.
> > 
> > - change several lines of the buffered write IO controller to
> >   integrate the direct IO rate into the formula to fit the "total
> >   IO" limit
> > 
> > - in future, add more accounting as well as feedback control to make
> >   balance_dirty_pages() work with IOPS and disk time
> 
> To me, you seem to be not addressing the issues I've been raising at
> all and just repeating the same points again and again.  If I'm
> misunderstanding something, please point out.

Hopefully the renewed patch can address some of your questions. It's a
pity that I didn't think about the hierarchical requirement at the
time. Otherwise the complexity of the calculations still looks manageable.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]           ` <20120417223854.GG19975-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2012-04-19 14:23             ` Fengguang Wu
@ 2012-04-19 14:23             ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-19 14:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hi Tejun,

On Tue, Apr 17, 2012 at 03:38:54PM -0700, Tejun Heo wrote:
> Hello, Fengguang.
> 
> On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote:
> > Fortunately, the above gap can be easily filled judging from the
> > block/cfq IO controller code. By adding some direct IO accounting
> > and changing several lines of my patches to make use of the collected
> > stats, the semantics of the blkio.throttle.write_bps interfaces can be
> > changed from "limit for direct IO" to "limit for direct+buffered IOs".
> > Ditto for blkio.weight and blkio.write_iops, as long as some
> > iops/device time stats are made available to balance_dirty_pages().
> > 
> > It would be a fairly *easy* change. :-) It's merely adding some
> > accounting code and there is no need to change the block IO
> > controlling algorithm at all. I'll do the work of accounting (which
> > is basically independent of the IO controlling) and use the new stats
> > in balance_dirty_pages().
> 
> I don't really understand how this can work.  For hard limits, maybe,

Yeah, hard limits are the easiest.

> but for proportional IO, you have to know which cgroups have IOs
> before assigning the proportions, so blkcg assigning IO bandwidth
> without knowing async writes simply can't work.
> 
> For example, let's say cgroups A and B have 2:8 split.  If A has IOs
> on queue and B doesn't, blkcg will assign all IO bandwidth to A.  I
> can't wrap my head around how writeback is gonna make use of the
> resulting stats but let's say it decides it needs to put out some IOs
> out for both cgroups.  What happens then?  Do all the async writes go
> through the root cgroup controlled by and affecting the ratio between
> rootcg and cgroup A and B?  Or do they have to be accounted as part of
> cgroups A and B?  If so, what if the added bandwidth goes over the
> limit?  Let's say if we implement overcharge; then, I suppose we'll
> have to communicate that upwards too, right?

The trick is to do the throttling for buffered writes at page dirty
time, when balance_dirty_pages() knows exactly what cgroup the dirtier
task belongs to, the dirty rate and whether or not it's an aggressive
dirtier. The cgroup's direct IO rate can also be measured. The only
missing information is whether it's a non-aggressive direct writer
(only cfq may know about that). Now I'm simply assuming direct writers
are all aggressive.

So if A and B have a 2:8 split and A only submits async IO and B only
submits direct IO, there will be no cfqg for A at all; cfq will serve
B and the root cgroup in an interleaved fashion. In the patch I just
posted, blkcg_update_dirty_ratelimit() will transfer A's weight 2 to
the root cgroup for use by the flusher. In the end the flusher gets
weight 2 and B gets weight 8. Here we need to distinguish between the
weight assigned by the user and the weight after the async/sync
adjustment.
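
A toy sketch of that weight transfer follows; the real logic lives in
blkcg_update_dirty_ratelimit() in the posted patch, while the structure
and function below are simplified assumptions of mine.

/*
 * Simplified illustration of the async/sync weight adjustment: a
 * cgroup that issues only buffered (async) writes has no cfqg of its
 * own, so its weight is lent to the root cgroup, on whose behalf the
 * flusher submits the IO.  If the root cgroup's own weight is taken
 * as 0 for simplicity, A (weight 2, async only) ends up contributing
 * 2 to the flusher while B keeps 8, matching the example above.
 * Not the patch's actual code.
 */
struct wb_cgroup {
	unsigned int weight;		/* weight configured by the user */
	unsigned int eff_weight;	/* weight after the async/sync adjustment */
	int has_sync_io;		/* drives its own cfqg? */
};

static void adjust_async_weights(struct wb_cgroup *root,
				 struct wb_cgroup *cgs, int nr)
{
	int i;

	root->eff_weight = root->weight;
	for (i = 0; i < nr; i++) {
		if (cgs[i].has_sync_io) {
			cgs[i].eff_weight = cgs[i].weight;
		} else {
			/* async-only: the flusher writes back on its behalf */
			cgs[i].eff_weight = 0;
			root->eff_weight += cgs[i].weight;
		}
	}
}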

The other missing information is the real cost when the dirtied pages
eventually hit the disk, perhaps dozens of seconds later.  For that
part I'm assuming simple dd workloads for now, and balance_dirty_pages()
currently splits the flusher's overall writeout progress among the
dirtier tasks' dirty ratelimits based on bandwidth fairness.

> This is still easy.  What about hierarchical propio?  What happens
> then?  You can't do hierarchical proportional allocation without
> knowing how much IOs are pending for which group.  How is that
> information gonna be communicated between blkcg and writeback?  Are we
> gonna have two separate hierarchical proportional IO allocators?  How
> is that gonna work at all?  If we're gonna have single allocator in
> block layer, writeback would have to feed the amount of IOs it may
> generate into the allocator, get the resulting allocation and then
> issue IO and then block layer again will have to account these to the
> originating cgroups.  It's just crazy.

No, I have no idea how to do a hierarchical proportional IO controller
without physically splitting up the async IO streams. It's pretty hard
and I'd better step back before it drives me crazy.

So in the following discussion, let's assume cfq will move async
requests from the current root cgroup to the individual IO issuers'
cfqgs and schedule service for the async streams there, hence the need
to create "backpressure" for balance_dirty_pages() to eventually
throttle the individual dirtier tasks.

That said, I still don't think we've come up with any satisfactory
solution. It's a hard problem after all.

> > The only problem I can see now, is that balance_dirty_pages() works
> > per-bdi and blkcg works per-device. So the two ends may not match
> > nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where
> > sdb is shared by lv0 and lv1. However it should be rare situations and
> > be much more acceptable than the problems arise from the "push back"
> > approach which impacts everyone.
> 
> I don't know.  What problems?  AFAICS, the biggest issue is writeback
> of different inodes getting mixed resulting in poor performance, but
> if you think about it, that's about the frequency of switching cgroups
> and a problem which can and should be dealt with from block layer
> (e.g. use larger time slice if all the pending IOs are async).

Yeah, increasing the time slice would help that case. In general it's
not merely the frequency of switching cgroups, if we take the hard
disk's writeback cache into account.  Think about some inodes with
async IO: A1, A2, A3, ..., and inodes with sync IO: D1, D2, D3, ...,
all from different cgroups. When the root cgroup holds all the async
inodes, cfq may schedule IO in an interleaved fashion like this:

        A1,    A1,    A1,    A2,    A1,    A2,    ...
           D1,    D2,    D3,    D4,    D5,    D6, ...

Now it becomes

        A1,    A2,    A3,    A4,    A5,    A6,    ...
           D1,    D2,    D3,    D4,    D5,    D6, ...

The difference is that it now switches async inodes every time.
At the cfq level the seek costs look the same; however, the disk's
writeback cache may help merge the data chunks from the same inode A1.
Well, that may cost some latency on spinning disks. But how about SSDs?
They can run a deeper queue and benefit from large writes.

> Writeback's duty is generating stream of async writes which can be
> served efficiently for the *cgroup* and keeping the buffer filled as
> necessary and chaining the backpressure from there to the actual
> dirtier.  That's what writeback does without cgroup.  Nothing
> fundamental changes with cgroup.  It's just finer grained.

Believe me, physically partitioning the dirty pages and async IO
streams comes at a big cost. It won't scale well in many ways.

For one thing, splitting the request queues will increase the number of
PG_writeback pages.  Those pages have been the biggest source of
latency issues in various parts of the system.

It's not uncommon for me to see filesystems sleep on PG_writeback
pages during heavy writeback, within some lock or transaction, which in
turn stalls many tasks that try to do IO or merely dirty some page in
memory. Random writes are especially susceptible to such stalls. The
stable pages feature also vastly increases the chances of stalls by
locking the writeback pages.

Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
the case of direct reclaim, it means blocking random tasks that are
allocating memory in the system.

PG_writeback pages are much worse than PG_dirty pages in that they are
not movable. This makes a big difference for high-order page allocations.
To make room for a 2MB huge page, vmscan has the option to migrate
PG_dirty pages, but for PG_writeback pages it has no better choice than
to wait for IO completion.

The difficulty of THP allocation goes up *exponentially* with the
number of PG_writeback pages. Assume PG_writeback pages are randomly
distributed in the physical memory space. Then we have the formula

        P(reclaimable for THP) = (1 - P(hit PG_writeback))^512

That's the probability for a contiguous range of 512 pages (one 2MB
huge page worth of 4KB pages) to be free of PG_writeback, so that it's
immediately reclaimable for use by a transparent huge page. This ruby
one-liner shows the concrete numbers.

irb> 1.upto(10) { |i| j=i/1000.0; printf "%.3f\t\t\t%.3f\n", j, (1-j)**512 }

        P(hit PG_writeback)     P(reclaimable for THP)
        0.001                   0.599
        0.002                   0.359
        0.003                   0.215
        0.004                   0.128
        0.005                   0.077
        0.006                   0.046
        0.007                   0.027
        0.008                   0.016
        0.009                   0.010
        0.010                   0.006
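
The same computation in C (compile with -lm), with the 2MB / 4KB =
512-page exponent spelled out; it is equivalent to the Ruby one-liner
above and reproduces the table.

#include <stdio.h>
#include <math.h>

/* P(reclaimable for THP) = (1 - P(hit PG_writeback))^512,
 * where 512 = 2MB huge page / 4KB base page. */
int main(void)
{
	printf("P(hit PG_writeback)\tP(reclaimable for THP)\n");
	for (int i = 1; i <= 10; i++) {
		double p = i / 1000.0;
		printf("%.3f\t\t\t%.3f\n", p, pow(1.0 - p, 512));
	}
	return 0;
}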

The numbers show that when the PG_writeback pages go up from 0.1% to
1% of system memory, the THP reclaim success ratio drops quickly from
60% to 0.6%. It indicates that in order to use THP without constantly
running into stalls, the reasonable PG_writeback ratio is <= 0.1%.
Going beyond that threshold, it quickly becomes intolerable.

That makes a limit of 256MB writeback pages for a mem=256GB system.
Looking at the real vmstat:nr_writeback numbers in dd write tests:

JBOD-12SSD-thresh=8G/ext4-1dd-1-3.3.0/vmstat-end:nr_writeback 217009
JBOD-12SSD-thresh=8G/ext4-10dd-1-3.3.0/vmstat-end:nr_writeback 198335
JBOD-12SSD-thresh=8G/xfs-1dd-1-3.3.0/vmstat-end:nr_writeback 306026
JBOD-12SSD-thresh=8G/xfs-10dd-1-3.3.0/vmstat-end:nr_writeback 315099
JBOD-12SSD-thresh=8G/btrfs-1dd-1-3.3.0/vmstat-end:nr_writeback 1216058
JBOD-12SSD-thresh=8G/btrfs-10dd-1-3.3.0/vmstat-end:nr_writeback 895335

Oops, btrfs has ~4GB of writeback pages -- which calls for some bug
fixing. Even ext4's ~800MB still looks way too high, but that's ~1s
worth of data per queue (or 130ms worth for the high-performance Intel
SSD, which is perhaps in danger of queue underruns?). So this system
would require 512GB of memory to comfortably run KVM instances with
THP support.

Judging from the above numbers, we can hardly afford to split up the
IO queues and proliferate writeback pages.

It's worth noting that running multiple flusher threads per bdi means
not only more disk seeks on spinning disks and smaller IO sizes on
SSDs, but also lock contention and cacheline bouncing for metadata-heavy
workloads and fast storage.

To give some concrete examples of how much CPU overhead can be saved
by reducing the number of IO submitters, here are some summaries of the
IO-less dirty throttling gains. Tests show that it yields huge benefits
in reducing both IO seeks and CPU overhead.

For example, the fs_mark benchmark on a 12-drive software RAID0 goes
from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock
contention". (by Dave Chinner)

- "CPU usage has dropped by ~55%", "it certainly appears that most of
  the CPU time saving comes from the removal of contention on the
  inode_wb_list_lock"
  (IMHO at least 10% comes from the reduction of cacheline bouncing,
  because the new code is able to call much less frequently into
  balance_dirty_pages() and hence access the _global_ page states)

- the user space "App overhead" is reduced by 20%, by avoiding the
  cacheline pollution by the complex writeback code path

- "for a ~5% throughput reduction", "the number of write IOs have
  dropped by ~25%", and the elapsed time reduced from 41:42.17 to
  40:53.23.

And for simple dd tests

- "throughput for a _single_ large dd (100GB) increase from ~650MB/s
  to 700MB/s"

- "On a simple test of 100 dd, it reduces the CPU %system time from
  30% to 3%, and improves IO throughput from 38MB/s to 42MB/s."

> > > No, no, it's not about standing in my way.  As Vivek said in the other
> > > reply, it's that the "gap" that you filled was created *because*
> > > writeback wasn't cgroup aware and now you're in turn filling that gap
> > > by making writeback work around that "gap".  I mean, my mind boggles.
> > > Doesn't yours?  I strongly believe everyone's should.
> > 
> > Heh. It's a hard problem indeed. I felt great pains in the IO-less
> > dirty throttling work. I did a lot reasoning about it, and have in
> > fact kept cgroup IO controller in mind since its early days. Now I'd
> > say it's hands down for it to adapt to the gap between the total IO
> > limit and what's carried out by the block IO controller.
> 
> You're not providing any valid counter arguments about the issues
> being raised about the messed up design.  How is anything "hands down"
> here?

Yeah, sadly it turns out not to be "hands down" when it comes to the
proportional async/sync splits, and it becomes prohibitive when it
comes to the hierarchical support...

> > > There's where I'm confused.  How is the said split supposed to work?
> > > They aren't independent.  I mean, who gets to decide what and where
> > > are those decisions enforced?
> > 
> > Yeah it's not independent. It's about
> > 
> > - keep block IO cgroup untouched (in its current algorithm, for
> >   throttling direct IO)
> > 
> > - let balance_dirty_pages() adapt to the throttling target
> >   
> >         buffered_write_limit = total_limit - direct_IOs
> 
> Think about proportional allocation.  You don't have a number until
> you know who have pending IOs and how much.

We have the IO rate. The above formula actually works on "rates",
which is good enough for calculating the ratelimit for buffered writes.
We don't have to know every transient state of the pending IOs,
because the direct IOs are handled by cfq based on cfqg weight, and
for async IOs there are plenty of dirty pages for buffering/tolerating
small errors in the dirty rate control.

> > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > It's always there doing 1:1 proportional throttling. Then you try to
> > kick in to add *double* throttling in block/cfq layer. Now the low
> > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > from its balanced state, leading to large fluctuations and program
> > stalls.
> 
> Just do the same 1:1 inside each cgroup.

Sure. But the ratio mismatch I'm talking about is inter-cgroup.
For example there are only 2 dd tasks doing buffered writes in the
system. Now consider the mismatch that cfq is dispatching their IO
requests at 10:1 weights, while balance_dirty_pages() is throttling
the dd tasks at 1:1 equal split because it's not aware of the cgroup
weights.

What will happen in the end? The 1:1 ratio imposed by
balance_dirty_pages() will take effect and the dd tasks will progress
at the same pace. The cfq weights will be defeated because the async
queue for the second dd (and cgroup) constantly runs empty.

> >  This can be avoided by telling balance_dirty_pages(): "your
> > balance goal is no longer 1:1, but 10:1". With this information
> > balance_dirty_pages() will behave right. Then there is the question:
> > if balance_dirty_pages() will work just well provided the information,
> > why bother doing the throttling at low layer and "push back" the
> > pressure all the way up?
> 
> Because splitting a resource into two pieces arbitrarily with
> different amount of consumptions on each side and then applying the
> same proportion on both doesn't mean anything?

Sorry, I don't quite catch your words here.

> > The balance_dirty_pages() is already deeply involved in dirty throttling.
> > As you can see from this patchset, the same algorithms can be extended
> > trivially to work with cgroup IO limits.
> > 
> > buffered write IO controller in balance_dirty_pages()
> > https://lkml.org/lkml/2012/3/28/275
> 
> It is half broken thing with fundamental design flaws which can't be
> corrected without complete reimplementation.  I don't know what to
> say.

I'm fully aware of that, and so have been exploring new ways out :)

> > In the "back pressure" scheme, memcg is a must because only it has all
> > the infrastructure to track dirty pages upon which you can apply some
> > dirty_limits. Don't tell me you want to account dirty pages in blkcg...
> 
> For now, per-inode tracking seems good enough.

There are actually two directions of information passing.

1) pass the dirtier ownership down to the bio. For this part, it's
   mostly enough to do the lightweight per-inode tracking.

2) pass the backpressure up, from cfq (IO dispatch) to the flusher (IO
submit) as well as to balance_dirty_pages() (to actually throttle the
dirty tasks). The flusher naturally works at inode granularity.
However, balance_dirty_pages() is about limiting dirty pages. For this
part, it needs to know the total number of dirty pages and the writeout
bandwidth for each cgroup in order to do proper dirty throttling, and
to maintain a proper number of dirty pages to avoid the queue underrun
issue explained in the above 2-dd example.

> > What I can see is, it looks pretty simple and nature to let
> > balance_dirty_pages() fill the gap towards a total solution :-)
> > 
> > - add direct IO accounting in some convenient point of the IO path
> >   IO submission or completion point, either is fine.
> > 
> > - change several lines of the buffered write IO controller to
> >   integrate the direct IO rate into the formula to fit the "total
> >   IO" limit
> > 
> > - in future, add more accounting as well as feedback control to make
> >   balance_dirty_pages() work with IOPS and disk time
> 
> To me, you seem to be not addressing the issues I've been raising at
> all and just repeating the same points again and again.  If I'm
> misunderstanding something, please point out.

Hopefully the renewed patch can address some of your questions. It's a
pity that I didn't think about the hierarchical requirement at the
time. Otherwise the complexity of the calculations still looks manageable.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-19 14:23             ` Fengguang Wu
                               ` (2 preceding siblings ...)
  (?)
@ 2012-04-19 18:31             ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-19 18:31 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers,
	linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups,
	Tejun Heo, linux-fsdevel

On Thu, Apr 19, 2012 at 10:23:43PM +0800, Fengguang Wu wrote:

Hi Fengguang,

[..]
> > I don't know.  What problems?  AFAICS, the biggest issue is writeback
> > of different inodes getting mixed resulting in poor performance, but
> > if you think about it, that's about the frequency of switching cgroups
> > and a problem which can and should be dealt with from block layer
> > (e.g. use larger time slice if all the pending IOs are async).
> 
> Yeah increasing time slice would help that case. In general it's not
> merely the frequency of switching cgroup if take hard disk' writeback
> cache into account.  Think about some inodes with async IO: A1, A2,
> A3, .., and inodes with sync IO: D1, D2, D3, ..., all from different
> cgroups. So when the root cgroup holds all async inodes, the cfq may
> schedule IO interleavely like this
> 
>         A1,    A1,    A1,    A2,    A1,    A2,    ...
>            D1,    D2,    D3,    D4,    D5,    D6, ...
> 
> Now it becomes
> 
>         A1,    A2,    A3,    A4,    A5,    A6,    ...
>            D1,    D2,    D3,    D4,    D5,    D6, ...
> 
> The difference is that it's now switching the async inodes each time.
> At cfq level, the seek costs look the same, however the disk's
> writeback cache may help merge the data chunks from the same inode A1.
> Well, it may cost some latency for spin disks. But how about SSD? It
> can run deeper queue and benefit from large writes.

Not sure what the point is here. Many things seem to be mixed up.

If we start putting async queues in separate groups (in an attempt to
provide fairness/service differentiation), then how much IO we dispatch
from one async inode will directly depend on the slice time of that
cgroup/queue. So if you want longer dispatch from the same async inode,
increasing the slice time will help.

Also, the elevator merge logic increases the size of async IO requests
anyway, and big requests are submitted to the device.

If you are expecting that in every dispatch cycle we continue to
dispatch requests from the same inode, no, that's not possible. Too
long a slice in the presence of sync IO is not good either. So if you
are looking for high throughput and are willing to sacrifice fairness,
you can switch to the mode where all async queues are put in the single
root group. (Note: you will have to do a reasonably fast switch between
cgroups so that all the cgroups are able to do some writeout in a time
window.)

Writeback logic also submits a certain amount of writes from one inode
and then switches to the next inode in an attempt to provide fairness.
The same thing should be directly controllable by CFQ's notion of time
slice, that is, continue to dispatch async IO from a cgroup/inode for
an extended duration before switching. So what's the difference? One
can achieve equivalent behavior at either layer (writeback/CFQ).

> 
> > Writeback's duty is generating stream of async writes which can be
> > served efficiently for the *cgroup* and keeping the buffer filled as
> > necessary and chaining the backpressure from there to the actual
> > dirtier.  That's what writeback does without cgroup.  Nothing
> > fundamental changes with cgroup.  It's just finer grained.
> 
> Believe me, physically partitioning the dirty pages and async IO
> streams comes at big costs. It won't scale well in many ways.
> 
> For one instance, splitting the request queues will give rise to
> PG_writeback pages.  Those pages have been the biggest source of
> latency issues in the various parts of the system.

So PG_writeback pages are ones which have been submitted for IO? Even
now we generate PG_writeback pages across multiple inodes as we submit
those pages for IO. By keeping the number of request descriptors per
group low, we can build back pressure early and hence per inode/group
we will not have too many PG_writeback pages. IOW, the number of
PG_writeback pages will be controllable by the number of request
descriptors. So how does the situation become worse with CFQ putting
them in separate cgroups?
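
As a back-of-the-envelope illustration of that bound (the request
counts and request size below are invented, not kernel defaults):

#include <stdio.h>

/*
 * Rough illustration: in-flight PG_writeback pages are bounded by
 * (groups) x (request descriptors per group) x (pages per request),
 * so the total stays controllable via the per-group descriptor budget.
 */
int main(void)
{
	const long pages_per_request = 256;	/* ~1MB requests, 4k pages */

	long global_nr_requests = 128;		/* one shared async queue */
	printf("global queue   : <= %ld in-flight writeback pages\n",
	       global_nr_requests * pages_per_request);

	long groups = 10, per_group_nr_requests = 16;
	printf("10 small queues: <= %ld in-flight writeback pages\n",
	       groups * per_group_nr_requests * pages_per_request);
	return 0;
}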

> It's worth to note that running multiple flusher threads per bdi means
> not only disk seeks for spin disks, smaller IO size for SSD, but also
> lock contentions and cache bouncing for metadata heavy workloads and
> fast storage.

But we could still have a single flusher per bdi and just check the
write congestion state of each group and back off if it is congested.

So a single thread will still be doing IO submission. Just that it will
submit IO from multiple inodes/cgroups, which can cause additional
seeks. And that's the tradeoff of fairness. What I am not able to
understand is how you are avoiding this tradeoff by implementing things
in the writeback layer. To achieve more fairness among groups, even a
flusher thread will have to switch faster among cgroups/inodes.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-19 14:23             ` Fengguang Wu
@ 2012-04-19 20:26               ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-19 20:26 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers,
	linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups,
	Tejun Heo, linux-fsdevel, vgoyal

On Thu 19-04-12 22:23:43, Wu Fengguang wrote:
> For one instance, splitting the request queues will give rise to
> PG_writeback pages.  Those pages have been the biggest source of
> latency issues in the various parts of the system.
  Well, if we allow more requests to be in flight in total, then yes,
the number of PG_writeback pages can be higher as well.

> It's not uncommon for me to see filesystems sleep on PG_writeback
> pages during heavy writeback, within some lock or transaction, which in
> turn stall many tasks that try to do IO or merely dirty some page in
> memory. Random writes are especially susceptible to such stalls. The
> stable page feature also vastly increase the chances of stalls by
> locking the writeback pages. 
> 
> Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
> the case of direct reclaim, it means blocking random tasks that are
> allocating memory in the system.
> 
> PG_writeback pages are much worse than PG_dirty pages in that they are
> not movable. This makes a big difference for high-order page allocations.
> To make room for a 2MB huge page, vmscan has the option to migrate
> PG_dirty pages, but for PG_writeback it has no better choices than to
> wait for IO completion.
> 
> The difficulty of THP allocation goes up *exponentially* with the
> number of PG_writeback pages. Assume PG_writeback pages are randomly
> distributed in the physical memory space. Then we have formula
> 
>         P(reclaimable for THP) = 1 - P(hit PG_writeback)^256
  Well, this implicitly assumes that PG_writeback pages are scattered
across memory uniformly at random. I'm not sure to what extent that is
true... Also, as a nitpick, this isn't really exponential growth since
the exponent is fixed (256 - actually it should be 512, right?). It's
just a polynomial with a big exponent. But sure, growth in the number
of PG_writeback pages will cause a relatively steep drop in the number
of available huge pages.
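
To put some numbers on that "steep drop", here is a small standalone
program (pure illustration, and it assumes exactly the uniform random
scattering questioned above; build with -lm):

#include <stdio.h>
#include <math.h>

/*
 * Probability that a 2MB region (512 x 4k pages) contains no
 * PG_writeback page, assuming such pages are scattered uniformly
 * at random with density p.
 */
int main(void)
{
	const double densities[] = { 0.0001, 0.001, 0.005, 0.01 };

	for (int i = 0; i < 4; i++) {
		double p = densities[i];

		printf("p = %.4f -> P(2MB region reclaimable) = %.4f"
		       "  (~ exp(-512p) = %.4f)\n",
		       p, pow(1.0 - p, 512), exp(-512.0 * p));
	}
	return 0;
}

Even at a 1% PG_writeback density the chance of finding a
writeback-free 2MB region is already below 1% under this assumption.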

...
> It's worth to note that running multiple flusher threads per bdi means
> not only disk seeks for spin disks, smaller IO size for SSD, but also
> lock contentions and cache bouncing for metadata heavy workloads and
> fast storage.
  Well, this heavily depends on the particular implementation (and the
chosen data structures). But yes, we should keep that in mind.

...
> > > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > > It's always there doing 1:1 proportional throttling. Then you try to
> > > kick in to add *double* throttling in block/cfq layer. Now the low
> > > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > > from its balanced state, leading to large fluctuations and program
> > > stalls.
> > 
> > Just do the same 1:1 inside each cgroup.
> 
> Sure. But the ratio mismatch I'm talking about is inter-cgroup.
> For example there are only 2 dd tasks doing buffered writes in the
> system. Now consider the mismatch that cfq is dispatching their IO
> requests at 10:1 weights, while balance_dirty_pages() is throttling
> the dd tasks at 1:1 equal split because it's not aware of the cgroup
> weights.
> 
> What will happen in the end? The 1:1 ratio imposed by
> balance_dirty_pages() will take effect and the dd tasks will progress
> at the same pace. The cfq weights will be defeated because the async
> queue for the second dd (and cgroup) constantly runs empty.
  Yup. This just shows that you have to have per-cgroup dirty limits. Once
you have those, things start working again.
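
For illustration, a minimal userspace sketch of one possible way to
derive such per-cgroup dirty limits for the 2-dd example (the
proportional split and all numbers are invented, not an existing
kernel interface):

#include <stdio.h>

/*
 * Divide the global dirty limit according to the blkcg weights, so
 * the cgroup that cfq serves at a 10:1 ratio also gets a 10:1 share
 * of the dirty page budget and each dd is throttled against its own
 * cgroup's writeout progress instead of a global 1:1 split.
 */
int main(void)
{
	long global_dirty_limit = 100000;	/* pages */
	long weight[2] = { 1000, 100 };		/* 10:1 cfq weights */
	long total = weight[0] + weight[1];

	for (int i = 0; i < 2; i++)
		printf("dd%d: weight %4ld -> dirty limit %ld pages\n",
		       i + 1, weight[i],
		       global_dirty_limit * weight[i] / total);
	return 0;
}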

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-19 18:31               ` Vivek Goyal
@ 2012-04-20 12:45                 ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-20 12:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott, Jan Kara, rni, andrea, containers,
	linux-kernel, sjayaraman, lsf, linux-mm, jmoyer, cgroups,
	Tejun Heo, linux-fsdevel

Hi Vivek,

On Thu, Apr 19, 2012 at 02:31:18PM -0400, Vivek Goyal wrote:
> On Thu, Apr 19, 2012 at 10:23:43PM +0800, Fengguang Wu wrote:
> 
> Hi Fengguang,
> 
> [..]
> > > I don't know.  What problems?  AFAICS, the biggest issue is writeback
> > > of different inodes getting mixed resulting in poor performance, but
> > > if you think about it, that's about the frequency of switching cgroups
> > > and a problem which can and should be dealt with from block layer
> > > (e.g. use larger time slice if all the pending IOs are async).
> > 
> > Yeah increasing time slice would help that case. In general it's not
> > merely the frequency of switching cgroup if take hard disk' writeback
> > cache into account.  Think about some inodes with async IO: A1, A2,
> > A3, .., and inodes with sync IO: D1, D2, D3, ..., all from different
> > cgroups. So when the root cgroup holds all async inodes, the cfq may
> > schedule IO interleavely like this
> > 
> >         A1,    A1,    A1,    A2,    A1,    A2,    ...
> >            D1,    D2,    D3,    D4,    D5,    D6, ...
> > 
> > Now it becomes
> > 
> >         A1,    A2,    A3,    A4,    A5,    A6,    ...
> >            D1,    D2,    D3,    D4,    D5,    D6, ...
> > 
> > The difference is that it's now switching the async inodes each time.
> > At cfq level, the seek costs look the same, however the disk's
> > writeback cache may help merge the data chunks from the same inode A1.
> > Well, it may cost some latency for spin disks. But how about SSD? It
> > can run deeper queue and benefit from large writes.
> 
> Not sure what's the point here. Many things seem to be mixed up.
> 
> If we start putting async queues in separate groups (in an attempt to
> provide fairness/service differentiation), then how much IO we dispatch
> from one async inode will directly depend on slice time of that
> cgroup/queue. So if you want longer dispatch from same async inode
> increasing slice time will help.

Right. The problem is that the async slice time can hardly be increased
when there is sync IO, as you said below.

> Also elevator merge logic anyway increses the size of async IO requests
> and big requests are submitted to device.
> 
> If you are looking that in every dispatch cycle we continue to dispatch
> request from same inode, yes that's not possible. Too huge a slice length
> in presence of sync IO is also not good. So if you are looking for
> high throughput and sacrificing fairness then you can switch to mode
> where all async queues are put in single root group. (Note: you will have
> to do reasonably fast switch between cgroups so that all the cgroups are
> able to do some writeout in a time window).

Agreed.

> Writeback logic also submits a certain amount of writes from one inode
> and then switches to next inode in an attempt to provide fairness. Same
> thing should be directly controllable by CFQ's notion of time slice. That
> is continue to dispatch async IO from a cgroup/inode for extended durtaion
> before switching. So what's the difference. One can achieve equivalent
> behavior at any layer (writeback/CFQ).

The difference is that the flusher's slice time is 500ms, while cfq's
async slice time is 40ms. In the single async queue case, cfq will
switch back to serve the remaining data from the same inode; in the
split async queues case, cfq will switch to the other inodes. This
makes the flusher's larger slice time somewhat "useless".

> > > Writeback's duty is generating stream of async writes which can be
> > > served efficiently for the *cgroup* and keeping the buffer filled as
> > > necessary and chaining the backpressure from there to the actual
> > > dirtier.  That's what writeback does without cgroup.  Nothing
> > > fundamental changes with cgroup.  It's just finer grained.
> > 
> > Believe me, physically partitioning the dirty pages and async IO
> > streams comes at big costs. It won't scale well in many ways.
> > 
> > For one instance, splitting the request queues will give rise to
> > PG_writeback pages.  Those pages have been the biggest source of
> > latency issues in the various parts of the system.
> 
> So PG_writeback pages are one which have been submitted for IO? So even

Yes.

> now we generate PG_writeback pages across multiple inodes as we submit
> those pages for IO. By keeping the number of request descriptor per
> group low, we can build back pressure early and hence per inode/group
> we will not have too many PG_Writeback pages. IOW, number of PG_Writeback
> pages will be controllable by number of request descriptros.

> So how does situation becomes worse in case of CFQ putting them in
> separate cgroups?

Good question.

Imagine there are 10 dds (each in its own cgroup) dirtying pages, and
the flusher thread is issuing IO for them in round robin fashion,
issuing 500ms worth of data for each inode before going on to the next.

And imagine we keep a minimal global async queue size, which is just
enough to hold the 500ms of data from one inode. If it can be reduced
to 40ms without leading to underrun or hurting in other ways, then
great. Even if the queue size is much smaller than the flusher's write
chunk size, the disk will still be serving inodes at 500ms granularity,
because the flusher won't feed cfq any other data during that time.

Now consider moving to 10 async queues, each in one cfq group. Now
each inode will need to have at least 40ms of data queued, so that when
a new cfq async slice comes, it can get enough data to work with.

Adding it up, (40ms per queue * 10 queues) = 400ms. That is, an amount
of queued data that was more than enough in the global async queue
scheme is now only barely enough to avoid queue underrun. This creates
a fundamental need to increase the total queued requests and hence the
PG_writeback pages.

To avoid seeks we might do tricks to let cfq return to the same group
serving the same async queue and repeat that 500ms/40ms times.
However, the cfq vdisktime/weight system in general doesn't work that
way. Once cgroup A gets served, its vdisktime will be increased and
naturally some other cgroup's async queue gets selected. And it's
hardly feasible to increase the async slice time to 500ms.

Overall, the split async queues in cfq will defeat the flusher's
attempt to amortize IO, because the cfq groups are now walking through
the inodes at a much more "fine grained" granularity: 40ms vs 500ms.
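
To put the above arithmetic in one place, a small standalone program
(the 500ms/40ms figures are the ones discussed in this thread; the
100MB/s bandwidth is an invented round number):

#include <stdio.h>

/* Illustrative arithmetic for the 10-dd example above. */
int main(void)
{
	const double bw_mb_s = 100.0;		/* disk writeout bandwidth */
	const double page_mb = 4.0 / 1024.0;	/* 4k page in MB */
	const int groups = 10;

	/* Global async queue: one inode's 500ms write chunk queued is
	 * already more than enough to keep the disk busy. */
	double global_ms = 500.0;
	printf("global queue : %6.0f pages (%3.0fms of data), "
	       "more than enough\n",
	       bw_mb_s * global_ms / 1000.0 / page_mb, global_ms);

	/* 10 per-cgroup queues: each must hold at least one 40ms slice
	 * worth of data or it underruns when its slice comes around. */
	double per_group_ms = 40.0;
	printf("10 cfq groups: %6.0f pages (%3.0fms of data), "
	       "barely enough\n",
	       bw_mb_s * per_group_ms * groups / 1000.0 / page_mb,
	       per_group_ms * groups);
	return 0;
}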

> > It's worth to note that running multiple flusher threads per bdi means
> > not only disk seeks for spin disks, smaller IO size for SSD, but also
> > lock contentions and cache bouncing for metadata heavy workloads and
> > fast storage.
> 
> But we could still have single flusher per bdi and just check the
> write congestion state of each group and back off if it is congested.
> 
> So single thread will still be doing IO submission. Just that it will
> submit IO from multiple inodes/cgroup which can cause additional seeks.

Yes, we still have the good option to run one single flusher, except
that its writeback chunk size should be reduced to match the 40ms
async slice time and queue size mentioned above.

So yes, running one single flusher will help reduce contention, but it
cannot help avoid the smaller IO size.

> And that's the tradeoff of fairness. What I am not able to understand
> is that how are you avoiding this tradeoff by implementing things in
> writeback layer. To achieve more fairness among groups, even a flusher
> thread will have to switch faster among cgroups/inodes.
 
Fairness is only a problem for the cfq groups. cfq by nature works at
sub-100ms granularities and switches between groups at that frequency.
If it gives each cgroup 500ms and there are 10 cgroups, latency will
become uncontrollable.

If we still keep the global async queue, cfq can run small 40ms slices
without defeating the flusher's 500ms granularity. After each slice it
can freely switch to other cgroups with sync IO, so it is free from
latency issues. After returning, it will continue to serve the same
inode. It will basically be working on behalf of one cgroup for 500ms
worth of data, then working for another cgroup for 500ms worth of
data, and so on. That behavior does not impact fairness, because it's
still using small slices and its weight is computed system wide, thus
exhibiting a kind of smoothing/amortizing effect over a long period of
time. It can naturally serve the same inode after returning.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-19 20:26               ` Jan Kara
@ 2012-04-20 13:34                 ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-20 13:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott, rni, andrea, containers, linux-kernel,
	sjayaraman, lsf, linux-mm, jmoyer, cgroups, Tejun Heo,
	linux-fsdevel, vgoyal, Mel Gorman

On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote:
> On Thu 19-04-12 22:23:43, Wu Fengguang wrote:
> > For one instance, splitting the request queues will give rise to
> > PG_writeback pages.  Those pages have been the biggest source of
> > latency issues in the various parts of the system.
>   Well, if we allow more requests to be in flight in total then yes, number
> of PG_Writeback pages can be higher as well.

Exactly.  

> > It's not uncommon for me to see filesystems sleep on PG_writeback
> > pages during heavy writeback, within some lock or transaction, which in
> > turn stall many tasks that try to do IO or merely dirty some page in
> > memory. Random writes are especially susceptible to such stalls. The
> > stable page feature also vastly increase the chances of stalls by
> > locking the writeback pages. 
> > 
> > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
> > the case of direct reclaim, it means blocking random tasks that are
> > allocating memory in the system.
> > 
> > PG_writeback pages are much worse than PG_dirty pages in that they are
> > not movable. This makes a big difference for high-order page allocations.
> > To make room for a 2MB huge page, vmscan has the option to migrate
> > PG_dirty pages, but for PG_writeback it has no better choices than to
> > wait for IO completion.
> > 
> > The difficulty of THP allocation goes up *exponentially* with the
> > number of PG_writeback pages. Assume PG_writeback pages are randomly
> > distributed in the physical memory space. Then we have formula
> > 
> >         P(reclaimable for THP) = 1 - P(hit PG_writeback)^256
>   Well, this implicitely assumes that PG_Writeback pages are scattered
> across memory uniformly at random. I'm not sure to which extent this is
> true...

Yeah, when describing the problem I was also thinking about the
possibilities of optimization (it would be a very good general
improvement). Or maybe Mel already has some solutions :)

> Also as a nitpick, this isn't really an exponential growth since
> the exponent is fixed (256 - actually it should be 512, right?). It's just

Right, 512 4k pages to form one x86_64 2MB huge page.

> a polynomial with a big exponent. But sure, growth in number of PG_Writeback
> pages will cause relatively steep drop in the number of available huge
> pages.

It's exponential indeed: "1 - p(x)" here is the probability of *not*
hitting a PG_writeback page, and raising it to the 512th power makes
the result decay roughly like e^(-512*x). So a 10x increase in the
writeback fraction x results in about a 100x drop of y.
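
To put concrete numbers on it (my own arithmetic, assuming as above
that the PG_writeback pages are scattered independently at random):

    P(reclaimable) = (1 - p)^512 ~= e^(-512*p)

    p = 0.1%:  0.999^512 ~= 0.60    (~60% of 2MB blocks reclaimable)
    p = 1%:    0.99^512  ~= 0.006   (~0.6% of 2MB blocks reclaimable)

So a 10x increase in the fraction of writeback pages costs roughly two
orders of magnitude in THP allocation success.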

> ...
> > It's worth to note that running multiple flusher threads per bdi means
> > not only disk seeks for spin disks, smaller IO size for SSD, but also
> > lock contentions and cache bouncing for metadata heavy workloads and
> > fast storage.
>   Well, this heavily depends on particular implementation (and chosen
> data structures). But yes, we should have that in mind.

The lock contentions and cache bouncing actually mainly happen in fs
code due to concurrent IO submissions. Also, when replying to Vivek's
email I realized that the disk seeks and/or smaller IO size are more
fundamentally tied to the split async queues in cfq, which make it
switch inodes at every async time slice (typically 40ms).

> ...
> > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > > > It's always there doing 1:1 proportional throttling. Then you try to
> > > > kick in to add *double* throttling in block/cfq layer. Now the low
> > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > > > from its balanced state, leading to large fluctuations and program
> > > > stalls.
> > > 
> > > Just do the same 1:1 inside each cgroup.
> > 
> > Sure. But the ratio mismatch I'm talking about is inter-cgroup.
> > For example there are only 2 dd tasks doing buffered writes in the
> > system. Now consider the mismatch that cfq is dispatching their IO
> > requests at 10:1 weights, while balance_dirty_pages() is throttling
> > the dd tasks at 1:1 equal split because it's not aware of the cgroup
> > weights.
> > 
> > What will happen in the end? The 1:1 ratio imposed by
> > balance_dirty_pages() will take effect and the dd tasks will progress
> > at the same pace. The cfq weights will be defeated because the async
> > queue for the second dd (and cgroup) constantly runs empty.
>   Yup. This just shows that you have to have per-cgroup dirty limits. Once
> you have those, things start working again.

Right. I think Tejun was more or less aware of this.

I was rather upset by this per-memcg dirty_limit idea indeed. I never
expect it to work well when used extensively. My plan was to set the
default memcg dirty_limit high enough, so that it's not hit in normal.
Then Tejun came and proposed to (mis-)use dirty_limit as the way to
convert the dirty pages' backpressure into real dirty throttling rate.
No, that's just crazy idea!

Come on, let's not over-use memcg's dirty_limit. It's there as the
*last resort* to keep dirty pages under control so as to maintain
interactive performance inside the cgroup. However, if used extensively
in the system (like dozens of memcgs all hitting their dirty limits),
the limit itself may stall random dirtiers and create interactive
performance issues!

In recent days I've come up with the idea of memcg.dirty_setpoint
for the blkcg backpressure stuff. We can use that instead.

memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate.
Imagine bdi_setpoint; it's all the same concept. Why do we need this?
Because if blkcgs A and B have 10:1 weights and are both doing buffered
writes, their dirty pages are better maintained around a 10:1 ratio to
avoid underrun and hopefully achieve better IO size.
memcg.dirty_limit cannot guarantee that goal.
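
As a minimal sketch of the idea (all names below are made up for
illustration; this is not existing kernel code), the setpoint could
simply be the bdi-wide setpoint split in proportion to each blkcg's
recent writeout rate:

struct blkcg_dirty_info {
        unsigned long writeout_rate;    /* recent writeout bandwidth of this blkcg */
};

/*
 * Split the bdi-wide dirty setpoint among the blkcgs in proportion to
 * their recent writeout rates, so that blkcgs with 10:1 IO weights also
 * keep roughly 10:1 dirty pages and neither async queue underruns.
 */
static unsigned long memcg_dirty_setpoint(unsigned long bdi_setpoint,
                                          struct blkcg_dirty_info *bg,
                                          unsigned long total_writeout_rate)
{
        if (!total_writeout_rate)
                return bdi_setpoint;    /* no writeout yet: fall back to the bdi setpoint */

        return (unsigned long)((unsigned long long)bdi_setpoint *
                               bg->writeout_rate / total_writeout_rate);
}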

But be warned! Partitioning the dirty pages always means more
fluctuations of dirty rates (and even stalls) that are perceivable by
the user. That is yet another limiting factor for how well the
backpressure based IO controller can scale.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-20 13:34                 ` Fengguang Wu
  (?)
@ 2012-04-20 19:08                 ` Tejun Heo
  -1 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-20 19:08 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA,
	Mel Gorman

Hello, Fengguang.

On Fri, Apr 20, 2012 at 09:34:41PM +0800, Fengguang Wu wrote:
> >   Yup. This just shows that you have to have per-cgroup dirty limits. Once
> > you have those, things start working again.
> 
> Right. I think Tejun was more or less aware of this.

I'm fairly sure I'm on the "less" side of it.

> I was rather upset by this per-memcg dirty_limit idea indeed. I never
> expect it to work well when used extensively. My plan was to set the
> default memcg dirty_limit high enough, so that it's not hit in normal.
> Then Tejun came and proposed to (mis-)use dirty_limit as the way to
> convert the dirty pages' backpressure into real dirty throttling rate.
> No, that's just crazy idea!

I'll tell you what's crazy.

We're not gonna cut three more kernel releases and then change jobs.
Some of the stuff we put in the kernel ends up staying there for over
a decade.  While ignoring fundamental designs and violating layers may
look like rendering a quick solution.  They tend to come back and bite
our collective asses.  Ask Vivek.  The iosched / blkcg API was messed
up to the extent that bugs were so difficult to track down and it was
nearly impossible to add new features, let alone new blkcg policy or
elevator and people did suffer for that for long time.  I ended up
cleaning up the mess.  It took me longer than three months and even
then we have to carry on with a lot of ugly stuff for compatibility.

Unfortunately, your proposed solution is far worse than blkcg was or
ever could be.  It's not even contained in a single subsystem and it's
not even clear what it achieves.  Neither weight or hard limit can be
properly enforced without another layer of controlling at the block
layer (some use cases do expect strict enforcement) and we're baking
assumptions about use cases, interfaces and underlying hardware across
multiple subsystems (some ssds work fine with per-iops switching).
For your suggested solution, the moment it's best fit is now and it'll
be a long painful way down until someone snaps and reimplements the
whole thing.

The kernel is larger than balance_dirty_pages() or writeback.  Each
subsystem should do what it's supposed to do.  Let's solve problems
where they belong and pay overheads where they're due.  Let's not
contort the whole stack for the short term goal of shoving writeback
support into the existing, still-developing, blkcg cfq proportional IO
implementation.  Because that's pure insanity.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-20 12:45                 ` Fengguang Wu
  (?)
  (?)
@ 2012-04-20 19:29                 ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-20 19:29 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Fri, Apr 20, 2012 at 08:45:18PM +0800, Fengguang Wu wrote:

[..]
> If we still keep the global async queue, it can run small 40ms slices
> without defeating the flusher's 500ms granularity. After each slice
> it can freely switch to other cgroups with sync IOs, so it is free from
> latency issues. When it returns, it will continue to serve the same
> inode. It will basically be working on behalf of one cgroup for 500ms
> worth of data, then for another cgroup for 500ms worth of data, and so
> on. That behavior does not impact fairness, because it is still using
> small slices and its weight is computed system wide, which exhibits a
> smoothing/amortizing effect over long periods of time.

Ok, so Tejun did say that we will have a switch where we will allow
retaining the old behavior of keeping all async writes in the root
group and not in individual groups. So throughput sensitive users can
make use of that and there is no need to push proportional IO logic to
the writeback layer for buffered writes?

I am personally not too excited about the case of putting async IO
in separate groups, for the reason that async IO of one group will
start impacting latencies of sync IO of another group, and in practice
it might not be desirable. But there are others who have use cases for
separate async IO queues. So as long as the switch is there to change
the behavior, I am not too worried.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                   ` <20120420192930.GR22419-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-20 21:33                     ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-20 21:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Fri, Apr 20, 2012 at 03:29:30PM -0400, Vivek Goyal wrote:
> I am personally not too excited about the case of putting async IO
> in separate groups, for the reason that async IO of one group will
> start impacting latencies of sync IO of another group, and in practice
> it might not be desirable. But there are others who have use cases for
> separate async IO queues. So as long as the switch is there to change
> the behavior, I am not too worried.

Why not just fix cfq so that it prefers groups w/ sync IOs?

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                     ` <20120420213301.GA29134-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-22 14:26                       ` Fengguang Wu
  2012-04-23 12:30                       ` Vivek Goyal
  1 sibling, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-22 14:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

On Fri, Apr 20, 2012 at 02:33:01PM -0700, Tejun Heo wrote:
> On Fri, Apr 20, 2012 at 03:29:30PM -0400, Vivek Goyal wrote:
> > I am personally not too excited about the case of putting async IO
> > in separate groups, for the reason that async IO of one group will
> > start impacting latencies of sync IO of another group, and in practice
> > it might not be desirable. But there are others who have use cases for
> > separate async IO queues. So as long as the switch is there to change
> > the behavior, I am not too worried.
> 
> Why not just fix cfq so that it prefers groups w/ sync IOs?

There may be a sync+async group in front, but when cfq switches into
it, it decides to give its async queue a run. That's not necessarily a
bad decision, but we do lose some control here.

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                   ` <20120420190844.GH32324-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-22 14:46                     ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-22 14:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA,
	Mel Gorman

Hi Tejun,

On Fri, Apr 20, 2012 at 12:08:44PM -0700, Tejun Heo wrote:
> Hello, Fengguang.
> 
> On Fri, Apr 20, 2012 at 09:34:41PM +0800, Fengguang Wu wrote:
> > >   Yup. This just shows that you have to have per-cgroup dirty limits. Once
> > > you have those, things start working again.
> > 
> > Right. I think Tejun was more or less aware of this.
> 
> I'm fairly sure I'm on the "less" side of it.

OK. Sorry I should have explained why memcg dirty limit is not the
right tool for back pressure based throttling.

To limit memcg dirty pages, two thresholds will be introduced:


0                 call for flush                    dirty limit
------------------------*--------------------------------*----------------------->
                                                                memcg dirty pages

1) when dirty pages increase to "call for flush" point, the memcg will
   explicitly ask the flusher thread to focus more on this memcg's inodes

2) when "dirty limit" is reached, the dirtier tasks will be throttled
   the hard way
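
In rough pseudo-C, the check in the dirtying path would look like this
(all names are invented for illustration; this is not a real kernel
interface):

struct memcg_dirty_info {
        unsigned long nr_dirty;         /* dirty pages in this memcg */
        unsigned long call_for_flush;   /* threshold 1: ask the flusher for help */
        unsigned long dirty_limit;      /* threshold 2: hard limit */
};

void throttle_dirtier(struct memcg_dirty_info *m);       /* 2) block the dirtying task */
void wake_flusher_for_memcg(struct memcg_dirty_info *m); /* 1) bias the flusher to this memcg */

static void memcg_balance_dirty(struct memcg_dirty_info *m)
{
        if (m->nr_dirty >= m->dirty_limit)
                throttle_dirtier(m);
        else if (m->nr_dirty >= m->call_for_flush)
                wake_flusher_for_memcg(m);
}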

When there are few memcgs, or when the safety margin between the two
thresholds are large enough, the dirty limit won't be hit and all goes
virtually as smooth as when there are only global dirty limits.

Otherwise the memcg dirty limit will be occasionally hit, but still
should drop soon when the flusher thread round-robin to this memcg. 

Basically the more memcgs with dirty limits, the more hard time for
the flusher to serve them fairly and knock down their dirty pages in
time. Because the flusher works inode by inode, each one may take up
to 0.5 second, and there may be many memcgs asking for the flusher's
attention. Also the more memcgs, the global dirty pages pool are
partitioned into smaller pieces, which means smaller safety margin for
each memcg. Adding these two effects up, there may be constantly some
memcgs hitting their dirty limits when there are dozens of memcgs.
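
For a rough feel of the numbers (my own illustration): with 24 memcgs
being served round-robin at up to 0.5 second per inode, a given memcg
may wait on the order of 10 seconds before the flusher comes back to
it, during which its dirty pages can only grow.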

Hitting the dirty limits means all dirtier tasks, including the light
dirtiers who do occasional writes, become painfully slow. It's a bad
state that should be avoided by any means.

Now consider the back pressure case. When the user configures two
blkcgs with 10:1 weights, the flusher will have great difficulty
writing out pages for the latter blkcg. The corresponding memcg's dirty
pages rush straight to its dirty limit, _stay_ there and can never
drop to normal. This means the latter blkcg's tasks will constantly
see second-long stalls.

The solution would be to create an adaptive threshold blkcg.bdi.dirty_setpoint
that's proportional to its buffered writeout bandwidth and teach
balance_dirty_pages() to balance dirty pages around that target.

It avoids the worst case of hitting dirty_limit. However it may still
present big challenges to balance_dirty_pages(). For example, when
there are 10 blkcgs and 12 JBOD disks, it may create up to 10*12=120
dirty balance targets. Wow I cannot imagine how it's going to fulfill
so many different targets.

> > I was rather upset by this per-memcg dirty_limit idea indeed. I never
> > expect it to work well when used extensively. My plan was to set the
> > default memcg dirty_limit high enough, so that it's not hit in normal.
> > Then Tejun came and proposed to (mis-)use dirty_limit as the way to
> > convert the dirty pages' backpressure into real dirty throttling rate.
> > No, that's just crazy idea!
> 
> I'll tell you what's crazy.
> 
> We're not gonna cut three more kernel releases and then change jobs.
> Some of the stuff we put in the kernel ends up staying there for over
> a decade.  While ignoring fundamental designs and violating layers may
> look like rendering a quick solution.  They tend to come back and bite
> our collective asses.  Ask Vivek.  The iosched / blkcg API was messed
> up to the extent that bugs were so difficult to track down and it was
> nearly impossible to add new features, let alone new blkcg policy or
> elevator and people did suffer for that for long time.  I ended up
> cleaning up the mess.  It took me longer than three months and even
> then we have to carry on with a lot of ugly stuff for compatibility.

"block/cfq-iosched.c" 3930L

Yeah, it's a big pile of tricky code. In spite of that, the code
structure still looks pretty neat, kudos to all of you!

> Unfortunately, your proposed solution is far worse than blkcg was or
> ever could be.  It's not even contained in a single subsystem and it's
> not even clear what it achieves.

Yeah, it's cross subsystems, mainly because there are two natural
throttling points: balance_dirty_pages() and cfq. It requires both
sides to work properly.

In my proposal, balance_dirty_pages() takes care to update the
weights for async/direct IO on every 200ms and store it in blkcg.
cfq then grabs the weights to update the cfq group's vdisktime.

Such cross subsystem coordinations still look natural to me because
"weight" is a fundamental and general parameter. It's really a blkcg
thing (determined by the blkio.weight user interface) rather than
specifically tied to cfq. When another kernel entity (eg. NFS or noop)
decides to add support for proportional weight IO control in future,
it can make use of the weights calculated by balance_dirty_pages(), too.
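
To make the division of labor concrete, here is a hedged sketch of the
scheduler side (illustrative names only, not actual cfq code): charge
virtual disk time inversely to the weight published from above, so a
group with 10x the weight advances its vdisktime 10x more slowly for
the same amount of service:

#define WEIGHT_DEFAULT  500     /* assumed baseline, similar in spirit to cfq's default weight */

/*
 * Charge a group for 'service' (microseconds of disk time, or a number
 * of IOs) using the weight computed by balance_dirty_pages().
 */
static unsigned long long charge_vdisktime(unsigned long long vdisktime,
                                           unsigned long long service,
                                           unsigned int weight)
{
        if (!weight)
                weight = WEIGHT_DEFAULT;        /* defensive: never divide by zero */

        return vdisktime + service * WEIGHT_DEFAULT / weight;
}

Whether "service" is measured in disk time or in IOs is then purely the
IO scheduler's business.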

That scheme does involve non-trivial complexities in the calculations;
however, IMHO it sucks much less than letting cfq take control and
conveying the information all the way up to balance_dirty_pages() via
"backpressure".

When balance_dirty_pages() takes part in the job, it merely costs some
per-cpu accounting and calculations every 200ms -- both scale
pretty well.  Virtually nothing changes in how buffered IO is performed
before/after applying IO controllers. From the users' perspective:

        - No more latency
        - No performance drop
        - No bumpy progress and stalls
        - No need to attach memcg to blkcg
        - Feel free to create 1000+ IO controllers, to your heart's content,
          w/o worrying about costs (if any, they would be existing
          scalability issues)

On the other hand, the back pressure scheme makes Linux more clumsy by
vectorizing everything from the bottom up, giving rise to a number of
problems:

- in cfq, by splitting up the global async queue, cfq suddenly sees a
  number of cfq groups full of async requests lining up and competing
  for disk time. This could obscure things and add difficulty in
  maintaining low latency for sync requests.

- in cfq, it will now be switching inodes based on the 40ms async
  slice time, which defeats the flusher thread's 500ms inode slice
  time. The below numbers show the performance cost of lowering the
  flusher's slices to ~40ms:

  3.4.0-rc2             3.4.0-rc2-4M+  
-----------  ------------------------  
     114.02        -4.2%       109.23  snb/thresh=8G/xfs-1dd-1-3.4.0-rc2
     102.25       -11.7%        90.24  snb/thresh=8G/xfs-10dd-1-3.4.0-rc2
     104.17       -17.5%        85.91  snb/thresh=8G/xfs-20dd-1-3.4.0-rc2
     104.94       -18.7%        85.28  snb/thresh=8G/xfs-30dd-1-3.4.0-rc2
     104.76       -21.9%        81.82  snb/thresh=8G/xfs-100dd-1-3.4.0-rc2

  We can do the optimization of increasing the cfq async time slice
  when there is no sync IO. However, in general cases it could still hurt.

- in cfq, the many more async queues will be holding many more async
  requests in order to prevent queue underrun. This proportionally
  scales up the number of writeback pages, which in turn exponentially
  scales up the difficulty of reclaiming high order pages:

          P(reclaimable for THP) = P(non-PG_writeback)^512

  That means we cannot comfortably use THP in a system with more than
  0.1% writeback pages. Perhaps we need to work out some general
  optimizations to make writeback pages more concentrated in the
  physical memory space.

  Besides, when there are N seconds worth of writeback pages, it may
  take N/2 seconds on average for wait_on_page_writeback() to finish.
  So the total time cost of running into a random writeback page and
  waiting on it is also O(n^2):

        E(PG_writeback waits) = P(hit PG_writeback) * E(wait on it)

  That means we can hardly keep more than 1-second worth of writeback
  pages w/o worrying about long waits on PG_writeback in various parts
  of the kernel.

- in the flusher, we'll need to vectorize the dirty inode lists;
  that's fine. However, we either need to create one flusher per blkcg,
  which has the problem of intensifying various fs lock contentions, or
  let one single flusher walk through the blkcgs, which risks more
  cfq queue underruns. We may decrease the flusher's time slice or
  increase the queue size to mitigate this, however neither looks like
  an exciting way forward.

- balance_dirty_pages() will need to keep each blkcg's dirty pages at
  a reasonable level, otherwise there may be starvations that defeat the
  low level IO controllers and hurt IO size. Thus comes the very
  undesirable need to attach memcg to blkcg to track dirty pages.

  It's also not fun to work with dozens of dirty page targets because
  dirty pages tend to fluctuate a lot. In comparison, it's far easier
  for balance_dirty_pages() to dirty-ratelimit 1000+ dd tasks in the
  global context.

In summary, the back pressure scheme looks obvious at first sight,
however there are some fundamental problems in its way. Cgroups are
expected to be *lightweight* facilities. Unfortunately this scheme
will likely impose too much burden and too many side effects on the
system. It might become uncomfortable for the user to run 10+ blkcgs...

> Neither weight or hard limit can be
> properly enforced without another layer of controlling at the block
> layer (some use cases do expect strict enforcement) and we're baking
> assumptions about use cases, interfaces and underlying hardware across
> multiple subsystems (some ssds work fine with per-iops switching).

cfq still has the freedom to do per-iops switching, based on the same
weight values computed by balance_dirty_pages(). cfq will need to feed
back some "IO cost" stats based on either disk time or iops, upon
which balance_dirty_pages() scales the throttling bandwidth for the
dirtier tasks by the "IO cost". balance_dirty_pages() can also do IOPS
hard limits based on the scaled throttling bandwidth.
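
As a hedged sketch of that feedback loop (again, invented names): if
each blkcg reports an average IO cost per written page (derived from
either disk time or iops), balance_dirty_pages() can scale the
cgroup's throttling bandwidth down by that cost relative to a
reference cost:

struct blkcg_io_cost {
        unsigned long cost_per_page;    /* e.g. usecs of disk time per page written */
};

/*
 * Scale a cgroup's dirty throttling bandwidth (pages/s) by the IO cost
 * reported from below, so that "expensive" IO is dirtied more slowly.
 */
static unsigned long scale_dirty_ratelimit(unsigned long base_ratelimit,
                                           struct blkcg_io_cost *c,
                                           unsigned long ref_cost_per_page)
{
        if (!c->cost_per_page)
                return base_ratelimit;

        return (unsigned long)((unsigned long long)base_ratelimit *
                               ref_cost_per_page / c->cost_per_page);
}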

> For your suggested solution, the moment it's best fit is now and it'll
> be a long painful way down until someone snaps and reimplements the
> whole thing.
>
> The kernel is larger than balance_dirty_pages() or writeback.  Each
> subsystem should do what it's supposed to do.  Let's solve problems
> where they belong and pay overheads where they're due.  Let's not
> contort the whole stack for the short term goal of shoving writeback
> support into the existing, still-developing, blkcg cfq proportional IO
> implementation.  Because that's pure insanity.

To be frank, I would be very pleased to avoid going into the pain of
doing all the hairy computations to graft balance_dirty_pages() onto
cfq, if only the back pressure idea were not so upsetting. And if there
are proper ways to address its problems, it would be a great relief
for me to stop pondering the details of disk time/IOPS feedback and
the hierarchical support (yeah, I think it's somehow possible now), and
the foreseeable _numerous_ experiments needed to get the ideas into shape...

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                   ` <20120420190844.GH32324-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2012-04-22 14:46                     ` Fengguang Wu
@ 2012-04-22 14:46                     ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-22 14:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf, Mel Gorman

Hi Tejun,

On Fri, Apr 20, 2012 at 12:08:44PM -0700, Tejun Heo wrote:
> Hello, Fengguang.
> 
> On Fri, Apr 20, 2012 at 09:34:41PM +0800, Fengguang Wu wrote:
> > >   Yup. This just shows that you have to have per-cgroup dirty limits. Once
> > > you have those, things start working again.
> > 
> > Right. I think Tejun was more or less aware of this.
> 
> I'm fairly sure I'm on the "less" side of it.

OK. Sorry I should have explained why memcg dirty limit is not the
right tool for back pressure based throttling.

To limit memcg dirty pages, two thresholds will be introduced:


0                 call for flush                    dirty limit
------------------------*--------------------------------*----------------------->
                                                                memcg dirty pages

1) when dirty pages increase to "call for flush" point, the memcg will
   explicitly ask the flusher thread to focus more on this memcg's inodes

2) when "dirty limit" is reached, the dirtier tasks will be throttled
   the hard way

When there are few memcgs, or when the safety margin between the two
thresholds is large enough, the dirty limit won't be hit and everything
goes virtually as smoothly as when there are only global dirty limits.

Otherwise the memcg dirty limit will occasionally be hit, but the dirty
count should still drop soon when the flusher thread round-robins to
this memcg.

Basically, the more memcgs have dirty limits, the harder it is for the
flusher to serve them fairly and knock down their dirty pages in time.
The flusher works inode by inode, each of which may take up to 0.5
seconds, and there may be many memcgs asking for the flusher's
attention. Also, the more memcgs there are, the more finely the global
dirty page pool is partitioned, which means a smaller safety margin for
each memcg. Adding these two effects up, there may constantly be some
memcgs hitting their dirty limits once there are dozens of memcgs.

Hitting the dirty limit means all dirtying tasks, including the light
dirtiers that only do occasional writes, become painfully slow. It's a
bad state that should be avoided by all means.

Now consider the back pressure case. When the user configures two
blkcgs with 10:1 weights, the flusher will have great difficulty
writing out pages for the latter blkcg. The corresponding memcg's dirty
pages rush straight to its dirty limit, _stay_ there and never drop
back to normal. This means the latter blkcg's tasks will constantly
see second-long stalls.

The solution would be to create an adaptive threshold blkcg.bdi.dirty_setpoint
that's proportional to its buffered writeout bandwidth and teach
balance_dirty_pages() to balance dirty pages around that target.

It avoids the worst case of hitting the dirty_limit. However, it may
still present big challenges to balance_dirty_pages(). For example,
when there are 10 blkcgs and 12 JBOD disks, it may create up to
10*12=120 dirty balance targets. Wow, I cannot imagine how it's going
to satisfy so many different targets.
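
Computing each target is the easy part -- roughly the sketch below
(invented names, one value per blkcg per bdi); the hard part is
balancing dirty pages around 120 of them at once:

/* split the bdi's dirty target by each blkcg's share of the writeout bandwidth */
static unsigned long blkcg_bdi_dirty_setpoint(unsigned long bdi_dirty_target,
					      unsigned long blkcg_write_bw,
					      unsigned long bdi_write_bw)
{
	if (!bdi_write_bw)
		return 0;
	return (unsigned long)((unsigned long long)bdi_dirty_target *
			       blkcg_write_bw / bdi_write_bw);
}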

> > I was rather upset by this per-memcg dirty_limit idea indeed. I never
> > expect it to work well when used extensively. My plan was to set the
> > default memcg dirty_limit high enough, so that it's not hit in normal.
> > Then Tejun came and proposed to (mis-)use dirty_limit as the way to
> > convert the dirty pages' backpressure into real dirty throttling rate.
> > No, that's just crazy idea!
> 
> I'll tell you what's crazy.
> 
> We're not gonna cut three more kernel releases and then change jobs.
> Some of the stuff we put in the kernel ends up staying there for over
> a decade.  While ignoring fundamental designs and violating layers may
> look like rendering a quick solution.  They tend to come back and bite
> our collective asses.  Ask Vivek.  The iosched / blkcg API was messed
> up to the extent that bugs were so difficult to track down and it was
> nearly impossible to add new features, let alone new blkcg policy or
> elevator and people did suffer for that for long time.  I ended up
> cleaning up the mess.  It took me longer than three months and even
> then we have to carry on with a lot of ugly stuff for compatibility.

"block/cfq-iosched.c" 3930L

Yeah it's a big pile of tricky code. Despite that, the code
structure still looks pretty neat, kudos to all of you!

> Unfortunately, your proposed solution is far worse than blkcg was or
> ever could be.  It's not even contained in a single subsystem and it's
> not even clear what it achieves.

Yeah, it's cross-subsystem, mainly because there are two natural
throttling points: balance_dirty_pages() and cfq. It requires both
sides to work properly.

In my proposal, balance_dirty_pages() updates the weights for
async/direct IO every 200ms and stores them in the blkcg. cfq then
grabs the weights to update the cfq group's vdisktime.

Such cross-subsystem coordination still looks natural to me because
"weight" is a fundamental and general parameter. It's really a blkcg
thing (determined by the blkio.weight user interface) rather than
something specifically tied to cfq. When another kernel entity (e.g.
NFS or noop) decides to add support for proportional-weight IO control
in the future, it can make use of the weights calculated by
balance_dirty_pages(), too.
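
The data flow I have in mind is roughly the following (field and
helper names are made up for illustration; the real vdisktime update
in cfq is more involved):

struct blkcg_weight_sketch {
	unsigned int bdp_weight;	/* published by balance_dirty_pages() every ~200ms */
};

/* writeback side: store the computed weight for this blkcg */
static void bdp_publish_weight(struct blkcg_weight_sketch *blkcg, unsigned int weight)
{
	blkcg->bdp_weight = weight;
}

/* cfq side: a heavier weight makes the group's vdisktime advance more slowly */
static unsigned long long cfqg_vdisktime_delta(const struct blkcg_weight_sketch *blkcg,
					       unsigned long long charge)
{
	unsigned int w = blkcg->bdp_weight ? blkcg->bdp_weight : 1;

	return charge * 500 / w;	/* 500 == nominal default weight */
}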

That scheme does involve non-trivial complexity in the calculations;
however, IMHO it sucks much less than letting cfq take control and
convey the information all the way up to balance_dirty_pages() via
"backpressure".

When balance_dirty_pages() takes part in the job, it merely costs some
per-cpu accounting and a round of calculations every 200ms -- both
scale pretty well.  Virtually nothing changes in how buffered IO is
performed before/after applying IO controllers. From the users'
perspective:

        - No extra latency
        - No performance drop
        - No bumpy progress or stalls
        - No need to attach memcg to blkcg
        - Feel free to create 1000+ IO controllers to your heart's
          content, w/o worrying about costs (if any, they would be
          existing scalability issues)

On the other hand, the back pressure scheme makes Linux more clumsy by
vectorizing everything from the bottom up, giving rise to a number of
problems:

- in cfq, by splitting up the global async queue, cfq suddenly sees a
  number of cfq groups full of async requests lining up and competing
  for disk time. This could obscure things and make it harder to
  maintain low latency for sync requests.

- in cfq, it will now be switching inodes based on the 40ms async
  slice time, which defeats the flusher thread's 500ms inode slice
  time. The below numbers show the performance cost of lowering the
  flusher's slices to ~40ms:

  3.4.0-rc2             3.4.0-rc2-4M+  
-----------  ------------------------  
     114.02        -4.2%       109.23  snb/thresh=8G/xfs-1dd-1-3.4.0-rc2
     102.25       -11.7%        90.24  snb/thresh=8G/xfs-10dd-1-3.4.0-rc2
     104.17       -17.5%        85.91  snb/thresh=8G/xfs-20dd-1-3.4.0-rc2
     104.94       -18.7%        85.28  snb/thresh=8G/xfs-30dd-1-3.4.0-rc2
     104.76       -21.9%        81.82  snb/thresh=8G/xfs-100dd-1-3.4.0-rc2

  We can do the optimization of increasing the cfq async time slice
  when there is no sync IO. However, in the general case it could
  still hurt.

- in cfq, the many more async queues will be holding many more async
  requests in order to prevent queue underrun. This proportionally
  scales up the number of writeback pages, which in turn exponentially
  scales up the difficulty of reclaiming high-order pages:

          P(reclaimable for THP) = P(non-PG_writeback)^512

  That means we cannot comfortably use THP in a system with more than
  0.1% writeback pages (at 0.1%, 0.999^512 is roughly 0.6, so about
  40% of the 2MB blocks already contain at least one PG_writeback
  page). Perhaps we need to work out some general optimizations to
  make writeback pages more concentrated in the physical memory space.

  Besides, when there are N seconds' worth of writeback pages, it may
  take N/2 seconds on average for wait_on_page_writeback() to finish.
  Both the probability of running into a random writeback page and the
  expected wait on it scale with N, so the total time cost is O(N^2):

        E(PG_writeback waits) = P(hit PG_writeback) * E(wait on it)

  That means we can hardly keep more than 1 second's worth of writeback
  pages w/o worrying about long waits on PG_writeback in various parts
  of the kernel.

- in the flusher, we'll need to vectorize the dirty inode lists;
  that's fine. However, we either need to create one flusher per
  blkcg, which intensifies various fs lock contentions, or let one
  single flusher walk through the blkcgs, which risks more cfq queue
  underruns. We may decrease the flusher's time slice or increase the
  queue size to mitigate this, however neither looks like an exciting
  way out.

- balance_dirty_pages() will need to keep each blkcg's dirty pages at
  a reasonable level, otherwise there may be starvations that defeat
  the low-level IO controllers and hurt IO size. Thus comes the very
  undesirable need to attach a memcg to each blkcg to track dirty
  pages.

  It's also not fun to work with dozens of dirty page targets because
  dirty pages tend to fluctuate a lot. In comparison, it's far easier
  for balance_dirty_pages() to dirty-ratelimit 1000+ dd tasks in the
  global context.

In summary, the back pressure scheme looks obvious at first sight,
however there are some fundamental problems in the way. Cgroups are
expected to be *lightweight* facilities. Unfortunately this scheme will
likely impose too much burden and too many side effects on the system.
It might become uncomfortable for the user to run 10+ blkcgs...

> Neither weight or hard limit can be
> properly enforced without another layer of controlling at the block
> layer (some use cases do expect strict enforcement) and we're baking
> assumptions about use cases, interfaces and underlying hardware across
> multiple subsystems (some ssds work fine with per-iops switching).

cfq still has the freedom to do per-iops switching, based on the same
weight values computed by balance_dirty_pages(). cfq will need to feed
back some "IO cost" stats, based on either disk time or IOPS, and
balance_dirty_pages() will scale the throttling bandwidth of the
dirtying tasks by that "IO cost". balance_dirty_pages() can also
enforce IOPS hard limits based on the scaled throttling bandwidth.
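
One possible way to apply such a feedback, just to make the idea
concrete (the exact scaling rule is of course up for discussion, and
the names here are invented):

/* scale a cgroup's dirty throttling bandwidth by its relative IO cost */
static unsigned long bdp_scale_ratelimit(unsigned long base_ratelimit,
					 unsigned long io_cost,	    /* per page, this cgroup */
					 unsigned long avg_io_cost) /* per page, device-wide */
{
	if (!io_cost)
		return base_ratelimit;
	/* costlier-than-average IO (seeky, small) earns a lower dirty rate */
	return (unsigned long)((unsigned long long)base_ratelimit *
			       avg_io_cost / io_cost);
}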

> For your suggested solution, the moment it's best fit is now and it'll
> be a long painful way down until someone snaps and reimplements the
> whole thing.
>
> The kernel is larger than balance_dirty_pages() or writeback.  Each
> subsystem should do what it's supposed to do.  Let's solve problems
> where they belong and pay overheads where they're due.  Let's not
> contort the whole stack for the short term goal of shoving writeback
> support into the existing, still-developing, blkcg cfq proportional IO
> implementation.  Because that's pure insanity.

To be frank, I would be very pleased to avoid the pain of doing all
the hairy computations needed to graft balance_dirty_pages() onto cfq,
if only the back pressure idea were not so problematic. And if there
are proper ways to address its problems, it would be a great relief
for me to stop pondering the details of disk time/IOPS feedback and
hierarchical support (which I now think is somehow possible), and the
foreseeable _numerous_ experiments needed to get the ideas into shape...

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-20 13:34                 ` Fengguang Wu
  (?)
@ 2012-04-23  9:14                   ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-23  9:14 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA, Mel Gorman

On Fri 20-04-12 21:34:41, Wu Fengguang wrote:
> On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote:
> > > It's not uncommon for me to see filesystems sleep on PG_writeback
> > > pages during heavy writeback, within some lock or transaction, which in
> > > turn stall many tasks that try to do IO or merely dirty some page in
> > > memory. Random writes are especially susceptible to such stalls. The
> > > stable page feature also vastly increase the chances of stalls by
> > > locking the writeback pages. 
> > > 
> > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
> > > the case of direct reclaim, it means blocking random tasks that are
> > > allocating memory in the system.
> > > 
> > > PG_writeback pages are much worse than PG_dirty pages in that they are
> > > not movable. This makes a big difference for high-order page allocations.
> > > To make room for a 2MB huge page, vmscan has the option to migrate
> > > PG_dirty pages, but for PG_writeback it has no better choices than to
> > > wait for IO completion.
> > > 
> > > The difficulty of THP allocation goes up *exponentially* with the
> > > number of PG_writeback pages. Assume PG_writeback pages are randomly
> > > distributed in the physical memory space. Then we have formula
> > > 
> > >         P(reclaimable for THP) = 1 - P(hit PG_writeback)^256
> >   Well, this implicitely assumes that PG_Writeback pages are scattered
> > across memory uniformly at random. I'm not sure to which extent this is
> > true...
> 
> Yeah, when describing the problem I was also thinking about the
> possibilities of optimization (it would be a very good general
> improvements). Or maybe Mel already has some solutions :)
> 
> > Also as a nitpick, this isn't really an exponential growth since
> > the exponent is fixed (256 - actually it should be 512, right?). It's just
> 
> Right, 512 4k pages to form one x86_64 2MB huge pages.
> 
> > a polynomial with a big exponent. But sure, growth in number of PG_Writeback
> > pages will cause relatively steep drop in the number of available huge
> > pages.
> 
> It's exponential indeed, because "1 - p(x)" here means "p(!x)".
> It's exponential for a 10x increase in x resulting in 100x drop of y.
  If 'x' is the probability that a page has PG_Writeback set, then the
probability that a huge page does not contain a single PG_Writeback
page is (as you almost correctly wrote): (1-x)^512. This is a
polynomial by definition: it can be expressed as $\sum_{i=0}^n a_i*x^i$
for $a_i\in R$ and $n$ finite.

The expression decreases fast as x approaches 1, that's for sure, but
that does not make it exponential. Sorry, my mathematical part could
not resist this terminology correction.

> > ...
> > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > > > > It's always there doing 1:1 proportional throttling. Then you try to
> > > > > kick in to add *double* throttling in block/cfq layer. Now the low
> > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > > > > from its balanced state, leading to large fluctuations and program
> > > > > stalls.
> > > > 
> > > > Just do the same 1:1 inside each cgroup.
> > > 
> > > Sure. But the ratio mismatch I'm talking about is inter-cgroup.
> > > For example there are only 2 dd tasks doing buffered writes in the
> > > system. Now consider the mismatch that cfq is dispatching their IO
> > > requests at 10:1 weights, while balance_dirty_pages() is throttling
> > > the dd tasks at 1:1 equal split because it's not aware of the cgroup
> > > weights.
> > > 
> > > What will happen in the end? The 1:1 ratio imposed by
> > > balance_dirty_pages() will take effect and the dd tasks will progress
> > > at the same pace. The cfq weights will be defeated because the async
> > > queue for the second dd (and cgroup) constantly runs empty.
> >   Yup. This just shows that you have to have per-cgroup dirty limits. Once
> > you have those, things start working again.
> 
> Right. I think Tejun was more of less aware of this.
> 
> I was rather upset by this per-memcg dirty_limit idea indeed. I never
> expect it to work well when used extensively. My plan was to set the
> default memcg dirty_limit high enough, so that it's not hit in normal.
> Then Tejun came and proposed to (mis-)use dirty_limit as the way to
> convert the dirty pages' backpressure into real dirty throttling rate.
> No, that's just crazy idea!
> 
> Come on, let's not over-use memcg's dirty_limit. It's there as the
> *last resort* to keep dirty pages under control so as to maintain
> interactive performance inside the cgroup. However if used extensively
> in the system (like dozens of memcgs all hit their dirty limits), the
> limit itself may stall random dirtiers and create interactive
> performance issues!
> 
> In the recent days I've come up with the idea of memcg.dirty_setpoint
> for the blkcg backpressure stuff. We can use that instead.
> 
> memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate.
> Imagine bdi_setpoint. It's all the same concepts. Why we need this?
> Because if blkcg A and B does 10:1 weights and are both doing buffered
> writes, their dirty pages should better be maintained around 10:1
> ratio to avoid underrun and hopefully achieve better IO size.
> memcg.dirty_limit cannot guarantee that goal.
  I agree that to avoid stalls of throttled processes we shouldn't be
hitting memcg.dirty_limit on a regular basis. When I wrote we need "per
cgroup dirty limits" I actually imagined something like what you write
above - do the complete throttling computations within each memcg:
estimate the throughput available to it, compute appropriate dirty
rates for its processes, and from its dirty limit estimate an
appropriate setpoint to balance around.
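
A deliberately oversimplified sketch of that per-memcg step (made-up
names; the real balance_dirty_pages() control law is far more
elaborate):

struct memcg_wb_sketch {
	unsigned long write_bw;		/* estimated writeout bandwidth of the memcg */
	unsigned long dirty;		/* current dirty pages in the memcg */
	unsigned long dirty_limit;	/* per-memcg hard limit */
	unsigned long ratelimit;	/* resulting per-task dirty ratelimit */
};

static void memcg_update_ratelimit(struct memcg_wb_sketch *m)
{
	/* balance around a setpoint safely below the hard limit */
	unsigned long setpoint = m->dirty_limit * 3 / 4;

	if (m->dirty <= setpoint)
		m->ratelimit = m->write_bw;	/* no pressure: dirty at full writeout bandwidth */
	else if (m->dirty < m->dirty_limit)
		/* squeeze the rate linearly as dirty pages approach the limit */
		m->ratelimit = m->write_bw * (m->dirty_limit - m->dirty) /
			       (m->dirty_limit - setpoint);
	else
		m->ratelimit = 0;		/* at the limit: hard throttling */
}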

> But be warned! Partitioning the dirty pages always means more
> fluctuations of dirty rates (and even stalls) that's perceivable by
> the user. Which means another limiting factor for the backpressure
> based IO controller to scale well.
  Sure, the smaller the memcg gets, the more noticeable these
fluctuations will be. I would not expect a memcg with 200 MB of memory
to behave better (nor much worse) than a machine with that much
memory...

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-23  9:14                   ` Jan Kara
@ 2012-04-23 10:24                     ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-23 10:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf, Mel Gorman

On Mon, Apr 23, 2012 at 11:14:32AM +0200, Jan Kara wrote:
> On Fri 20-04-12 21:34:41, Wu Fengguang wrote:
> > On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote:
> > > > It's not uncommon for me to see filesystems sleep on PG_writeback
> > > > pages during heavy writeback, within some lock or transaction, which in
> > > > turn stall many tasks that try to do IO or merely dirty some page in
> > > > memory. Random writes are especially susceptible to such stalls. The
> > > > stable page feature also vastly increase the chances of stalls by
> > > > locking the writeback pages. 
> > > > 
> > > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
> > > > the case of direct reclaim, it means blocking random tasks that are
> > > > allocating memory in the system.
> > > > 
> > > > PG_writeback pages are much worse than PG_dirty pages in that they are
> > > > not movable. This makes a big difference for high-order page allocations.
> > > > To make room for a 2MB huge page, vmscan has the option to migrate
> > > > PG_dirty pages, but for PG_writeback it has no better choices than to
> > > > wait for IO completion.
> > > > 
> > > > The difficulty of THP allocation goes up *exponentially* with the
> > > > number of PG_writeback pages. Assume PG_writeback pages are randomly
> > > > distributed in the physical memory space. Then we have formula
> > > > 
> > > >         P(reclaimable for THP) = 1 - P(hit PG_writeback)^256
> > >   Well, this implicitely assumes that PG_Writeback pages are scattered
> > > across memory uniformly at random. I'm not sure to which extent this is
> > > true...
> > 
> > Yeah, when describing the problem I was also thinking about the
> > possibilities of optimization (it would be a very good general
> > improvements). Or maybe Mel already has some solutions :)
> > 
> > > Also as a nitpick, this isn't really an exponential growth since
> > > the exponent is fixed (256 - actually it should be 512, right?). It's just
> > 
> > Right, 512 4k pages to form one x86_64 2MB huge pages.
> > 
> > > a polynomial with a big exponent. But sure, growth in number of PG_Writeback
> > > pages will cause relatively steep drop in the number of available huge
> > > pages.
> > 
> > It's exponential indeed, because "1 - p(x)" here means "p(!x)".
> > It's exponential for a 10x increase in x resulting in 100x drop of y.
>   If 'x' is the probability page has PG_Writeback set, then the probability
> a huge page has a single PG_Writeback page is (as you almost correctly wrote):
> (1-x)^512. This is a polynominal by the definition: It can be
> expressed as $\sum_{i=0}^n a_i*x^i$ for $a_i\in R$ and $n$ finite.
> 
> The expression decreases fast as x approaches to 1, that's for sure, but
> that does not make it exponential. Sorry, my mathematical part could not
> resist this terminology correction.

ok, ok :-)

I actually got the equation wrong above; the one used in the script is
correct. The correct statement is that it takes all 512 component pages
to be free of PG_writeback for the huge page to be free of PG_writeback
and immediately reclaimable for THP:

P(reclaimable for THP) = P(non-PG_writeback)^512
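
Just to illustrate the sensitivity numerically (this little program is
mine, not the script referenced above, and it only assumes the
uniform-random distribution discussed earlier): with 1% of pages under
writeback, well under 1% of 2MB regions remain immediately reclaimable.

/* build: cc thp-prob.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
        double p;       /* fraction of pages currently under writeback */

        /* 512 x 4k pages make up one x86_64 2MB huge page */
        for (p = 0.0001; p < 0.02; p *= 10)
                printf("writeback fraction %.4f -> P(THP reclaimable) = %.4f\n",
                       p, pow(1.0 - p, 512));
        return 0;
}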

> > > ...
> > > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > > > > > It's always there doing 1:1 proportional throttling. Then you try to
> > > > > > kick in to add *double* throttling in block/cfq layer. Now the low
> > > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > > > > > from its balanced state, leading to large fluctuations and program
> > > > > > stalls.
> > > > > 
> > > > > Just do the same 1:1 inside each cgroup.
> > > > 
> > > > Sure. But the ratio mismatch I'm talking about is inter-cgroup.
> > > > For example there are only 2 dd tasks doing buffered writes in the
> > > > system. Now consider the mismatch that cfq is dispatching their IO
> > > > requests at 10:1 weights, while balance_dirty_pages() is throttling
> > > > the dd tasks at 1:1 equal split because it's not aware of the cgroup
> > > > weights.
> > > > 
> > > > What will happen in the end? The 1:1 ratio imposed by
> > > > balance_dirty_pages() will take effect and the dd tasks will progress
> > > > at the same pace. The cfq weights will be defeated because the async
> > > > queue for the second dd (and cgroup) constantly runs empty.
> > >   Yup. This just shows that you have to have per-cgroup dirty limits. Once
> > > you have those, things start working again.
> > 
> > Right. I think Tejun was more of less aware of this.
> > 
> > I was rather upset by this per-memcg dirty_limit idea indeed. I never
> > expect it to work well when used extensively. My plan was to set the
> > default memcg dirty_limit high enough, so that it's not hit in normal.
> > Then Tejun came and proposed to (mis-)use dirty_limit as the way to
> > convert the dirty pages' backpressure into real dirty throttling rate.
> > No, that's just crazy idea!
> > 
> > Come on, let's not over-use memcg's dirty_limit. It's there as the
> > *last resort* to keep dirty pages under control so as to maintain
> > interactive performance inside the cgroup. However if used extensively
> > in the system (like dozens of memcgs all hit their dirty limits), the
> > limit itself may stall random dirtiers and create interactive
> > performance issues!
> > 
> > In the recent days I've come up with the idea of memcg.dirty_setpoint
> > for the blkcg backpressure stuff. We can use that instead.
> > 
> > memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate.
> > Imagine bdi_setpoint. It's all the same concepts. Why we need this?
> > Because if blkcg A and B does 10:1 weights and are both doing buffered
> > writes, their dirty pages should better be maintained around 10:1
> > ratio to avoid underrun and hopefully achieve better IO size.
> > memcg.dirty_limit cannot guarantee that goal.
>   I agree that to avoid stalls of throttled processes we shouldn't be
> hitting memcg.dirty_limit on a regular basis. When I wrote we need "per
> cgroup dirty limits" I actually imagined something like you write above -
> do complete throttling computations within each memcg - estimate throughput
> available for it, compute appropriate dirty rates for it's processes and
> from its dirty limit estimate appropriate setpoint to balance around.
> 

Yes. balance_dirty_pages() will need both the number of dirty pages and
the dirty page writeout rate for the cgroup in order to do proper dirty
throttling for it.
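
As a rough sketch of the setpoint scaling I mean (names invented here,
not an existing interface):

/*
 * Sketch only: split the global dirty setpoint among memcgs in
 * proportion to their measured writeout rates, so a cgroup that writes
 * out 10x faster is also allowed to keep roughly 10x the dirty pages.
 */
static unsigned long memcg_dirty_setpoint(unsigned long global_setpoint,
                                          unsigned long memcg_writeout_rate,
                                          unsigned long total_writeout_rate)
{
        if (!total_writeout_rate)
                return global_setpoint;

        return (unsigned long)((unsigned long long)global_setpoint *
                               memcg_writeout_rate / total_writeout_rate);
}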

> > But be warned! Partitioning the dirty pages always means more
> > fluctuations of dirty rates (and even stalls) that's perceivable by
> > the user. Which means another limiting factor for the backpressure
> > based IO controller to scale well.
>   Sure, the smaller the memcg gets, the more noticeable these fluctuations
> would be. I would not expect memcg with 200 MB of memory to behave better
> (and also not much worse) than if I have a machine with that much memory...

It would be much worse if it's one single flusher thread round-robining
over the cgroups...

For a small machine with 200MB of memory, its IO completion events can
arrive continuously over time. However, if it's a 2000MB box divided
into 10 cgroups and the flusher is writing out dirty pages, spending
0.5s on each cgroup before moving on to the next, then for any single
cgroup the IO completion events go quiet for 9.5s out of every 10s and
only arrive during the remaining 0.5s. That makes it really hard to
control the number of dirty pages.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-20 21:33                     ` Tejun Heo
@ 2012-04-23 12:30                       ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-23 12:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Fri, Apr 20, 2012 at 02:33:01PM -0700, Tejun Heo wrote:
> On Fri, Apr 20, 2012 at 03:29:30PM -0400, Vivek Goyal wrote:
> > I am personally is not too excited about the case of putting async IO
> > in separate groups due to the reason that async IO of one group will
> > start impacting latencies of sync IO of another group and in practice
> > it might not be desirable. But there are others who have use cases for
> > separate async IO queue. So as long as switch is there to change the
> > behavior, I am not too worried.
> 
> Why not just fix cfq so that it prefers groups w/ sync IOs?

Yes, that could possibly be done, but now that's a change of requirements.
Now we are saying that I want one buffered write to go faster than another
buffered write only if there is no sync IO present in any of the groups.
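
Something along these lines is what I understand by that requirement - a
sketch only with invented structures, not the CFQ implementation:

/*
 * A group with pending sync IO is always served first; only when no
 * group has sync IO queued do the group weights arbitrate between the
 * async (buffered write) queues.
 */
struct io_group {
        int has_sync;           /* group currently has sync requests queued */
        unsigned int weight;    /* configured group weight */
};

static int pick_group(const struct io_group *grp, int nr_groups)
{
        int i, best = -1;

        for (i = 0; i < nr_groups; i++) {
                if (grp[i].has_sync)
                        return i;       /* sync IO always preferred */
                if (best < 0 || grp[i].weight > grp[best].weight)
                        best = i;       /* otherwise pick by weight */
        }
        return best;    /* -1 if there are no groups */
}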

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-23 10:24                     ` Fengguang Wu
@ 2012-04-23 12:42                       ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-23 12:42 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman,
	andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu,
	lizefan, containers, cgroups, ctalbott, rni, lsf, Mel Gorman

On Mon 23-04-12 18:24:20, Wu Fengguang wrote:
> On Mon, Apr 23, 2012 at 11:14:32AM +0200, Jan Kara wrote:
> > On Fri 20-04-12 21:34:41, Wu Fengguang wrote:
> > > > ...
> > > > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > > > > > > It's always there doing 1:1 proportional throttling. Then you try to
> > > > > > > kick in to add *double* throttling in block/cfq layer. Now the low
> > > > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > > > > > > from its balanced state, leading to large fluctuations and program
> > > > > > > stalls.
> > > > > > 
> > > > > > Just do the same 1:1 inside each cgroup.
> > > > > 
> > > > > Sure. But the ratio mismatch I'm talking about is inter-cgroup.
> > > > > For example there are only 2 dd tasks doing buffered writes in the
> > > > > system. Now consider the mismatch that cfq is dispatching their IO
> > > > > requests at 10:1 weights, while balance_dirty_pages() is throttling
> > > > > the dd tasks at 1:1 equal split because it's not aware of the cgroup
> > > > > weights.
> > > > > 
> > > > > What will happen in the end? The 1:1 ratio imposed by
> > > > > balance_dirty_pages() will take effect and the dd tasks will progress
> > > > > at the same pace. The cfq weights will be defeated because the async
> > > > > queue for the second dd (and cgroup) constantly runs empty.
> > > >   Yup. This just shows that you have to have per-cgroup dirty limits. Once
> > > > you have those, things start working again.
> > > 
> > > Right. I think Tejun was more of less aware of this.
> > > 
> > > I was rather upset by this per-memcg dirty_limit idea indeed. I never
> > > expect it to work well when used extensively. My plan was to set the
> > > default memcg dirty_limit high enough, so that it's not hit in normal.
> > > Then Tejun came and proposed to (mis-)use dirty_limit as the way to
> > > convert the dirty pages' backpressure into real dirty throttling rate.
> > > No, that's just crazy idea!
> > > 
> > > Come on, let's not over-use memcg's dirty_limit. It's there as the
> > > *last resort* to keep dirty pages under control so as to maintain
> > > interactive performance inside the cgroup. However if used extensively
> > > in the system (like dozens of memcgs all hit their dirty limits), the
> > > limit itself may stall random dirtiers and create interactive
> > > performance issues!
> > > 
> > > In the recent days I've come up with the idea of memcg.dirty_setpoint
> > > for the blkcg backpressure stuff. We can use that instead.
> > > 
> > > memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate.
> > > Imagine bdi_setpoint. It's all the same concepts. Why we need this?
> > > Because if blkcg A and B does 10:1 weights and are both doing buffered
> > > writes, their dirty pages should better be maintained around 10:1
> > > ratio to avoid underrun and hopefully achieve better IO size.
> > > memcg.dirty_limit cannot guarantee that goal.
> >   I agree that to avoid stalls of throttled processes we shouldn't be
> > hitting memcg.dirty_limit on a regular basis. When I wrote we need "per
> > cgroup dirty limits" I actually imagined something like you write above -
> > do complete throttling computations within each memcg - estimate throughput
> > available for it, compute appropriate dirty rates for it's processes and
> > from its dirty limit estimate appropriate setpoint to balance around.
> > 
> 
> Yes. balance_dirty_pages() will need both dirty pages and dirty page
> writeout rate for the cgroup to do proper dirty throttling for it.
> 
> > > But be warned! Partitioning the dirty pages always means more
> > > fluctuations of dirty rates (and even stalls) that's perceivable by
> > > the user. Which means another limiting factor for the backpressure
> > > based IO controller to scale well.
> >   Sure, the smaller the memcg gets, the more noticeable these fluctuations
> > would be. I would not expect memcg with 200 MB of memory to behave better
> > (and also not much worse) than if I have a machine with that much memory...
> 
> It would be much worse if it's one single flusher thread round robin
> over the cgroups...
> 
> For a small machine with 200MB memory, its IO completion events can
> arrive continuously over time. However if its a 2000MB box divided
> into 10 cgroups and the flusher is writing out dirty pages, spending
> 0.5s on each cgroup and then go on to the next, then for any single
> cgroup, its IO completion events go quiet for every 9.5s and goes up
> on the other 0.5s. It becomes really hard to control the number of
> dirty pages.
  Umm, but the flusher does not spend 0.5s on each cgroup. It submits
0.5s worth of IO for each cgroup. Since the throughput computed for each
cgroup will be scaled down accordingly (and thus write_chunk will be
scaled down as well), it should end up submitting 0.5s worth of IO for
the whole system after it traverses all the cgroups, shouldn't it?
Effectively we will work with a smaller write_chunk, which will lead to
lower total throughput - that's the price of partitioning and of the
higher fairness requirements (previously the requirement was to switch
to a new inode every 0.5s, now the requirement is to switch to a new
inode in each cgroup every 0.5s). In the end, we may end up increasing
the write_chunk by some factor like \sqrt(number of memcgs) to get some
middle ground between guaranteed small latency and reasonable total
throughput, but before I'd go for such hacks, I'd wait to see real
numbers - e.g. paying 10% of total throughput for partitioning the
machine into 10 IO-intensive cgroups (as in your tests with dd's) would
be a reasonable cost in my opinion.
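
In sketch form (invented names, not the real writeback code), the chunk
scaling I mean would be something like:

#include <math.h>

/*
 * Sketch only: scale the writeback chunk by the cgroup's share of the
 * total writeout bandwidth, so one pass over all cgroups still submits
 * about one global chunk worth of IO.  The sqrt() variant is the
 * possible middle ground mentioned above: larger chunks, less strict
 * fairness.
 */
static unsigned long cgroup_write_chunk(unsigned long global_chunk,
                                        unsigned long cgroup_bw,
                                        unsigned long total_bw,
                                        unsigned int nr_memcgs,
                                        int sqrt_compromise)
{
        unsigned long chunk;

        if (!total_bw || !nr_memcgs)
                return global_chunk;

        chunk = (unsigned long)((unsigned long long)global_chunk *
                                cgroup_bw / total_bw);
        if (sqrt_compromise)
                chunk = (unsigned long)(chunk * sqrt(nr_memcgs));
        return chunk;
}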

Also, the granularity of IO completions should depend more on the
granularity of the IO scheduler (CFQ) than on the granularity of the
flusher thread as such, so I wouldn't expect that to be a problem.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-23 12:42                       ` Jan Kara
@ 2012-04-23 14:31                         ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-23 14:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf, Mel Gorman

On Mon, Apr 23, 2012 at 02:42:40PM +0200, Jan Kara wrote:
> On Mon 23-04-12 18:24:20, Wu Fengguang wrote:
> > On Mon, Apr 23, 2012 at 11:14:32AM +0200, Jan Kara wrote:
> > > On Fri 20-04-12 21:34:41, Wu Fengguang wrote:

> > > > But be warned! Partitioning the dirty pages always means more
> > > > fluctuations of dirty rates (and even stalls) that's perceivable by
> > > > the user. Which means another limiting factor for the backpressure
> > > > based IO controller to scale well.
> > >   Sure, the smaller the memcg gets, the more noticeable these fluctuations
> > > would be. I would not expect memcg with 200 MB of memory to behave better
> > > (and also not much worse) than if I have a machine with that much memory...
> > 
> > It would be much worse if it's one single flusher thread round robin
> > over the cgroups...
> > 
> > For a small machine with 200MB memory, its IO completion events can
> > arrive continuously over time. However if its a 2000MB box divided
> > into 10 cgroups and the flusher is writing out dirty pages, spending
> > 0.5s on each cgroup and then go on to the next, then for any single
> > cgroup, its IO completion events go quiet for every 9.5s and goes up
> > on the other 0.5s. It becomes really hard to control the number of
> > dirty pages.
>   Umm, but flusher does not spend 0.5s on each cgroup. It submits 0.5s
> worth of IO for each cgroup.

Right.

> Since the throughput computed for each cgroup
> will be scaled down accordingly (and thus write_chunk will be scaled down
> as well), it should end up submitting 0.5s worth of IO for the whole system
> after it traverses all the cgroups, shouldn't it? Effectively we will work
> with smaller write_chunk which will lead to lower total throughput - that's
> the price of partitioning and higher fairness requirements (previously the

Sure, you can do that. However, I think we were talking about memcg
dirty limits, in which case we still have a good chance of keeping the
0.5s-per-inode granularity by making the dirty limits high enough that
they won't normally be hit. Only when there are so many memory cgroups
that the flusher cannot easily safeguard fairness among them might we
consider decreasing the writeback chunk size.

> requirement was to switch to a new inode every 0.5s, now the requirement is
> to switch to a new inode in each cgroup every 0.5s). In the end, we may end
> up increasing the write_chunk by some factor like \sqrt(number of memcgs)
> to get some middle ground between the guaranteed small latency and
> reasonable total throughput but before I'd go for such hacks, I'd wait to
> see real numbers - e.g. paying 10% of total throughput for partitioning the
> machine into 10 IO intensive cgroups (as in your tests with dd's) would be
> a reasonable cost in my opinion.

For IO cgroups, I'd always prefer to avoid partitioning the dirty pages
and the async IO queue, so as to avoid such awkward tradeoffs in the
first place :-)

> Also the granularity of IO completions should depend more on the
> granularity of IO scheduler (CFQ) rather than the granularity of flusher
> thread as such so I wouldn't think that would be a problem.

By avoiding the partitions, we cancel the fairness problem, so the
coarse granularity of the flusher won't be a problem for IO cgroups at
all.  balance_dirty_pages() will do proper throttling when dirty pages
are created, based directly on the blkcg weights and the ongoing IO.
After that, all async IOs go as a single stream from the flusher to the
storage. There is no need for page tracking, no split inode lists, and
hence no granularity or shared-inode issues for the flusher. Above all,
there will be no degradation of performance at all, whether in
throughput, latency or responsiveness.
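
As a sketch of what I mean by throttling directly on the blkcg weights
at page-dirtying time (invented names again, not an existing interface):

/*
 * Sketch only: the bdi's measured write bandwidth is split among the
 * cgroups that are actively dirtying, in proportion to their blkcg
 * weights, and each dirtying task is paced at its cgroup's share.
 * No per-cgroup async queues and no dirty page tracking needed.
 */
static unsigned long blkcg_task_ratelimit(unsigned long bdi_write_bw,
                                          unsigned int cgroup_weight,
                                          unsigned int total_active_weight,
                                          unsigned int nr_dirtiers_in_cgroup)
{
        unsigned long cgroup_bw;

        if (!total_active_weight || !nr_dirtiers_in_cgroup)
                return bdi_write_bw;

        cgroup_bw = (unsigned long)((unsigned long long)bdi_write_bw *
                                    cgroup_weight / total_active_weight);
        return cgroup_bw / nr_dirtiers_in_cgroup;       /* pages/s per task */
}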

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-23 12:30                       ` Vivek Goyal
@ 2012-04-23 16:04                         ` Tejun Heo
  -1 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-23 16:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, Jan Kara, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

Hello, Vivek.

On Mon, Apr 23, 2012 at 08:30:11AM -0400, Vivek Goyal wrote:
> On Fri, Apr 20, 2012 at 02:33:01PM -0700, Tejun Heo wrote:
> > On Fri, Apr 20, 2012 at 03:29:30PM -0400, Vivek Goyal wrote:
> > > I am personally not too excited about the case of putting async IO
> > > in separate groups, because the async IO of one group will start
> > > impacting the latencies of the sync IO of another group, and in
> > > practice that might not be desirable. But there are others who have
> > > use cases for a separate async IO queue. So as long as a switch is
> > > there to change the behavior, I am not too worried.
> > 
> > Why not just fix cfq so that it prefers groups w/ sync IOs?
> 
> Yes, that could possibly be done, but now that's a change of requirements.
> Now we are saying that I want one buffered write to go faster than another
> buffered write only if there is no sync IO present in any of the groups.

It's a scheduling decision and the resource split may or may not be
about latency (the faster part).  We're currently just shoving all
asyncs into the root group and preferring sync IOs in general.  The
other end would be keeping them completely siloed and not caring about
[a]sync across different cgroups.  My point is that managing async IOs
per cgroup doesn't mean we can't prioritize sync IOs in general.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-23 16:56                       ` Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-23 16:56 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf, Mel Gorman

Hello, Fengguang.

On Sun, Apr 22, 2012 at 10:46:49PM +0800, Fengguang Wu wrote:
> OK. Sorry I should have explained why memcg dirty limit is not the
> right tool for back pressure based throttling.

I have two questions.  Why do we need memcg for this?  Writeback
currently works without memcg, right?  Why does that change with blkcg
aware bdi?

> Basically, the more memcgs with dirty limits, the harder it is for
> the flusher to serve them fairly and knock down their dirty pages in
> time. Because the flusher works inode by inode, each one may take up
> to 0.5 second, and there may be many memcgs asking for the flusher's
> attention. Also, the more memcgs there are, the more the global pool of
> dirty pages is partitioned into smaller pieces, which means a smaller
> safety margin for each memcg. Adding these two effects up, there may
> constantly be some memcgs hitting their dirty limits when there are
> dozens of memcgs.

And how is this different from a machine with smaller memory?  If so,
why?

> Such cross subsystem coordinations still look natural to me because
> "weight" is a fundamental and general parameter. It's really a blkcg
> thing (determined by the blkio.weight user interface) rather than
> specifically tied to cfq. When another kernel entity (eg. NFS or noop)
> decides to add support for proportional weight IO control in future,
> it can make use of the weights calculated by balance_dirty_pages(), too.

It is not fundamental and natural at all and is already made cfq
specific in the devel branch.  You seem to think "weight" is somehow a
global concept which everyone can agree on but it is not.  Weight of
what?  Is it disktime, bandwidth, iops or something else?  cfq deals
primarily with disktime because that makes sense for spinning drives
with single head.  For SSDs with smart enough FTLs, the unit should
probably be iops.  For storage technology bottlenecked on bus speed,
bw would make sense.
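
To illustrate why the unit matters, take two cgroups with equal
weights, one sequential and one seeky. If the weight is interpreted as
disk time, the bandwidth outcome is wildly asymmetric; the throughput
numbers in this little sketch are invented for illustration only.

/* Same 1:1 weight, very different bandwidth once the weight is read as
 * disk time: the seeky cgroup turns its half of the disk time into far
 * fewer megabytes.  Throughput numbers are invented. */
#include <stdio.h>

int main(void)
{
        double seq_bw = 120.0;          /* MB/s while doing sequential IO */
        double seeky_bw = 5.0;          /* MB/s while seeking all over */
        double time_share = 0.5;        /* equal disktime weights */

        printf("sequential cgroup: %.1f MB/s\n", seq_bw * time_share);
        printf("seeky cgroup:      %.1f MB/s\n", seeky_bw * time_share);
        /* 60.0 vs 2.5 MB/s: the same weight means something different
         * depending on whether the unit is disktime, iops or bandwidth */
        return 0;
}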

IIUC, writeback is primarily dealing with an abstracted bandwidth which
is applied per-inode, which is fine at that layer as details like
block allocation aren't and shouldn't be visible there and files (or
inodes) are the level of abstraction.

However, this doesn't necessarily translate easily into the actual
underlying IO resource.  For devices with spindles, seek time dominates
and the same amount of IO may consume vastly different amounts of disk
time, so disk time becomes the primary resource, not iops or
bandwidth.  Naturally, people want to allocate and limit the primary
resource, so cfq distributes disk time across different cgroups as
configured.

Your suggested solution is applying the same number - the weight -
to one portion of a mostly arbitrarily split resource using a
different unit.  I don't even understand what that achieves.

The requirement is to be able to split the IO resource across cgroups
in a configurable way and enforce the limits established by the
configuration, which we're currently failing to do for async IOs.
Your proposed solution applies some arbitrary ratio according to some
arbitrary interpretation of cfq IO time weight way up in the stack.
When propagated to the lower layer, that would cause a significant
amount of delay and fluctuation which behaves completely independently
of how (using what unit, in what granularity and in what time scale)
the actual IO resource is handled, split and accounted, and would
result in something which probably has some semblance of interpreting
blkcg.weight as a vague best-effort priority at its luckiest moments.

So, I don't think your suggested solution is a solution at all.  I'm
in fact not even sure what it achieves at the cost of the gross
layering violation and fundamental design braindamage.

>         - No more latency
>         - No performance drop
>         - No bumpy progress and stalls
>         - No need to attach memcg to blkcg
>         - Feel free to create 1000+ IO controllers, to heart's content
>           w/o worrying about costs (if any, it would be some existing
>           scalability issues)

I'm not sure why memcg suddenly becomes necessary with blkcg and I
don't think having per-blkcg writeback and reasonable async
optimization from iosched would be considerably worse.  It sure will
add some overhead (e.g. from split buffering) but there will be proper
working isolation which is what this fuss is all about.  Also, I just
don't see how creating 1000+ (relatively active, I presume) blkcgs on
a single spindle would be sane and how the end result is gonna be
significantly better for your suggested solution, so let's please put
aside the silly non-use case.

In terms of overhead, I suspect the biggest would be the increased
buffering coming from split channels but that seems like the cost of
business to me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-23 16:56                       ` Tejun Heo
@ 2012-04-24  7:58                         ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-24  7:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, vgoyal, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf, Mel Gorman

Hi Tejun,

On Mon, Apr 23, 2012 at 09:56:26AM -0700, Tejun Heo wrote:
> Hello, Fengguang.
>
> On Sun, Apr 22, 2012 at 10:46:49PM +0800, Fengguang Wu wrote:
> > OK. Sorry I should have explained why memcg dirty limit is not the
> > right tool for back pressure based throttling.
>
> I have two questions.  Why do we need memcg for this?  Writeback
> currently works without memcg, right?  Why does that change with blkcg
> aware bdi?

Yeah, currently writeback does not depend on memcg. As for blkcg, it's
necessary to keep a number of dirty pages for each blkcg, so that the
cfq groups' async IO queues do not go empty and lose their turn to do
IO. memcg provides the proper infrastructure for accounting dirty pages.

In a previous email, we had an example of two cgroups with 10:1 weights,
each running one dd. They will form two IO pipes, each holding a number
of dirty pages. Since cfq grants dd-1 much more IO bandwidth, dd-1's
dirty pages are consumed quickly. However balance_dirty_pages(),
without knowing about cfq's bandwidth divisions, throttles the
two dd tasks equally. So dd-1 will be producing dirty pages much
slower than cfq is consuming them. The flusher thus won't send enough
dirty pages down to fill the corresponding async IO queue for dd-1.
cfq cannot really give dd-1 more bandwidth share due to the lack of
data to feed it. The end result will be: the two cgroups get the 1:1
bandwidth share imposed by balance_dirty_pages() even though cfq
assigns them 10:1 weights.

1:1 balance_dirty_pages() bandwidth split

  [          dd-1              |           dd-2             ]
  |                             \                           |
  |                              \**************************|
  |                               \*************************|
  |                                \************************|
  |                                 \***********************|
  |                                  \**********************|
  |                                   \*********************|
  |                                    \********************|
  |                                     \*******************|
  |                                      \******************|
  |                                       \*****************|
  |                                        \****************|
  |                                         \***************|
  |                                          \**************|
  |                                           \*************|
  |                                            \************|
  |                                             \***********|
  |                                              \**********|
  |                                               \*********|
  |                                                \********|
  |                                                 \*******|
  |                                                  \******|
  |************************   (constantly underrun)   \*****|

10:1 cfq bandwidth split                      [*] dirty pages

Ideally is,

  [                      dd-1                         | dd-2]
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |                                                   |     |
  |***************************************************|*****|
  |***************************************************|*****|
  |***************************************************|*****|
  |***************************************************|*****|
  |***************************************************|*****|
  |***************************************************|*****|
  |***************************************************|*****|
  |***************************************************|*****|
  |***************************************************|*****|
  |***************************************************|*****|

Or better, one single pipe :)

  [                      dd-1                         | dd-2]
  |                                                         |
  |                                                         |
  |                                                         |
  |                                                         |
  |                                                         |
  |                                                         |
  |                                                         |
  |                                                         |
  |                                                         |
  |                                                         |
  |                                                         |
  |                                                         |
  |                                                         |
  |                                                         |
  |*********************************************************|
  |*********************************************************|
  |*********************************************************|
  |*********************************************************|
  |*********************************************************|
  |*********************************************************|
  |*********************************************************|
  |*********************************************************|
  |*********************************************************|
  |*********************************************************|
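
Putting rough numbers on the first diagram (a toy model, not kernel
code; the 110MB/s disk and the simple spill-over rule are my
assumptions):

/* Toy numbers for the 10:1 example above (not kernel code): cfq offers
 * dd-1 ten times the bandwidth, but balance_dirty_pages(), unaware of
 * the weights, lets both dd's dirty pages at the same rate, so dd-1's
 * async queue underruns and the effective split collapses to 1:1. */
#include <stdio.h>

int main(void)
{
        double disk_bw = 110.0;                 /* MB/s total (assumed) */
        double w1 = 1000.0, w2 = 100.0;         /* blkio.weight of dd-1, dd-2 */
        double offered1 = disk_bw * w1 / (w1 + w2);     /* 100 MB/s for dd-1 */
        double dirty_rate = disk_bw / 2.0;      /* each dd may dirty 55 MB/s */

        /* cfq can only write back what dd-1 managed to dirty ... */
        double served1 = dirty_rate < offered1 ? dirty_rate : offered1;
        /* ... and the unused share of the disk spills over to dd-2 */
        double served2 = disk_bw - served1;

        if (served2 > dirty_rate)
                served2 = dirty_rate;

        printf("cfq offers  %.0f : %.0f MB/s\n", offered1, disk_bw - offered1);
        printf("actual      %.0f : %.0f MB/s\n", served1, served2);
        return 0;
}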

> > Basically, the more memcgs with dirty limits, the harder it is for
> > the flusher to serve them fairly and knock down their dirty pages in
> > time. Because the flusher works inode by inode, each one may take up
> > to 0.5 second, and there may be many memcgs asking for the flusher's
> > attention. Also, the more memcgs there are, the more the global pool of
> > dirty pages is partitioned into smaller pieces, which means a smaller
> > safety margin for each memcg. Adding these two effects up, there may
> > constantly be some memcgs hitting their dirty limits when there are
> > dozens of memcgs.
>
> And how is this different from a machine with smaller memory?  If so,
> why?

In a small memory box, dd and flusher produce/consume dirty pages
continuously, so that over time the number of dirty pages can remain
roughly stable.

  ^ dirty pages
  |
  +dirty limit
  |                    | dd continously dirtying pages
  |dirty setpoint      v
  +*******************************************************************
  |                    |
  |                    v flusher continously clean pages
  |
  +-------------------------------------------------------------------->
                                                                    time

However, if it's a large memory machine whose dirty pages get
partitioned into 100 cgroups, the flusher will be serving them
in round-robin fashion. For a particular cgroup, the flusher
only comes and consumes its dirty pages once every (100*flusher_slice)
seconds. The interval would be 50s for the current 0.5s flusher slice,
or 5s if the flusher slice were lowered to 50ms.

I'm not sure whether it's practical to decrease the flusher slice for
ext4, which, for the sake of write performance and to avoid fragmentation,
increases the write chunk size to 128MB internally.

For a number of reasons, the flusher's behavior cannot be exactly
controlled. The intervals at which the flusher comes to each cgroup go
up and down, so fairness can only be coarsely assured.

The dirty pages for each cgroup will be going up and down irregularly
across very large dynamic ranges. Now you should be able to imagine
the challenge of avoiding the dirty limit, balancing the dirty
pages around the per-cgroup-per-bdi dirty setpoints, and avoiding
underruns. When there are 10 cgroups and 12 bdi's, the number of dirty
setpoints can explode to 10*12 = 120.

  ^ dirty pages
  |
  |                                       dd continously dirtying pages
  + dirty limit        dd stalled
  |                      ******             |
  |                    *      *             |           *
  |                  *        *             |         *  *
  |                *          *             |       *    *
  | dirty        *            *             |     *      *
  + setpoint   *              *             |   *        *
  |          *                *             v *          *
  |        *                   *            *            *
  |      *                     *          *              *
  |    *                       *        *                *      *
  |  *                         *      *                   *   *
  |*                           *    *                     * *
  |                            *  *                       *
  |                             *
  |                           ^^ the flusher comes around to this cgroup
  |
  +-------------------------------------------------------------------->
                                                                    time

> > Such cross subsystem coordinations still look natural to me because
> > "weight" is a fundamental and general parameter. It's really a blkcg
> > thing (determined by the blkio.weight user interface) rather than
> > specifically tied to cfq. When another kernel entity (eg. NFS or noop)
> > decides to add support for proportional weight IO control in future,
> > it can make use of the weights calculated by balance_dirty_pages(), too.
>
> It is not fundamental and natural at all and is already made cfq
> specific in the devel branch.  You seem to think "weight" is somehow a
> global concept which everyone can agree on but it is not.  Weight of
> what?  Is it disktime, bandwidth, iops or something else?  cfq deals
> primarily with disktime because that makes sense for spinning drives
> with single head.  For SSDs with smart enough FTLs, the unit should
> probably be iops.  For storage technology bottlenecked on bus speed,
> bw would make sense.

"Weight" is sure a global concept that reflects the "importance"
deemed by the user for that cgroup. cfq (or NFS, whatever on the horizon)
then interprets the importance number as disk time, IOPS, bandwidth,
whatever semantic that best fits the backing storage and workload.

blkio.weight will be the "number" shared and interpreted by all IO
controller entities, whether it be cfq, NFS or balance_dirty_pages().
And I can assure you that balance_dirty_pages() will be interpreting
it the _same_ way the underlying cfq/NFS interprets it, via the feedback
scheme described below.

> IIUC, writeback is primarily dealing with an abstracted bandwidth which
> is applied per-inode, which is fine at that layer as details like
> block allocation aren't and shouldn't be visible there and files (or
> inodes) are the level of abstraction.
>
> However, this doesn't necessarily translate easily into the actual
> underlying IO resource.  For devices with spindles, seek time dominates
> and the same amount of IO may consume vastly different amounts of disk
> time, so disk time becomes the primary resource, not iops or
> bandwidth.  Naturally, people want to allocate and limit the primary
> resource, so cfq distributes disk time across different cgroups as
> configured.

Right. balance_dirty_pages() is always doing dirty throttling wrt.
bandwidth, even in your back pressure scheme, isn't it? In this regard,
there is nothing fundamentally different between our proposals. They
will both employ some way to convert cfq's disk time or IOPS
notion into balance_dirty_pages()'s bandwidth notion. See below for my
way of doing the conversion.

> Your suggested solution is applying the same number - the weight -
> to one portion of a mostly arbitrarily split resource using a
> different unit.  I don't even understand what that achieves.

You seem to have missed my stated plan: as the next step,
balance_dirty_pages() will get feedback information from cfq and adjust
its bandwidth targets accordingly. That information will be

        io_cost = charge / sectors

The charge value is exactly the value computed in cfq_group_served(),
which is either the slice time or the number of IOs dispatched,
depending on the mode cfq is operating in. By dividing the ratelimit by
the normalized io_cost, balance_dirty_pages() will automatically arrive
at the same weight interpretation as cfq. For example, on spinning
disks it will allocate lower bandwidth to seeky cgroups because of the
larger io_cost reported by cfq.
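
To make that concrete, here is a minimal sketch of the feedback step,
purely for illustration: the struct and function names below are made
up, not existing kernel interfaces.

/*
 * Hypothetical sketch (not actual kernel code): scale a cgroup's dirty
 * ratelimit by the io_cost fed back from cfq, so balance_dirty_pages()
 * ends up interpreting blkio.weight the same way cfq does.
 */
struct blkcg_io_feedback {
        unsigned long charge;   /* disk time or IOs, as charged in cfq_group_served() */
        unsigned long sectors;  /* sectors actually transferred for this cgroup */
};

static unsigned long blkcg_scale_ratelimit(unsigned long base_ratelimit,
                                           const struct blkcg_io_feedback *fb)
{
        /* io_cost = charge / sectors, guarded against division by zero */
        unsigned long io_cost = fb->sectors ? fb->charge / fb->sectors : 1;

        if (!io_cost)
                io_cost = 1;

        /* a seekier cgroup reports a larger io_cost and gets a lower target */
        return base_ratelimit / io_cost;
}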

> The requirement is to be able to split IO resource according to
> cgroups in configurable way and enforce the limits established by the
> configuration, which we're currently failing to do for async IOs.
> Your proposed solution applies some arbitrary ratio according to some
> arbitrary interpretation of cfq IO time weight way up in the stack
> which, when propagated to the lower layer, would cause significant
> amount of delay and fluctuation which behaves completely independent
> from how (using what unit, in what granularity and in what time scale)
> actual IO resource is handled, split and accounted, which would result
> in something which probably has some semblance of interpreting
> blkcg.weight as vague best-effort priority at its luckiest moments.

Interestingly, our proposals are once again on the same plane regarding
the delays and fluctuations. Due to the long delay between dirtying
time and writeout time, the access pattern of the newly generated dirty
pages and the access pattern of the pages currently under writeback may
have diverged. So even if cfq throttles the stream in proportion to its
IO cost, the user on the other side of the pipe (behind that long
delay) may still see the odd behavior of lower throughput for
sequential writes and higher throughput for random writes. Let's accept
the fact: it's an inherent property of buffered writes. What we can do
is aim for _long term_ rate matching.
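
As an illustration of what "long term" means here, one way (an
assumption of mine, not something from any posted patch) is to track
the completed writeout bandwidth with a slow-moving average and match
against that:

/*
 * Hypothetical sketch: follow the long-term writeout bandwidth with a
 * slow exponential moving average, so the short-term mismatch between
 * dirtying-time and writeout-time access patterns is averaged out.
 */
#define BW_AVG_SHIFT 6  /* roughly a 64-sample averaging period */

static unsigned long update_long_term_bw(unsigned long avg_bw,
                                         unsigned long sample_bw)
{
        /* avg += (sample - avg) / 2^BW_AVG_SHIFT, in integer arithmetic */
        return avg_bw - (avg_bw >> BW_AVG_SHIFT) + (sample_bw >> BW_AVG_SHIFT);
}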

> So, I don't think your suggested solution is a solution at all.  I'm
> in fact not even sure what it achieves at the cost of the gross
> layering violation and fundamental design braindamage.

It doesn't make anything perform better (or worse). In the face of this
challenging problem, both proposals suck. Mine just sucks less, as the
listing below shows.

> >         - No more latency
> >         - No performance drop
> >         - No bumpy progress and stalls
> >         - No need to attach memcg to blkcg
> >         - Feel free to create 1000+ IO controllers, to heart's content
> >           w/o worrying about costs (if any, it would be some existing
> >           scalability issues)
>
> I'm not sure why memcg suddenly becomes necessary with blkcg and I
> don't think having per-blkcg writeback and reasonable async
> optimization from iosched would be considerably worse.  It sure will
> add some overhead (e.g. from split buffering) but there will be proper
> working isolation which is what this fuss is all about.  Also, I just
> don't see how creating 1000+ (relatively active, I presume) blkcgs on
> a single spindle would be sane and how is the end result gonna be
> significantly better for your suggested solution, so let's please put
> aside the silly non-use case.

There are big disk arrays with lots of spindles inside, and arrays of
fast SSDs. People may want to create lots of cgroups on them. IO
controllers should be cheap and scalable enough to meet the demands of
our diverse user base, now and in the future.

> In terms of overhead, I suspect the biggest would be the increased
> buffering coming from split channels but that seems like the cost of
> business to me.

I know that the back pressure idea actually goes back a long way
(several years?) and that it has more or less become common agreement
that isolation will incur unavoidable costs. So I can understand why
you keep dismissing all the overhead, cost and scalability issues:
there seemed to be no other way out. However, here is a solution that
can avoid the partitioning and all the problems that result from it,
and still provide the isolation.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
       [not found]                                 ` <20120416145744.GA15437-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-24 11:33                                   ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-24 11:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Hi Vivek,

On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote:
> On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote:
> 
> [..]
> > Yeah the backpressure idea would work nicely with all possible
> > intermediate stacking between the bdi and leaf devices. In my attempt
> > to do combined IO bandwidth control for
> > 
> > - buffered writes, in balance_dirty_pages()
> > - direct IO, in the cfq IO scheduler
> > 
> > I have to look into the cfq code in the past days to get an idea how
> > the two throttling layers can cooperate (and suffer from the pains
> > arise from the violations of layers). It's also rather tricky to get
> > two previously independent throttling mechanisms to work seamlessly
> > with each other for providing the desired _unified_ user interface. It
> > took a lot of reasoning and experiments to work the basic scheme out...
> > 
> > But here is the first result. The attached graph shows progress of 4
> > tasks:
> > - cgroup A: 1 direct dd + 1 buffered dd
> > - cgroup B: 1 direct dd + 1 buffered dd
> > 
> > The 4 tasks are mostly progressing at the same pace. The top 2
> > smoother lines are for the buffered dirtiers. The bottom 2 lines are
> > for the direct writers. As you may notice, the two direct writers are
> > somehow stalled for 1-2 times, which increases the gaps between the
> > lines. Otherwise, the algorithm is working as expected to distribute
> > the bandwidth to each task.
> > 
> > The current code's target is to satisfy the more realistic user demand
> > of distributing bandwidth equally to each cgroup, and inside each
> > cgroup, distribute bandwidth equally to buffered/direct writes. On top
> > of which, weights can be specified to change the default distribution.
> > 
> > The implementation involves adding "weight for direct IO" to the cfq
> > groups and "weight for buffered writes" to the root cgroup. Note that
> > current cfq proportional IO conroller does not offer explicit control
> > over the direct:buffered ratio.
> > 
> > When there are both direct/buffered writers in the cgroup,
> > balance_dirty_pages() will kick in and adjust the weights for cfq to
> > execute. Note that cfq will continue to send all flusher IOs to the
> > root cgroup.  balance_dirty_pages() will compute the overall async
> > weight for it so that in the above test case, the computed weights
> > will be
> 
> I think having separate weigths for sync IO groups and async IO is not
> very appealing. There should be one notion of group weight and bandwidth
> distrubuted among groups according to their weight.

There has to be some scheme, either explicit or implicit. Maybe you
have in mind some "equal split among queues" policy? For example, if
the cgroup has 9 active sync queues and 1 async queue, split the weight
equally among the 10 queues, so that the sync IOs get a 90% share and
the async writes get a 10% share.

For dirty throttling w/o cgroup awareness, balance_dirty_pages()
splits the writeout bandwidth equally among all dirtier tasks. Since
cfq works with queues, it seems most natural for it to do an equal
split among all queues (inside the cgroup).

I'm not sure whether, when there are N dd tasks doing direct IO, cfq
will continuously run N sync queues for them (without many dynamic
queue deletions and recreations). If that is the case, it should be
trivial to support a queue-based fair split in the global async queue
scheme; see the sketch below. Otherwise I'll have some trouble
detecting the N value when trying to do the N:1 sync:async weight
split.
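
For illustration only, assuming the N value can be detected, the
per-cgroup split described above could look roughly like this (the
names are hypothetical, not real cfq fields):

/*
 * Hypothetical sketch of the N:1 sync:async split: a cgroup with
 * weight W and N active sync queues gives each sync queue W/(N+1)
 * and contributes the remaining ~W/(N+1) to the global async weight.
 */
struct cgroup_weight_split {
        unsigned int sync_weight_each;  /* weight of each sync (direct IO) queue */
        unsigned int async_weight;      /* this cgroup's contribution to the flusher */
};

static struct cgroup_weight_split split_cgroup_weight(unsigned int weight,
                                                      unsigned int nr_sync_queues)
{
        struct cgroup_weight_split s;
        unsigned int nr_shares = nr_sync_queues + 1;    /* N sync queues + 1 async queue */

        s.sync_weight_each = weight / nr_shares;
        s.async_weight = weight - s.sync_weight_each * nr_sync_queues;
        return s;
}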

> Now one can argue that with-in a group, there might be one knob in CFQ
> which allows to change the share or sync/async IO.

Yeah. I suspect typical users don't care about the split policy or
fairness inside the cgroup; otherwise there may be complaints about any
existing policy: "I want it split this way", "I want it that way"... ;-)

Anyway I'm not sure about the possible use cases...

> Also Tejun and Jan have expressed the desire that once we have figured
> out a way to communicate the submitter's context for async IO, we would
> like to account that IO in associated cgroup instead of root cgroup (as
> we do today).

Understood. Accounting should always be attributed to the corresponding
cgroup. I'll also need this in order to send feedback information to
the async IO submitters' cgroups.

> > - 1000 async weight for the root cgroup (2 buffered dds)
> > - 500 dio weight for cgroup A (1 direct dd)
> > - 500 dio weight for cgroup B (1 direct dd)
> > 
> > The second graph shows result for another test case:
> > - cgroup A, weight 300: 1 buffered cp
> > - cgroup B, weight 600: 1 buffered dd + 1 direct dd
> > - cgroup C, weight 300: 1 direct dd
> > which is also working as expected.
> > 
> > Once the cfq properly grants total async IO share to the flusher,
> > balance_dirty_pages() will then do its original job of distributing
> > the buffered write bandwidth among the buffered dd tasks.
> > 
> > It will have to assume that the devices under the same bdi are
> > "symmetry". It also needs further stats feedback on IOPS or disk time
> > in order to do IOPS/time based IO distribution. Anyway it would be
> > interesting to see how far this scheme can go. I'll cleanup the code
> > and post it soon.
> 
> Your proposal relies on few things.
> 
> - Bandwidth needs to be divided eually among sync and async IO.

Yeah, balance_dirty_pages() always works in terms of bandwidth. The
plan is that once we get feedback on each stream's bandwidth:disk_time
(or IOPS) ratio, the bandwidth targets can be adjusted to achieve a
disk-time- or IOPS-based fair share among the buffered dirtiers.

As for the sync:async split, it operates on cfqg->weight, so it is
automatically disk time based.

Look at this graph: the 4 dd tasks are granted the same weight (2 of
them are buffered writers). I guess the 2 buffered dd tasks managed to
progress much faster than the 2 direct dd tasks simply because the
async IOs are much more efficient than the bs=64k direct IOs.

https://github.com/fengguang/io-controller-tests/raw/master/log/bay/xfs/mixed-write-2.2012-04-19-10-42/balance_dirty_pages-task-bw.png

> - Flusher thread async IO will always to go to root cgroup.

Right. This is actually my main target: to avoid splitting up the
async streams throughout the IO path, for the sake of performance.

> - I am not sure how this scheme is going to work when we introduce
>   hierarchical blkio cgroups.

I think it's still viable. balance_dirty_pages() works by estimating
the N (number of dd tasks) value and splitting the writeout bandwidth
equally among the tasks:

        task_ratelimit = write_bandwidth / N

It becomes a proportional weight IO controller if we change the formula
to
        task_ratelimit = weight * write_bandwidth / N_w

Here lies the beauty of the bdi_update_dirty_ratelimit() algorithm:
it can automatically adapt N to the proper "weighted" N_w to keep
things in balance, whatever weights are applied to each task.

If we further use

        blkcg_ratelimit = weight * write_bandwidth / N_w
        task_ratelimit  = weight * blkcg_ratelimit / M_w

It's turned into a cgroup IO controller.

This change further makes it a hierarchical IO controller:

        blkcg_ratelimit = weight * parent_blkcg_ratelimit / M_w

We'll also need to hierarchically decompose the async weights from the
inner cgroup levels to the outer levels, and finally add them to the
root cgroup that holds the async queue. This looks feasible, too.
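
To illustrate the chain of formulas above, a minimal sketch of how the
ratelimits could be composed down the hierarchy follows. This is only
my reading of the scheme, with made-up names, not code from any posted
patch:

/*
 * Hypothetical sketch of the hierarchical composition:
 *   top level:   blkcg_ratelimit = weight * write_bandwidth / N_w
 *   inner level: blkcg_ratelimit = weight * parent_blkcg_ratelimit / M_w
 *   task:        task_ratelimit  = weight * blkcg_ratelimit / M_w
 * where N_w/M_w are the "weighted" member counts that
 * bdi_update_dirty_ratelimit() adapts automatically.
 */
struct blkcg_node {
        struct blkcg_node *parent;      /* NULL for the root blkcg */
        unsigned long weight;           /* blkio.weight of this cgroup */
        unsigned long weighted_members; /* N_w or M_w at this level */
};

/* ratelimit granted to one member (child cgroup or task) of @cg */
static unsigned long member_ratelimit(const struct blkcg_node *cg,
                                      unsigned long member_weight,
                                      unsigned long write_bandwidth)
{
        unsigned long members = cg->weighted_members ? cg->weighted_members : 1;
        unsigned long cg_limit;

        if (!cg->parent)        /* root: split the bdi write bandwidth */
                cg_limit = write_bandwidth;
        else                    /* inner: split the parent's ratelimit */
                cg_limit = member_ratelimit(cg->parent, cg->weight,
                                            write_bandwidth);

        return member_weight * cg_limit / members;
}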

> - cgroup weights for sync IO seems to be being controlled by user and
>   somehow root cgroup weight seems to be controlled by this async IO
>   logic silently.

In the current state I do assume there are no IO tasks in the root
cgroup except for the flusher. In general, however, the root cgroup can
be treated the same as the other cgroups: its weight can also be split
into a dio_weight and an async weight.

The general idea is (see the sketch after this list):
- cfqg->weight is given by the user.
- cfqg->dio_weight is used for the sync slices in the vdisktime
  calculation.
- total_async_weight collects the async IO weights from each cgroup,
  including the root cgroup. They are the "credits" the flusher spends
  when doing the async IOs on behalf of all the cgroups.
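
A toy sketch of that bookkeeping, just to show how the three numbers
relate (cfqg->weight is the existing user-set weight; dio_weight,
total_async_weight and the helper below are hypothetical, proposed
names, not current cfq code):

/*
 * Hypothetical sketch: split each cgroup's user-set weight into a part
 * used for its own sync slices and a part delegated to the flusher.
 * Assumes async_weight <= cfqg->weight.
 */
struct cfq_group_sketch {
        unsigned int weight;            /* blkio.weight, set by the user */
        unsigned int dio_weight;        /* drives sync slice vdisktime */
        unsigned int async_weight;      /* delegated to the flusher */
};

static unsigned int total_async_weight; /* the flusher's total "credits" */

static void cfqg_set_async_weight(struct cfq_group_sketch *cfqg,
                                  unsigned int async_weight)
{
        total_async_weight -= cfqg->async_weight;       /* retract old contribution */
        cfqg->async_weight = async_weight;
        cfqg->dio_weight = cfqg->weight - async_weight;
        total_async_weight += cfqg->async_weight;
}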

> Overall sounds very odd design to me. I am not sure what are we achieving
> by this. In current scheme one should be able to just adjust the weight
> of root cgroup using cgroup interface and achieve same results which you
> are showing. So where is the need of dynamically changing it inside
> kernel.

The "dynamically changing weights" are for the in-cgroup equal split
between sync/async IOs. It does feel like an arbitrary added policy..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-24 14:56                                     ` Jan Kara
@ 2012-04-24 15:58                                       ` Vivek Goyal
  -1 siblings, 0 replies; 262+ messages in thread
From: Vivek Goyal @ 2012-04-24 15:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Tejun Heo, Jens Axboe, linux-mm, sjayaraman,
	andrea, jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu,
	lizefan, containers, cgroups, ctalbott, rni, lsf

On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote:

[..]
> > > I think having separate weights for sync IO groups and async IO is not
> > > very appealing. There should be one notion of group weight and bandwidth
> > > distributed among groups according to their weight.
> > 
> > There has to be some scheme, either explicit or implicit. Maybe
> > you are bearing in mind some "equal split among queues" policy? For
> > example, if the cgroup has 9 active sync queues and 1 async queue,
> > split the weight equally to the 10 queues?  So the sync IOs get 90%
> > share, and the async writes get 10% share.
>   Maybe I misunderstand but there doesn't have to be (and in fact isn't)
> any split among sync / async IO in CFQ. At each moment, we choose a queue
> with the highest score and dispatch a couple of requests from it. Then we
> go and choose again. The score of the queue depends on several factors
> (like age of requests, whether the queue is sync or async, IO priority,
> etc.).
> 
> Practically, over a longer period the system will stabilize on some ratio,
> but that's dependent on the load, so your system should not impose an
> artificial direct/buffered split but rather somehow deal with the reality
> of how the IO scheduler decides to dispatch requests...

Yes. CFQ does not have the notion of giving a fixed share to async
requests. In fact right now it is so biased in favor of sync requests
that in some cases it can starve async writes or introduce long delays
resulting in "task hung for 120 second" warnings.

So if there are issues w.r.t. how the disk is shared between sync/async IO
within a cgroup, that should be handled at the IO scheduler level. Writeback
code trying to dictate that ratio sounds odd.

> 
> > For dirty throttling w/o cgroup awareness, balance_dirty_pages()
> > splits the writeout bandwidth equally among all dirtier tasks. Since
> > cfq works with queues, it seems most natural for it to do equal split
> > among all queues (inside the cgroup).
>   Well, but we also have IO priorities which change which queue should get
> preference.
> 
> > I'm not sure when there are N dd tasks doing direct IO, cfq will
> > continuously run N sync queues for them (without many dynamic queue
> > deletion and recreations). If that is the case, it should be trivial
> > to support the queue based fair split in the global async queue
> > scheme. Otherwise I'll have some trouble detecting the N value when
> > trying to do the N:1 sync:async weight split.
>   And also sync queues for several processes can get merged when CFQ
> observes these processes cooperate together on one area of disk and get
> split again when processes stop cooperating. I don't think you really want
> to second-guess what CFQ does inside...

Agreed. Trying to predict what CFQ will do and then trying to influence
the sync/async ratio based on root cgroup weight does not seem to be the
right way. Especially as that will also mean either assuming that everything
in the root group is sync or having to split the notion of weight into
separate sync/async weights.

The sync/async ratio is an IO scheduler thing and is not fixed. So the
writeback layer making assumptions and changing weights sounds very awkward
to me.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-25  2:42                                         ` Fengguang Wu
  0 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-25  2:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Tejun Heo, Jens Axboe, linux-mm, sjayaraman, andrea,
	jmoyer, linux-fsdevel, linux-kernel, kamezawa.hiroyu, lizefan,
	containers, cgroups, ctalbott, rni, lsf

On Tue, Apr 24, 2012 at 11:58:43AM -0400, Vivek Goyal wrote:
> On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote:
> 
> [..]
> > > > I think having separate weights for sync IO groups and async IO is not
> > > > very appealing. There should be one notion of group weight and bandwidth
> > > > distributed among groups according to their weight.
> > > 
> > > There has to be some scheme, either explicit or implicit. Maybe
> > > you are bearing in mind some "equal split among queues" policy? For
> > > example, if the cgroup has 9 active sync queues and 1 async queue,
> > > split the weight equally to the 10 queues?  So the sync IOs get 90%
> > > share, and the async writes get 10% share.
> >   Maybe I misunderstand but there doesn't have to be (and in fact isn't)
> > any split among sync / async IO in CFQ. At each moment, we choose a queue
> > with the highest score and dispatch a couple of requests from it. Then we
> > go and choose again. The score of the queue depends on several factors
> > (like age of requests, whether the queue is sync or async, IO priority,
> > etc.).
> > 
> > Practically, over a longer period the system will stabilize on some ratio,
> > but that's dependent on the load, so your system should not impose an
> > artificial direct/buffered split but rather somehow deal with the reality
> > of how the IO scheduler decides to dispatch requests...
> 
> Yes. CFQ does not have the notion of giving a fixed share to async
> requests. In fact right now it is so biased in favor of sync requests
> that in some cases it can starve async writes or introduce long delays
> resulting in "task hung for 120 second" warnings.
> 
> So if there are issues w.r.t. how the disk is shared between sync/async IO
> within a cgroup, that should be handled at the IO scheduler level. Writeback
> code trying to dictate that ratio sounds odd.

Indeed it sounds odd. However, it does look like there needs to be some
sync/async ratio to avoid livelock issues, say 80:20 or whatever.
What's your original plan to deal with this in the IO scheduler?
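
Just to illustrate what such a floor could look like (a hypothetical
dispatcher fragment, not an existing CFQ or blkcg knob): out of every few
dispatch slots, reserve a minimum share for async queues so buffered
writeback cannot be starved indefinitely.

enum { DISPATCH_CYCLE = 10, ASYNC_MIN_SLOTS = 2 };   /* i.e. an 80:20 floor */

struct dispatch_state {
        int slot;       /* position within the current dispatch cycle */
};

/* returns 1 when the next dispatch slot is reserved for an async queue */
static int must_pick_async(struct dispatch_state *ds, int async_pending)
{
        int async_turn = (ds->slot % DISPATCH_CYCLE) >=
                         DISPATCH_CYCLE - ASYNC_MIN_SLOTS;

        ds->slot++;
        return async_pending && async_turn;
}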

> > > For dirty throttling w/o cgroup awareness, balance_dirty_pages()
> > > splits the writeout bandwidth equally among all dirtier tasks. Since
> > > cfq works with queues, it seems most natural for it to do equal split
> > > among all queues (inside the cgroup).
> >   Well, but we also have IO priorities which change which queue should get
> > preference.
> > 
> > > I'm not sure when there are N dd tasks doing direct IO, cfq will
> > > continuously run N sync queues for them (without many dynamic queue
> > > deletion and recreations). If that is the case, it should be trivial
> > > to support the queue based fair split in the global async queue
> > > scheme. Otherwise I'll have some trouble detecting the N value when
> > > trying to do the N:1 sync:async weight split.
> >   And also sync queues for several processes can get merged when CFQ
> > observes these processes cooperate together on one area of disk and get
> > split again when processes stop cooperating. I don't think you really want
> > to second-guess what CFQ does inside...
> 
> Agreed. Trying to predict what CFQ will do and then trying to influence
> the sync/async ratio based on root cgroup weight does not seem to be the
> right way. Especially as that will also mean either assuming that everything
> in the root group is sync or having to split the notion of weight into
> separate sync/async weights.

It seems there is some misunderstanding about the sync/async split.
No, root cgroup tasks won't be special wrt the weight split,
although in the current patch I do make the assumption that no IO
is happening in the root cgroup.

To make it look easier, we may as well move the flusher thread to a
standalone cgroup. Then if the root cgroup has both aggressive
sync/async IOs, the split will be carried out the same way as other
cgroups:

        rootcg->dio_weight = rootcg->weight / 2
        flushercg->async_weight += rootcg->weight / 2
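
Spelled out a little more, as a sketch with made-up structure and field
names rather than the actual patch: every cgroup's weight is halved between
its direct IO and its buffered writes, and the buffered halves are
accumulated into the flusher's async weight, since the flusher issues all
buffered writeback:

struct cgroup_weights {
        unsigned int weight;       /* configured blkio.weight */
        unsigned int dio_weight;   /* computed share for this cgroup's direct IO */
};

static unsigned int split_weights(struct cgroup_weights *cgs, int nr_cgroups)
{
        unsigned int flusher_async_weight = 0;
        int i;

        for (i = 0; i < nr_cgroups; i++) {
                cgs[i].dio_weight = cgs[i].weight / 2;
                flusher_async_weight += cgs[i].weight / 2;
        }
        return flusher_async_weight;   /* applied to the flusher's async IO */
}

In this scheme the flusher would then compete in cfq with the per-cgroup
direct IO queues using that aggregate async weight.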

> The sync/async ratio is an IO scheduler thing and is not fixed. So the
> writeback layer making assumptions and changing weights sounds very awkward
> to me.

OK, the ratio is not fixed, so I'm not going to do the guesswork.
However, there is still the question: how are we going to fix the
sync-starves-async IO problem without some guaranteed ratio?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-24 14:56                                     ` Jan Kara
@ 2012-04-25  3:16                                         ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-25  3:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

[-- Attachment #1: Type: text/plain, Size: 5979 bytes --]

On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote:
> On Tue 24-04-12 19:33:40, Wu Fengguang wrote:
> > On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote:
> > > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote:
> > > 
> > > [..]
> > > > Yeah the backpressure idea would work nicely with all possible
> > > > intermediate stacking between the bdi and leaf devices. In my attempt
> > > > to do combined IO bandwidth control for
> > > > 
> > > > - buffered writes, in balance_dirty_pages()
> > > > - direct IO, in the cfq IO scheduler
> > > > 
> > > > I have to look into the cfq code in the past days to get an idea how
> > > > the two throttling layers can cooperate (and suffer from the pains
> > > > arising from the violations of layers). It's also rather tricky to get
> > > > two previously independent throttling mechanisms to work seamlessly
> > > > with each other for providing the desired _unified_ user interface. It
> > > > took a lot of reasoning and experiments to work the basic scheme out...
> > > > 
> > > > But here is the first result. The attached graph shows progress of 4
> > > > tasks:
> > > > - cgroup A: 1 direct dd + 1 buffered dd
> > > > - cgroup B: 1 direct dd + 1 buffered dd
> > > > 
> > > > The 4 tasks are mostly progressing at the same pace. The top 2
> > > > smoother lines are for the buffered dirtiers. The bottom 2 lines are
> > > > for the direct writers. As you may notice, the two direct writers are
> > > > somehow stalled for 1-2 times, which increases the gaps between the
> > > > lines. Otherwise, the algorithm is working as expected to distribute
> > > > the bandwidth to each task.
> > > > 
> > > > The current code's target is to satisfy the more realistic user demand
> > > > of distributing bandwidth equally to each cgroup, and inside each
> > > > cgroup, distribute bandwidth equally to buffered/direct writes. On top
> > > > of which, weights can be specified to change the default distribution.
> > > > 
> > > > The implementation involves adding "weight for direct IO" to the cfq
> > > > groups and "weight for buffered writes" to the root cgroup. Note that
> > > > current cfq proportional IO controller does not offer explicit control
> > > > over the direct:buffered ratio.
> > > > 
> > > > When there are both direct/buffered writers in the cgroup,
> > > > balance_dirty_pages() will kick in and adjust the weights for cfq to
> > > > execute. Note that cfq will continue to send all flusher IOs to the
> > > > root cgroup.  balance_dirty_pages() will compute the overall async
> > > > weight for it so that in the above test case, the computed weights
> > > > will be
> > > 
> > > I think having separate weights for sync IO groups and async IO is not
> > > very appealing. There should be one notion of group weight and bandwidth
> > > distributed among groups according to their weight.
> > 
> > There has to be some scheme, either explicit or implicit. Maybe
> > you are bearing in mind some "equal split among queues" policy? For
> > example, if the cgroup has 9 active sync queues and 1 async queue,
> > split the weight equally to the 10 queues?  So the sync IOs get 90%
> > share, and the async writes get 10% share.
>   Maybe I misunderstand but there doesn't have to be (and in fact isn't)
> any split among sync / async IO in CFQ. At each moment, we choose a queue
> with the highest score and dispatch a couple of requests from it. Then we
> go and choose again. The score of the queue depends on several factors
> (like age of requests, whether the queue is sync or async, IO priority,
> etc.).
> 
> Practically, over a longer period the system will stabilize on some ratio,
> but that's dependent on the load, so your system should not impose an
> artificial direct/buffered split but rather somehow deal with the reality
> of how the IO scheduler decides to dispatch requests...

>   Well, but we also have IO priorities which change which queue should get
> preference.

>   And also sync queues for several processes can get merged when CFQ
> observes these processes cooperate together on one area of disk and get
> split again when processes stop cooperating. I don't think you really want
> to second-guess what CFQ does inside...
 
Good points, thank you!

So the cfq behavior is pretty much nondeterministic. I more or less realized
this from the experiments. For example, when starting 2+ "dd oflag=direct"
tasks in one single cgroup, they _sometimes_ progress at different rates.
See the attached graphs for two such examples on XFS. ext4 is fine.

The 2-dd test case is:

mkdir /cgroup/dd
echo $$ > /cgroup/dd/tasks
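# (assumes the blkio cgroup hierarchy is mounted at /cgroup and the test filesystem at /fs)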

dd if=/dev/zero of=/fs/zero1 bs=1M oflag=direct &
dd if=/dev/zero of=/fs/zero2 bs=1M oflag=direct &

The 6-dd test case is similar.

> > Look at this graph, the 4 dd tasks are granted the same weight (2 of
> > them are buffered writes). I guess the 2 buffered dd tasks managed to
> > progress much faster than the 2 direct dd tasks just because the async
> > IOs are much more efficient than the bs=64k direct IOs.
>   Likely because 64k is too low to get good bandwidth with direct IO. If
> it were 4M, I believe you would get similar throughput for buffered and
> direct IO. So essentially you are right: small IO benefits from caching
> effects since they allow you to submit larger requests to the device,
> which is more efficient.

I didn't directly compare the effects; however, here is an example of
doing 1M, 64k and 4k direct writes in parallel. It _seems_ bs=1M only has
marginal benefits over 64k, assuming cfq is behaving well.

https://github.com/fengguang/io-controller-tests/raw/master/log/snb/ext4/direct-write-1M-64k-4k.2012-04-19-10-50/balance_dirty_pages-task-bw.png

The test case is:

# cgroup 1
echo 500 > /cgroup/cp/blkio.weight

dd if=/dev/zero of=/fs/zero-1M bs=1M oflag=direct &

# cgroup 2
echo 1000 > /cgroup/dd/blkio.weight

dd if=/dev/zero of=/fs/zero-64k bs=64k oflag=direct &
dd if=/dev/zero of=/fs/zero-4k  bs=4k  oflag=direct &

Thanks,
Fengguang

[-- Attachment #2: balance_dirty_pages-task-bw.png --]
[-- Type: image/png, Size: 55134 bytes --]

[-- Attachment #3: balance_dirty_pages-task-bw.png --]
[-- Type: image/png, Size: 61243 bytes --]

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [Lsf] [RFC] writeback and cgroup
@ 2012-04-25  8:47           ` Suresh Jayaraman
  0 siblings, 0 replies; 262+ messages in thread
From: Suresh Jayaraman @ 2012-04-25  8:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, Steve French, ctalbott, rni, andrea, containers,
	linux-kernel, lsf, linux-mm, jmoyer, lizefan, linux-fsdevel,
	cgroups

On 04/05/2012 12:49 AM, Vivek Goyal wrote:
> On Wed, Apr 04, 2012 at 11:56:05AM -0700, Tejun Heo wrote:
>> On Wed, Apr 04, 2012 at 10:36:04AM -0500, Steve French wrote:
> >>>> How do you take care of throttling IO in the NFS case in this model? Current
>>>> throttling logic is tied to block device and in case of NFS, there is no
>>>> block device.
>>>
>>> Similarly smb2 gets congestion info (number of "credits") returned from
>>> the server on every response - but not sure why congestion
>>> control is tied to the block device when this would create
>>> problems for network file systems
>>
>> I hope the previous replies answered this.  It's about writeback
>> getting pressure from bdi and isn't restricted to block devices.
> 
> So the controlling knobs for network filesystems will be very different
> as current throttling knobs are per device (and not per bdi). So
> presumably there will be some throttling logic in network layer (network
> tc), and that should communicate the back pressure.

Tried to figure out potential use-case scenarios for controlling Network
I/O resource from netfs POV (which ideally should guide the interfaces).

- Is finer-grained control of network I/O desirable/useful, or is being
  able to control bandwidth at the per-server level sufficient? Consider
  the case where there are different NFS volumes mounted from the same
  NFS/CIFS server,

    /backup
    /missioncritical_data
    /apps
    /documents

  An admin being able to set bandwidth limits on each of these
  mounts based on how important they are would be a useful feature. If we
  try to build the logic in the network layer using tc, then it still
  wouldn't be possible to limit tasks that are writing to more than
  one volume (we would need some logic in netfs as well?). Network filesystem
  clients typically are not bothered much about the actual device but
  about the exported share. So it appears that the controlling knobs
  could be different for netfs.

- Provide minimum guarantees for Network I/O to keep going
  irrespective of overloaded workload situations, i.e. operations
  that are local to the machine should not hamper Network I/O or
  operations that are happening on one mount should not impact
  operations that are happening on another mount.

  IIRC, while we currently would be able to limit maximum usage, we
  don't guarantee the minimum quantity of the resource that would be
  available in general for all controllers. This might be important from
  QoS guarantee POV.

- What are the other use-cases where limiting Network I/O would be
  useful?

> I have tried limiting network traffic on NFS using the network controller
> and tc, but that did not help for a variety of reasons.
> 

A quick look at the current net_prio implementation shows that it allows
setting priorities but doesn't seem to provide a way to limit
throughput. Or is it still possible?
If not, did you use an out-of-tree implementation to test this?
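
For context, the in-tree setup I assume such an experiment would use is the
net_cls cgroup to tag traffic plus an HTB class matched by the cgroup tc
filter; a rough sketch (interface name, mount point, classid and rate are
just placeholders):

mount -t cgroup -o net_cls net_cls /cgroup/net_cls
mkdir /cgroup/net_cls/nfswriter
echo 0x00010010 > /cgroup/net_cls/nfswriter/net_cls.classid   # maps to tc class 1:10

tc qdisc  add dev eth0 root handle 1: htb
tc class  add dev eth0 parent 1: classid 1:10 htb rate 10mbit
tc filter add dev eth0 parent 1: protocol ip prio 10 handle 1: cgroup

echo $$ > /cgroup/net_cls/nfswriter/tasks   # then start the NFS writer from this shell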

> - We again have the problem of losing submitter's context down the layer.

If the network layer is cgroup aware, why would this be a problem?

> - We have interesting TCP/IP sequencing issues. I don't have the details
>   but if you throttle traffic from one group, it led to some
>   kind of multiple re-transmissions from the server for acks due to some
>   sequence number issues. Sorry, I am short on details as it was long back
>   and nfs guys told me that pNFS might help here.
> 
>   The basic problem seemed to be that if you multiplex traffic from
>   all cgroups on a single tcp/ip session and then choke IO suddenly from
>   one of them, that was leading to some sequence number issues and led
>   to really sucky performance.
> 
> So something to keep in mind while coming up with ways to implement
> throttling for network file systems.
> 


Thanks
Suresh


^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-25  3:16                                         ` Fengguang Wu
  (?)
@ 2012-04-25  9:01                                           ` Jan Kara
  -1 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-25  9:01 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, Vivek Goyal, Tejun Heo, Jens Axboe, linux-mm,
	sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni,
	lsf

On Wed 25-04-12 11:16:35, Wu Fengguang wrote:
> On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote:
> > On Tue 24-04-12 19:33:40, Wu Fengguang wrote:
> > > On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote:
> > > > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote:
> > > > 
> > > > [..]
> > > > > Yeah the backpressure idea would work nicely with all possible
> > > > > intermediate stacking between the bdi and leaf devices. In my attempt
> > > > > to do combined IO bandwidth control for
> > > > > 
> > > > > - buffered writes, in balance_dirty_pages()
> > > > > - direct IO, in the cfq IO scheduler
> > > > > 
> > > > > I have to look into the cfq code in the past days to get an idea how
> > > > > the two throttling layers can cooperate (and suffer from the pains
> > > > > arise from the violations of layers). It's also rather tricky to get
> > > > > two previously independent throttling mechanisms to work seamlessly
> > > > > with each other for providing the desired _unified_ user interface. It
> > > > > took a lot of reasoning and experiments to work the basic scheme out...
> > > > > 
> > > > > But here is the first result. The attached graph shows progress of 4
> > > > > tasks:
> > > > > - cgroup A: 1 direct dd + 1 buffered dd
> > > > > - cgroup B: 1 direct dd + 1 buffered dd
> > > > > 
> > > > > The 4 tasks are mostly progressing at the same pace. The top 2
> > > > > smoother lines are for the buffered dirtiers. The bottom 2 lines are
> > > > > for the direct writers. As you may notice, the two direct writers are
> > > > > somehow stalled for 1-2 times, which increases the gaps between the
> > > > > lines. Otherwise, the algorithm is working as expected to distribute
> > > > > the bandwidth to each task.
> > > > > 
> > > > > The current code's target is to satisfy the more realistic user demand
> > > > > of distributing bandwidth equally to each cgroup, and inside each
> > > > > cgroup, distribute bandwidth equally to buffered/direct writes. On top
> > > > > of which, weights can be specified to change the default distribution.
> > > > > 
> > > > > The implementation involves adding "weight for direct IO" to the cfq
> > > > > groups and "weight for buffered writes" to the root cgroup. Note that
> > > > > current cfq proportional IO controller does not offer explicit control
> > > > > over the direct:buffered ratio.
> > > > > 
> > > > > When there are both direct/buffered writers in the cgroup,
> > > > > balance_dirty_pages() will kick in and adjust the weights for cfq to
> > > > > execute. Note that cfq will continue to send all flusher IOs to the
> > > > > root cgroup.  balance_dirty_pages() will compute the overall async
> > > > > weight for it so that in the above test case, the computed weights
> > > > > will be
> > > > 
> > > > I think having separate weights for sync IO groups and async IO is not
> > > > very appealing. There should be one notion of group weight, and bandwidth
> > > > distributed among groups according to their weight.
> > > 
> > > There has to be some scheme, either explicit or implicit. Maybe
> > > you are bearing in mind some "equal split among queues" policy? For
> > > example, if the cgroup has 9 active sync queues and 1 async queue,
> > > split the weight equally to the 10 queues?  So the sync IOs get 90%
> > > share, and the async writes get 10% share.
> >   Maybe I misunderstand but there doesn't have to be (and in fact isn't)
> > any split among sync / async IO in CFQ. At each moment, we choose a queue
> > with the highest score and dispatch a couple of requests from it. Then we
> > go and choose again. The score of the queue depends on several factors
> > (like age of requests, whether the queue is sync or async, IO priority,
> > etc.).
> > 
> > Practically, over a longer period the system will stabilize on some ratio,
> > but that's dependent on the load, so your system should not impose some
> > artificial direct/buffered split but rather somehow deal with the reality
> > of how the IO scheduler decides to dispatch requests...
> 
> >   Well, but we also have IO priorities which change which queue should get
> > preference.
> 
> >   And also sync queues for several processes can get merged when CFQ
> > observes these processes cooperate together on one area of disk and get
> > split again when processes stop cooperating. I don't think you really want
> > to second-guess what CFQ does inside...
>  
> Good points, thank you!
> 
> So the cfq behavior is pretty nondeterministic. I more or less realized
> this from the experiments. For example, when starting 2+ "dd oflag=direct"
> tasks in a single cgroup, they _sometimes_ progress at different rates.
> See the attached graphs for two such examples on XFS. ext4 is fine.
> 
> The 2-dd test case is:
> 
> mkdir /cgroup/dd
> echo $$ > /cgroup/dd/tasks
> 
> dd if=/dev/zero of=/fs/zero1 bs=1M oflag=direct &
> dd if=/dev/zero of=/fs/zero2 bs=1M oflag=direct &
> 
> The 6-dd test case is similar.
  Hum, interesting. I would not expect that. Maybe it's because the files are
allocated in different areas of the disk. But even then the difference
should not be *that* big.
 
> > > Look at this graph, the 4 dd tasks are granted the same weight (2 of
> > > them are buffered writes). I guess the 2 buffered dd tasks managed to
> > > progress much faster than the 2 direct dd tasks just because the async
> > > IOs are much more efficient than the bs=64k direct IOs.
> >   Likely because 64k is too low to get good bandwidth with direct IO. If
> > it was 4M, I believe you would get similar throughput for buffered and
> > direct IO. So essentially you are right, small IO benefits from caching
> > effects since they allow you to submit larger requests to the device which
> > is more efficient.
> 
> I didn't directly compare the effects; however, here is an example of
> doing 1M, 64k and 4k direct writes in parallel. It _seems_ bs=1M only has
> marginal benefits over 64k, assuming cfq is behaving well.
> 
> https://github.com/fengguang/io-controller-tests/raw/master/log/snb/ext4/direct-write-1M-64k-4k.2012-04-19-10-50/balance_dirty_pages-task-bw.png
> 
> The test case is:
> 
> # cgroup 1
> echo 500 > /cgroup/cp/blkio.weight
> 
> dd if=/dev/zero of=/fs/zero-1M bs=1M oflag=direct &
> 
> # cgroup 2
> echo 1000 > /cgroup/dd/blkio.weight
> 
> dd if=/dev/zero of=/fs/zero-64k bs=64k oflag=direct &
> dd if=/dev/zero of=/fs/zero-4k  bs=4k  oflag=direct &
  Um, I'm not completely sure what you were trying to test above.
What I wanted to point out is that direct IO is not necessarily less
efficient than buffered IO. Look:
xen-node0:~ # uname -a
Linux xen-node0 3.3.0-rc4-xen+ #6 SMP PREEMPT Tue Apr 17 06:48:08 UTC 2012
x86_64 x86_64 x86_64 GNU/Linux
xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.5304 s, 102 MB/s
xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 oflag=direct conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.3678 s, 104 MB/s

So both direct and buffered IO are about the same. Note that I used the
conv=fsync flag to eliminate the effect of part of the buffered write still
remaining in the cache when dd is done writing, which would be unfair to the
direct writer...

And actually 64k vs 1M makes a big difference on my machine:
xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=64k count=16384 oflag=direct conv=fsync
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 19.3176 s, 55.6 MB/s

								Honza

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-25  9:01                                           ` Jan Kara
  0 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-25  9:01 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, Vivek Goyal, Tejun Heo, Jens Axboe,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, sjayaraman-IBi9RG/b67k,
	andrea-oIIqvOZpAevzfdHfmsDf5w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	lizefan-hv44wF8Li93QT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Wed 25-04-12 11:16:35, Wu Fengguang wrote:
> On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote:
> > On Tue 24-04-12 19:33:40, Wu Fengguang wrote:
> > > On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote:
> > > > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote:
> > > > 
> > > > [..]
> > > > > Yeah the backpressure idea would work nicely with all possible
> > > > > intermediate stacking between the bdi and leaf devices. In my attempt
> > > > > to do combined IO bandwidth control for
> > > > > 
> > > > > - buffered writes, in balance_dirty_pages()
> > > > > - direct IO, in the cfq IO scheduler
> > > > > 
> > > > > I have to look into the cfq code in the past days to get an idea how
> > > > > the two throttling layers can cooperate (and suffer from the pains
> > > > > arise from the violations of layers). It's also rather tricky to get
> > > > > two previously independent throttling mechanisms to work seamlessly
> > > > > with each other for providing the desired _unified_ user interface. It
> > > > > took a lot of reasoning and experiments to work the basic scheme out...
> > > > > 
> > > > > But here is the first result. The attached graph shows progress of 4
> > > > > tasks:
> > > > > - cgroup A: 1 direct dd + 1 buffered dd
> > > > > - cgroup B: 1 direct dd + 1 buffered dd
> > > > > 
> > > > > The 4 tasks are mostly progressing at the same pace. The top 2
> > > > > smoother lines are for the buffered dirtiers. The bottom 2 lines are
> > > > > for the direct writers. As you may notice, the two direct writers are
> > > > > somehow stalled for 1-2 times, which increases the gaps between the
> > > > > lines. Otherwise, the algorithm is working as expected to distribute
> > > > > the bandwidth to each task.
> > > > > 
> > > > > The current code's target is to satisfy the more realistic user demand
> > > > > of distributing bandwidth equally to each cgroup, and inside each
> > > > > cgroup, distribute bandwidth equally to buffered/direct writes. On top
> > > > > of which, weights can be specified to change the default distribution.
> > > > > 
> > > > > The implementation involves adding "weight for direct IO" to the cfq
> > > > > groups and "weight for buffered writes" to the root cgroup. Note that
> > > > > current cfq proportional IO conroller does not offer explicit control
> > > > > over the direct:buffered ratio.
> > > > > 
> > > > > When there are both direct/buffered writers in the cgroup,
> > > > > balance_dirty_pages() will kick in and adjust the weights for cfq to
> > > > > execute. Note that cfq will continue to send all flusher IOs to the
> > > > > root cgroup.  balance_dirty_pages() will compute the overall async
> > > > > weight for it so that in the above test case, the computed weights
> > > > > will be
> > > > 
> > > > I think having separate weigths for sync IO groups and async IO is not
> > > > very appealing. There should be one notion of group weight and bandwidth
> > > > distrubuted among groups according to their weight.
> > > 
> > > There have to be some scheme, either explicitly or implicitly. Maybe
> > > you are baring in mind some "equal split among queues" policy? For
> > > example, if the cgroup has 9 active sync queues and 1 async queue,
> > > split the weight equally to the 10 queues?  So the sync IOs get 90%
> > > share, and the async writes get 10% share.
> >   Maybe I misunderstand but there doesn't have to be (and in fact isn't)
> > any split among sync / async IO in CFQ. At each moment, we choose a queue
> > with the highest score and dispatch a couple of requests from it. Then we
> > go and choose again. The score of the queue depends on several factors
> > (like age of requests, whether the queue is sync or async, IO priority,
> > etc.).
> > 
> > Practically, over a longer period system will stabilize on some ratio
> > but that's dependent on the load so your system should not impose some
> > artificial direct/buffered split but rather somehow deal with the reality
> > how IO scheduler decides to dispatch requests...
> 
> >   Well, but we also have IO priorities which change which queue should get
> > preference.
> 
> >   And also sync queues for several processes can get merged when CFQ
> > observes these processes cooperate together on one area of disk and get
> > split again when processes stop cooperating. I don't think you really want
> > to second-guess what CFQ does inside...
>  
> Good points, thank you!
> 
> So the cfq behavior is pretty undetermined. I more or less realize
> this from the experiments. For example, when starting 2+ "dd oflag=direct"
> tasks in one single cgroup, they _sometimes_ progress at different rates.
> See the attached graphs for two such examples on XFS. ext4 is fine.
> 
> The 2-dd test case is:
> 
> mkdir /cgroup/dd
> echo $$ > /cgroup/dd/tasks
> 
> dd if=/dev/zero of=/fs/zero1 bs=1M oflag=direct &
> dd if=/dev/zero of=/fs/zero2 bs=1M oflag=direct &
> 
> The 6-dd test case is similar.
  Hum, interesting. I would not expect that. Maybe it's because files are
allocated at the different area of the disk. But even then the difference
should not be *that* big.
 
> > > Look at this graph, the 4 dd tasks are granted the same weight (2 of
> > > them are buffered writes). I guess the 2 buffered dd tasks managed to
> > > progress much faster than the 2 direct dd tasks just because the async
> > > IOs are much more efficient than the bs=64k direct IOs.
> >   Likely because 64k is too low to get good bandwidth with direct IO. If
> > it was 4M, I believe you would get similar throughput for buffered and
> > direct IO. So essentially you are right, small IO benefits from caching
> > effects since they allow you to submit larger requests to the device which
> > is more efficient.
> 
> I didn't direct compare the effects, however here is an example of
> doing 1M, 64k, 4k direct writes in parallel. It _seems_ bs=1M only has
> marginal benefits of 64k, assuming cfq is behaving well.
> 
> https://github.com/fengguang/io-controller-tests/raw/master/log/snb/ext4/direct-write-1M-64k-4k.2012-04-19-10-50/balance_dirty_pages-task-bw.png
> 
> The test case is:
> 
> # cgroup 1
> echo 500 > /cgroup/cp/blkio.weight
> 
> dd if=/dev/zero of=/fs/zero-1M bs=1M oflag=direct &
> 
> # cgroup 2
> echo 1000 > /cgroup/dd/blkio.weight
> 
> dd if=/dev/zero of=/fs/zero-64k bs=64k oflag=direct &
> dd if=/dev/zero of=/fs/zero-4k  bs=4k  oflag=direct &
  Um, I'm not completely sure what you tried to test in the above test.
What I wanted to point out is that direct IO is not necessarily less
efficient than buffered IO. Look:
xen-node0:~ # uname -a
Linux xen-node0 3.3.0-rc4-xen+ #6 SMP PREEMPT Tue Apr 17 06:48:08 UTC 2012
x86_64 x86_64 x86_64 GNU/Linux
xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.5304 s, 102 MB/s
xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 oflag=direct conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.3678 s, 104 MB/s

So both direct and buffered IO are about the same. Note that I used
conv=fsync flag to erase the effect that part of buffered write still
remains in the cache when dd is done writing which is unfair to direct
writer...

And actually 64k vs 1M makes a big difference on my machine:
xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=64k count=16384 oflag=direct conv=fsync
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 19.3176 s, 55.6 MB/s

								Honza

^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
@ 2012-04-25  9:01                                           ` Jan Kara
  0 siblings, 0 replies; 262+ messages in thread
From: Jan Kara @ 2012-04-25  9:01 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, Vivek Goyal, Tejun Heo, Jens Axboe, linux-mm,
	sjayaraman, andrea, jmoyer, linux-fsdevel, linux-kernel,
	kamezawa.hiroyu, lizefan, containers, cgroups, ctalbott, rni,
	lsf

On Wed 25-04-12 11:16:35, Wu Fengguang wrote:
> On Tue, Apr 24, 2012 at 04:56:55PM +0200, Jan Kara wrote:
> > On Tue 24-04-12 19:33:40, Wu Fengguang wrote:
> > > On Mon, Apr 16, 2012 at 10:57:45AM -0400, Vivek Goyal wrote:
> > > > On Sat, Apr 14, 2012 at 10:36:39PM +0800, Fengguang Wu wrote:
> > > > 
> > > > [..]
> > > > > Yeah the backpressure idea would work nicely with all possible
> > > > > intermediate stacking between the bdi and leaf devices. In my attempt
> > > > > to do combined IO bandwidth control for
> > > > > 
> > > > > - buffered writes, in balance_dirty_pages()
> > > > > - direct IO, in the cfq IO scheduler
> > > > > 
> > > > > I have to look into the cfq code in the past days to get an idea how
> > > > > the two throttling layers can cooperate (and suffer from the pains
> > > > > arise from the violations of layers). It's also rather tricky to get
> > > > > two previously independent throttling mechanisms to work seamlessly
> > > > > with each other for providing the desired _unified_ user interface. It
> > > > > took a lot of reasoning and experiments to work the basic scheme out...
> > > > > 
> > > > > But here is the first result. The attached graph shows progress of 4
> > > > > tasks:
> > > > > - cgroup A: 1 direct dd + 1 buffered dd
> > > > > - cgroup B: 1 direct dd + 1 buffered dd
> > > > > 
> > > > > The 4 tasks are mostly progressing at the same pace. The top 2
> > > > > smoother lines are for the buffered dirtiers. The bottom 2 lines are
> > > > > for the direct writers. As you may notice, the two direct writers are
> > > > > somehow stalled for 1-2 times, which increases the gaps between the
> > > > > lines. Otherwise, the algorithm is working as expected to distribute
> > > > > the bandwidth to each task.
> > > > > 
> > > > > The current code's target is to satisfy the more realistic user demand
> > > > > of distributing bandwidth equally to each cgroup, and inside each
> > > > > cgroup, distribute bandwidth equally to buffered/direct writes. On top
> > > > > of which, weights can be specified to change the default distribution.
> > > > > 
> > > > > The implementation involves adding "weight for direct IO" to the cfq
> > > > > groups and "weight for buffered writes" to the root cgroup. Note that
> > > > > current cfq proportional IO conroller does not offer explicit control
> > > > > over the direct:buffered ratio.
> > > > > 
> > > > > When there are both direct/buffered writers in the cgroup,
> > > > > balance_dirty_pages() will kick in and adjust the weights for cfq to
> > > > > execute. Note that cfq will continue to send all flusher IOs to the
> > > > > root cgroup.  balance_dirty_pages() will compute the overall async
> > > > > weight for it so that in the above test case, the computed weights
> > > > > will be
> > > > 
> > > > I think having separate weigths for sync IO groups and async IO is not
> > > > very appealing. There should be one notion of group weight and bandwidth
> > > > distrubuted among groups according to their weight.
> > > 
> > > There have to be some scheme, either explicitly or implicitly. Maybe
> > > you are baring in mind some "equal split among queues" policy? For
> > > example, if the cgroup has 9 active sync queues and 1 async queue,
> > > split the weight equally to the 10 queues?  So the sync IOs get 90%
> > > share, and the async writes get 10% share.
> >   Maybe I misunderstand but there doesn't have to be (and in fact isn't)
> > any split among sync / async IO in CFQ. At each moment, we choose a queue
> > with the highest score and dispatch a couple of requests from it. Then we
> > go and choose again. The score of the queue depends on several factors
> > (like age of requests, whether the queue is sync or async, IO priority,
> > etc.).
> > 
> > Practically, over a longer period system will stabilize on some ratio
> > but that's dependent on the load so your system should not impose some
> > artificial direct/buffered split but rather somehow deal with the reality
> > how IO scheduler decides to dispatch requests...
> 
> >   Well, but we also have IO priorities which change which queue should get
> > preference.
> 
> >   And also sync queues for several processes can get merged when CFQ
> > observes these processes cooperate together on one area of disk and get
> > split again when processes stop cooperating. I don't think you really want
> > to second-guess what CFQ does inside...
>  
> Good points, thank you!
> 
> So the cfq behavior is pretty undetermined. I more or less realize
> this from the experiments. For example, when starting 2+ "dd oflag=direct"
> tasks in one single cgroup, they _sometimes_ progress at different rates.
> See the attached graphs for two such examples on XFS. ext4 is fine.
> 
> The 2-dd test case is:
> 
> mkdir /cgroup/dd
> echo $$ > /cgroup/dd/tasks
> 
> dd if=/dev/zero of=/fs/zero1 bs=1M oflag=direct &
> dd if=/dev/zero of=/fs/zero2 bs=1M oflag=direct &
> 
> The 6-dd test case is similar.
  Hum, interesting. I would not expect that. Maybe it's because files are
allocated at the different area of the disk. But even then the difference
should not be *that* big.
 
> > > Look at this graph, the 4 dd tasks are granted the same weight (2 of
> > > them are buffered writes). I guess the 2 buffered dd tasks managed to
> > > progress much faster than the 2 direct dd tasks just because the async
> > > IOs are much more efficient than the bs=64k direct IOs.
> >   Likely because 64k is too low to get good bandwidth with direct IO. If
> > it was 4M, I believe you would get similar throughput for buffered and
> > direct IO. So essentially you are right, small IO benefits from caching
> > effects since they allow you to submit larger requests to the device which
> > is more efficient.
> 
> I didn't direct compare the effects, however here is an example of
> doing 1M, 64k, 4k direct writes in parallel. It _seems_ bs=1M only has
> marginal benefits of 64k, assuming cfq is behaving well.
> 
> https://github.com/fengguang/io-controller-tests/raw/master/log/snb/ext4/direct-write-1M-64k-4k.2012-04-19-10-50/balance_dirty_pages-task-bw.png
> 
> The test case is:
> 
> # cgroup 1
> echo 500 > /cgroup/cp/blkio.weight
> 
> dd if=/dev/zero of=/fs/zero-1M bs=1M oflag=direct &
> 
> # cgroup 2
> echo 1000 > /cgroup/dd/blkio.weight
> 
> dd if=/dev/zero of=/fs/zero-64k bs=64k oflag=direct &
> dd if=/dev/zero of=/fs/zero-4k  bs=4k  oflag=direct &
  Um, I'm not completely sure what you tried to test in the above test.
What I wanted to point out is that direct IO is not necessarily less
efficient than buffered IO. Look:
xen-node0:~ # uname -a
Linux xen-node0 3.3.0-rc4-xen+ #6 SMP PREEMPT Tue Apr 17 06:48:08 UTC 2012
x86_64 x86_64 x86_64 GNU/Linux
xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.5304 s, 102 MB/s
xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 oflag=direct conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.3678 s, 104 MB/s

So direct and buffered IO are about the same. Note that I used the
conv=fsync flag to remove the effect of part of the buffered write still
sitting in the cache when dd finishes, which would be unfair to the
direct writer...
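
  If one prefers not to rely on conv=fsync, an equivalent trick is to
time an explicit sync after dd (same file and size as above):

time sh -c 'dd if=/dev/zero of=/mnt/file bs=1M count=1024; sync'

Note that sync flushes everything system-wide, so this is only a rough
equivalent, but it charges the cached part to the writer in the same way.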

And actually 64k vs 1M makes a big difference on my machine:
xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=64k count=16384 oflag=direct conv=fsync
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 19.3176 s, 55.6 MB/s

								Honza


^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-25  9:01                                           ` Jan Kara
@ 2012-04-25 12:05                                               ` Fengguang Wu
  -1 siblings, 0 replies; 262+ messages in thread
From: Fengguang Wu @ 2012-04-25 12:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal

[-- Attachment #1: Type: text/plain, Size: 4319 bytes --]

> > So the cfq behavior is pretty undetermined. I more or less realize
> > this from the experiments. For example, when starting 2+ "dd oflag=direct"
> > tasks in one single cgroup, they _sometimes_ progress at different rates.
> > See the attached graphs for two such examples on XFS. ext4 is fine.
> > 
> > The 2-dd test case is:
> > 
> > mkdir /cgroup/dd
> > echo $$ > /cgroup/dd/tasks
> > 
> > dd if=/dev/zero of=/fs/zero1 bs=1M oflag=direct &
> > dd if=/dev/zero of=/fs/zero2 bs=1M oflag=direct &
> > 
> > The 6-dd test case is similar.
>   Hum, interesting. I would not expect that. Maybe it's because files are
> allocated at the different area of the disk. But even then the difference
> should not be *that* big.

Agreed.

> > > > Look at this graph, the 4 dd tasks are granted the same weight (2 of
> > > > them are buffered writes). I guess the 2 buffered dd tasks managed to
> > > > progress much faster than the 2 direct dd tasks just because the async
> > > > IOs are much more efficient than the bs=64k direct IOs.
> > >   Likely because 64k is too low to get good bandwidth with direct IO. If
> > > it was 4M, I believe you would get similar throughput for buffered and
> > > direct IO. So essentially you are right, small IO benefits from caching
> > > effects since they allow you to submit larger requests to the device which
> > > is more efficient.
> > 
> > I didn't direct compare the effects, however here is an example of
> > doing 1M, 64k, 4k direct writes in parallel. It _seems_ bs=1M only has
> > marginal benefits of 64k, assuming cfq is behaving well.
> > 
> > https://github.com/fengguang/io-controller-tests/raw/master/log/snb/ext4/direct-write-1M-64k-4k.2012-04-19-10-50/balance_dirty_pages-task-bw.png
> > 
> > The test case is:
> > 
> > # cgroup 1
> > echo 500 > /cgroup/cp/blkio.weight
> > 
> > dd if=/dev/zero of=/fs/zero-1M bs=1M oflag=direct &
> > 
> > # cgroup 2
> > echo 1000 > /cgroup/dd/blkio.weight
> > 
> > dd if=/dev/zero of=/fs/zero-64k bs=64k oflag=direct &
> > dd if=/dev/zero of=/fs/zero-4k  bs=4k  oflag=direct &
>   Um, I'm not completely sure what you tried to test in the above test.

Yeah, it's not a good test case. I've changed it to run the 3 dd tasks
in 3 cgroups with equal weights. The new results are attached (they look
much the same as the original).
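
For reference, the revised setup is roughly the following (cgroup names
and weights are illustrative; it assumes bash, so that $BASHPID gives
each subshell its own pid before exec'ing dd):

for g in dd-1M dd-64k dd-4k; do
	mkdir /cgroup/$g
	echo 500 > /cgroup/$g/blkio.weight
done

# move each dd into its own cgroup, then exec so dd keeps that pid
( echo $BASHPID > /cgroup/dd-1M/tasks;  exec dd if=/dev/zero of=/fs/zero-1M  bs=1M  oflag=direct ) &
( echo $BASHPID > /cgroup/dd-64k/tasks; exec dd if=/dev/zero of=/fs/zero-64k bs=64k oflag=direct ) &
( echo $BASHPID > /cgroup/dd-4k/tasks;  exec dd if=/dev/zero of=/fs/zero-4k  bs=4k  oflag=direct ) &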

> What I wanted to point out is that direct IO is not necessarily less
> efficient than buffered IO. Look:
> xen-node0:~ # uname -a
> Linux xen-node0 3.3.0-rc4-xen+ #6 SMP PREEMPT Tue Apr 17 06:48:08 UTC 2012
> x86_64 x86_64 x86_64 GNU/Linux
> xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 conv=fsync
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.1 GB) copied, 10.5304 s, 102 MB/s
> xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=1M count=1024 oflag=direct conv=fsync
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.1 GB) copied, 10.3678 s, 104 MB/s
> 
> So both direct and buffered IO are about the same. Note that I used
> conv=fsync flag to erase the effect that part of buffered write still
> remains in the cache when dd is done writing which is unfair to direct
> writer...

OK, I also find direct writes to be a bit faster than buffered writes:

root@snb /home/wfg# dd if=/dev/zero of=/mnt/file bs=1M count=1024 conv=fsync

1073741824 bytes (1.1 GB) copied, 10.4039 s, 103 MB/s
1073741824 bytes (1.1 GB) copied, 10.4143 s, 103 MB/s

root@snb /home/wfg# dd if=/dev/zero of=/mnt/file bs=1M count=1024 oflag=direct conv=fsync

1073741824 bytes (1.1 GB) copied, 9.9006 s, 108 MB/s
1073741824 bytes (1.1 GB) copied, 9.55173 s, 112 MB/s

root@snb /home/wfg# dd if=/dev/zero of=/mnt/file bs=64k count=16384 oflag=direct conv=fsync

1073741824 bytes (1.1 GB) copied, 9.83902 s, 109 MB/s
1073741824 bytes (1.1 GB) copied, 9.61725 s, 112 MB/s

> And actually 64k vs 1M makes a big difference on my machine:
> xen-node0:~ # dd if=/dev/zero of=/mnt/file bs=64k count=16384 oflag=direct conv=fsync
> 16384+0 records in
> 16384+0 records out
> 1073741824 bytes (1.1 GB) copied, 19.3176 s, 55.6 MB/s

Interestingly, my 64k direct writes are as fast as 1M direct writes...
and 4k writes run at ~1/4 speed:

root@snb /home/wfg# dd if=/dev/zero of=/mnt/file bs=4k count=$((256<<10)) oflag=direct conv=fsync

1073741824 bytes (1.1 GB) copied, 42.0726 s, 25.5 MB/s
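
One way to tell whether the 64k requests are being merged into larger
ones at the device (which would explain the parity with bs=1M) is to
watch the average request size while dd runs. A rough sketch, assuming
sysstat's iostat and that /mnt sits on sda:

# avgrq-sz is in 512-byte sectors; values near 2048 mean ~1MB requests
iostat -dx sda 1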

Thanks,
Fengguang

[-- Attachment #2: balance_dirty_pages-task-bw.png --]
[-- Type: image/png, Size: 61279 bytes --]


^ permalink raw reply	[flat|nested] 262+ messages in thread

* Re: [RFC] writeback and cgroup
  2012-04-24  7:58                         ` Fengguang Wu
  (?)
@ 2012-04-25 15:47                         ` Tejun Heo
  -1 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-25 15:47 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jens Axboe, ctalbott-hpIqsD4AKlfQT0dZR+AlfA, Jan Kara,
	rni-hpIqsD4AKlfQT0dZR+AlfA, andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA,
	Mel Gorman

Hey, Fengguang.

On Tue, Apr 24, 2012 at 03:58:53PM +0800, Fengguang Wu wrote:
> > I have two questions.  Why do we need memcg for this?  Writeback
> > currently works without memcg, right?  Why does that change with blkcg
> > aware bdi?
> 
> Yeah currently writeback does not depend on memcg. As for blkcg, it's
> necessary to keep a number of dirty pages for each blkcg, so that the
> cfq groups' async IO queue does not go empty and lose its turn to do
> IO. memcg provides the proper infrastructure to account dirty pages.
> 
> In a previous email, we have an example of two 10:1 weight cgroups,
> each running one dd. They will make two IO pipes, each holding a number
> of dirty pages. Since cfq honors dd-1 much more IO bandwidth, dd-1's
> dirty pages are consumed quickly. However balance_dirty_pages(),
> without knowing about cfq's bandwidth divisions, is throttling the
> two dd tasks equally. So dd-1 will be producing dirty pages much
> slower than cfq is consuming them. The flusher thus won't send enough
> dirty pages down to fill the corresponding async IO queue for dd-1.
> cfq cannot really give dd-1 more bandwidth share due to lack of data
> feed. The end result will be: the two cgroups get 1:1 bandwidth share
> honored by balance_dirty_pages() even though cfq honors 10:1 weights
> to them.

My question is why can't a cgroup-bdi pair be handled the same or a
similar way to how each bdi is handled now?  I haven't looked through
the code yet but something is determining, even inadvertently, the
dirty memory usage among different bdi's, right?  What I'm curious
about is why cgroupfying bdi makes any difference to that.  If it's
nondeterministic w/o memcg, let it be that way with blkcg too.  Just
treat cgroup-bdi as separate bdis.  So, what changes?
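
(FWIW, there already are per-bdi knobs that bound how much of the dirty
memory a single bdi may take; a sketch, with the 8:0 bdi and the value
below being arbitrary:

cat /sys/class/bdi/8:0/min_ratio /sys/class/bdi/8:0/max_ratio
echo 20 > /sys/class/bdi/8:0/max_ratio   # cap this bdi at 20% of the write-back cache

so some per-bdi apportioning already exists today.)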

> However if it's a large memory machine whose dirty pages get
> partitioned to 100 cgroups, the flusher will be serving them
> in round robin fashion.

Just treat cgroup-bdi as a separate bdi.  Run an independent flusher
on it.  They're separate channels.

> blkio.weight will be the "number" shared and interpreted by all IO
> controller entities, whether it be cfq, NFS or balance_dirty_pages().

It already isn't.  blk-throttle is an IO controller entity but doesn't
make use of weight.
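
To make that concrete with the blkio cgroup interface (group name,
device number and values below are arbitrary):

# cfq proportional share - a relative weight, no absolute unit
echo 500 > /cgroup/grp/blkio.weight

# blk-throttle - an absolute bytes/sec cap on device 8:0, weight plays no part
echo "8:0 10485760" > /cgroup/grp/blkio.throttle.write_bps_device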

> > However, this doesn't necessarily translate easily into the actual
> > underlying IO resource.  For devices with spindle, seek time dominates
> > and the same amount of IO may consume vastly different amount of IO
> > and the disk time becomes the primary resource, not the iops or
> > bandwidth.  Naturally, people want to allocate and limit the primary
> > resource, so cfq distributes disk time across different cgroups as
> > configured.
> 
> Right. balance_dirty_pages() is always doing dirty throttling wrt.
> bandwidth, even in your back pressure scheme, isn't it? In this regard,
> there are nothing fundamentally different between our proposals. They

If balance_dirty_pages() fails to keep the IO buffer full, it's
balance_dirty_pages()'s failure (and doing so from time to time could
be fine given enough benefits), but no matter what writeback does,
blkcg *should* enforce the configured limits, so they're quite
different in terms of encapsulation and functionality.

> > Your suggested solution is applying the same a number - the weight -
> > to one portion of a mostly arbitrarily split resource using a
> > different unit.  I don't even understand what that achieves.
> 
> You seem to miss my stated plan: next step, balance_dirty_pages() will
> get some feedback information from cfq to adjust its bandwidth targets
> accordingly. That information will be
> 
>         io_cost = charge/sectors
> 
> The charge value is exactly the value computed in cfq_group_served(),
> which is the slice time or IOs dispatched depending the mode cfq is
> operating in. By dividing ratelimit by the normalized io_cost,
> balance_dirty_pages() will automatically get the same weight
> interpretation as cfq. For example, on spin disks, it will be able to
> allocate lower bandwidth to seeky cgroups due to the larger io_cost
> reported by cfq.

So, cfq is basing its cost calculation on disk time spent by sync IOs,
which is perturbed by uncategorized async IOs, and you're gonna
apply that number to async IOs in some magical way?  What the hell
does that achieve?

Please take a step back and look at the whole stack and think about
what each part is supposed to do and how they are supposed to
interact.  If you still can't see the mess you're trying to make,
ummm... I don't know.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

* [RFC] writeback and cgroup
@ 2012-04-03 18:36 Tejun Heo
  0 siblings, 0 replies; 262+ messages in thread
From: Tejun Heo @ 2012-04-03 18:36 UTC (permalink / raw)
  To: Fengguang Wu, Jan Kara, vgoyal-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	andrea-oIIqvOZpAevzfdHfmsDf5w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hello, guys.

So, during LSF, I, Fengguang and Jan had a chance to sit down and talk
about how to cgroup support to writeback.  Here's what I got from it.

Fengguang's opinion is that the throttling algorithm implemented in
writeback is good enough and blkcg parameters can be exposed to
writeback such that those limits can be applied from writeback.  As
for reads and direct IOs, Fengguang opined that the algorithm can
easily be extended to cover those cases and IIUC all IOs, whether
buffered writes, reads or direct IOs can eventually all go through
writeback layer which will be the one layer controlling all IOs.

Unfortunately, I don't agree with that at all.  I think it's a gross
layering violation and lacks any longterm design.  We have a well
working model of applying and propagating resource pressure - we apply
the pressure where the resource exists and propagates the back
pressure through buffers to upper layers upto the originator.  Think
about network, the pressure exists or is applied at the in/egress
points which gets propagated through socket buffers and eventually
throttles the originator.

Writeback, without cgroup, isn't different.  It consists a part of the
pressure propagation chain anchored at the IO device.  IO devices
these days generate very high pressure, which gets propgated through
the IO sched and buffered requests, which in turn creates pressure at
writeback.  Here, the buffering happens in page cache and pressure at
writeback increases the amount of dirty page cache.  Propagating this
IO pressure to the dirtying task is one of the biggest
responsibililties of the writeback code, and this is the underlying
design of the whole thing.

IIUC, without cgroup, the current writeback code works more or less
like this.  Throwing in cgroup doesn't really change the fundamental
design.  Instead of a single pipe going down, we just have multiple
pipes to the same device, each of which should be treated separately.
Of course, a spinning disk can't be divided that easily and their
performance characteristics will be inter-dependent, but the place to
solve that problem is where the problem is, the block layer.

We may have to look for optimizations and expose some details to
improve the overall behavior and such optimizations may require some
deviation from the fundamental design, but such optimizations should
be justified and such deviations kept at minimum, so, no, I don't
think we're gonna be expose blkcg / block / elevator parameters
directly to writeback.  Unless someone can *really* convince me
otherwise, I'll be vetoing any change toward that direction.

Let's please keep the layering clear.  IO limitations will be applied
at the block layer and pressure will be formed there and then
propagated upwards eventually to the originator.  Sure, exposing the
whole information might result in better behavior for certain
workloads, but down the road, say, in three or five years, devices
which can be shared without worrying too much about seeks might be
commonplace and we could be swearing at a disgusting structural mess,
and sadly various cgroup support seems to be a prominent source of
such design failures.

IMHO, treating each cgroup - device/bdi pair as a separate device should
suffice as the underlying design.  After all, blkio cgroup support's
ultimate goal is dividing the IO resource into separate bins.
Implementation details might change as underlying technology changes
and we learn more about how to do it better but that is the goal which
we'll always try to keep close to.  Writeback should (be able to)
treat them as separate devices.  We surely will need adjustments and
optimizations to make things work at least somewhat reasonably but
that is the baseline.
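
(One can already see this kind of separation in the per-bdi flusher
threads; e.g. on a current 3.x kernel:

ps -eo comm | grep '^flush-'	# one flush-<major>:<minor> thread per bdi

The idea is essentially to extend that notion to cgroup-bdi pairs.)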

In the discussion, for such implementation, the following obstacles
were identified.

* There are a lot of cases where IOs are issued by a task which isn't
  the originator, i.e. writeback issues IOs for pages which were
  dirtied by other tasks.  So, by the time an IO reaches the
  block layer, we don't know which cgroup the IO belongs to.

  Recently, the block layer has grown support to attach a task to a bio
  which causes the bio to be handled as if it were issued by the
  associated task regardless of the actual issuing task.  It currently
  only allows attaching %current to a bio - bio_associate_current() -
  but changing it to support other tasks is trivial.

  We'll need to update the async issuers to tag the IOs they issue but
  the mechanism is already there.

* There's a single request pool shared by all issuers per request
  queue.  This can lead to priority inversion among cgroups.  Note
  that the problem also exists without cgroups: a lower-ioprio issuer
  may be holding a request, holding back a high-prio issuer.

  We'll need to make request allocation cgroup (and hopefully ioprio)
  aware.  Probably in the form of separate request pools.  This will
  take some work but I don't think this will be too challenging.  I'll
  work on it.

* cfq cgroup policy throws all async IOs, which all buffered writes
  are, into the shared cgroup regardless of the actual cgroup.  This
  behavior is, I believe, mostly historical and changing it isn't
  difficult.  Prolly only a few tens of lines of changes.  This may
  cause significant changes to actual IO behavior with cgroups tho.  I
  personally think the previous behavior was too wrong to keep (the
  weight was completely ignored for buffered writes) but we may want
  to introduce a switch to toggle between the two behaviors.

  Note that blk-throttle doesn't have this problem.

* Unlike dirty data pages, metadata tends to have strict ordering
  requirements and thus is susceptible to priority inversion.  Two
  solutions were suggested - 1. allow overdraw for metadata writes so
  that low prio metadata writes don't block the whole FS, 2. provide
  an interface to query and wait for bdi-cgroup congestion which can
  be called from FS metadata paths to throttle metadata operations
  before they enter the stream of ordered operations.

  I think a combination of the above two should be enough to solve
  the problem.  I *think* the second can be implemented as part of
  cgroup aware request allocation update.  The first one needs a bit
  more thinking but there can be easier interim solutions (e.g. throw
  META writes to the head of the cgroup queue or just plain ignore
  cgroup limits for META writes) for now.

* I'm sure there are a lot of design choices to be made in the
  writeback implementation but IIUC Jan seems to agree that the
  simplest would be to simply treat different cgroup-bdi pairs as
  completely separate, which shouldn't add too much complexity to the
  already intricate writeback code.

So, I think we have something which sounds like a plan, which at least
I can agree with and seems doable without adding a lot of complexity.

Jan, Fengguang, I'm pretty sure I missed some stuff from writeback's
side and IIUC Fengguang doesn't agree with this approach too much, so
please voice your opinions & comments.

Thank you.

--
tejun

^ permalink raw reply	[flat|nested] 262+ messages in thread

end of thread, other threads:[~2012-04-25 15:47 UTC | newest]

Thread overview: 262+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-03 18:36 [RFC] writeback and cgroup Tejun Heo
2012-04-03 18:36 ` Tejun Heo
2012-04-03 18:36 ` Tejun Heo
2012-04-04 14:51 ` Vivek Goyal
2012-04-04 14:51   ` Vivek Goyal
2012-04-04 15:36   ` [Lsf] " Steve French
2012-04-04 15:36     ` Steve French
2012-04-04 15:36     ` Steve French
2012-04-04 18:56     ` Tejun Heo
2012-04-04 18:56       ` Tejun Heo
2012-04-04 19:19       ` Vivek Goyal
2012-04-04 19:19         ` Vivek Goyal
2012-04-25  8:47         ` Suresh Jayaraman
2012-04-25  8:47           ` Suresh Jayaraman
2012-04-25  8:47           ` Suresh Jayaraman
     [not found]         ` <20120404191918.GK12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-25  8:47           ` Suresh Jayaraman
     [not found]       ` <20120404185605.GC29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
2012-04-04 19:19         ` Vivek Goyal
     [not found]     ` <CAH2r5mtwQa0Uu=_Yd2JywVJXA=OMGV43X_OUfziC-yeVy9BGtQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-04-04 18:56       ` Tejun Heo
2012-04-04 18:49   ` Tejun Heo
2012-04-04 18:49     ` Tejun Heo
2012-04-04 18:49     ` Tejun Heo
2012-04-04 19:23     ` [Lsf] " Steve French
2012-04-04 19:23       ` Steve French
     [not found]       ` <CAH2r5mvP56D0y4mk5wKrJcj+=OZ0e0Q5No_L+9a8a=GMcEhRew-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-04-14 12:15         ` Peter Zijlstra
2012-04-14 12:15       ` Peter Zijlstra
2012-04-14 12:15         ` Peter Zijlstra
2012-04-14 12:15         ` Peter Zijlstra
     [not found]     ` <20120404184909.GB29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
2012-04-04 19:23       ` Steve French
2012-04-04 20:32       ` Vivek Goyal
2012-04-05 16:38       ` Tejun Heo
2012-04-14 11:53       ` [Lsf] " Peter Zijlstra
2012-04-04 20:32     ` Vivek Goyal
2012-04-04 20:32       ` Vivek Goyal
     [not found]       ` <20120404203239.GM12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-04 23:02         ` Tejun Heo
2012-04-04 23:02       ` Tejun Heo
2012-04-04 23:02         ` Tejun Heo
2012-04-04 23:02         ` Tejun Heo
2012-04-05 16:38     ` Tejun Heo
2012-04-05 16:38       ` Tejun Heo
2012-04-05 16:38       ` Tejun Heo
     [not found]       ` <20120405163854.GE12854-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-05 17:13         ` Vivek Goyal
2012-04-05 17:13           ` Vivek Goyal
2012-04-05 17:13           ` Vivek Goyal
2012-04-14 11:53     ` [Lsf] " Peter Zijlstra
2012-04-14 11:53       ` Peter Zijlstra
2012-04-14 11:53       ` Peter Zijlstra
2012-04-16  1:25       ` Steve French
     [not found]   ` <20120404145134.GC12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-04 15:36     ` Steve French
2012-04-04 18:49     ` Tejun Heo
2012-04-07  8:00     ` Jan Kara
2012-04-07  8:00   ` Jan Kara
2012-04-07  8:00     ` Jan Kara
     [not found]     ` <20120407080027.GA2584-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2012-04-10 16:23       ` [Lsf] " Steve French
2012-04-10 18:06       ` Vivek Goyal
2012-04-10 16:23     ` [Lsf] " Steve French
2012-04-10 16:23       ` Steve French
2012-04-10 16:23       ` Steve French
     [not found]       ` <CAH2r5mvLVnM3Se5vBBsYzwaz5Ckp3i6SVnGp2T0XaGe9_u8YYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-04-10 18:16         ` Vivek Goyal
2012-04-10 18:16       ` Vivek Goyal
2012-04-10 18:16         ` Vivek Goyal
2012-04-10 18:06     ` Vivek Goyal
2012-04-10 18:06       ` Vivek Goyal
     [not found]       ` <20120410180653.GJ21801-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-10 21:05         ` Jan Kara
2012-04-10 21:05           ` Jan Kara
2012-04-10 21:05           ` Jan Kara
     [not found]           ` <20120410210505.GE4936-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2012-04-10 21:20             ` Vivek Goyal
2012-04-10 21:20           ` Vivek Goyal
2012-04-10 21:20             ` Vivek Goyal
     [not found]             ` <20120410212041.GP21801-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-10 22:24               ` Jan Kara
2012-04-10 22:24             ` Jan Kara
2012-04-10 22:24               ` Jan Kara
     [not found]               ` <20120410222425.GF4936-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2012-04-11 15:40                 ` Vivek Goyal
2012-04-11 15:40                   ` Vivek Goyal
2012-04-11 15:40                   ` Vivek Goyal
2012-04-11 15:45                   ` Vivek Goyal
2012-04-11 15:45                     ` Vivek Goyal
     [not found]                     ` <20120411154531.GE16692-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-11 17:05                       ` Jan Kara
2012-04-11 17:05                     ` Jan Kara
2012-04-11 17:05                       ` Jan Kara
2012-04-11 17:23                       ` Vivek Goyal
2012-04-11 17:23                         ` Vivek Goyal
     [not found]                         ` <20120411172311.GF16692-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-11 19:44                           ` Jan Kara
2012-04-11 19:44                             ` Jan Kara
2012-04-11 19:44                             ` Jan Kara
     [not found]                       ` <20120411170542.GB16008-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2012-04-11 17:23                         ` Vivek Goyal
2012-04-17 21:48                         ` Tejun Heo
2012-04-17 21:48                       ` Tejun Heo
2012-04-17 21:48                         ` Tejun Heo
2012-04-17 21:48                         ` Tejun Heo
2012-04-18 18:18                         ` Vivek Goyal
2012-04-18 18:18                           ` Vivek Goyal
     [not found]                         ` <20120417214831.GE19975-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-18 18:18                           ` Vivek Goyal
     [not found]                   ` <20120411154005.GD16692-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-11 15:45                     ` Vivek Goyal
2012-04-11 19:22                     ` Jan Kara
2012-04-14 12:25                     ` [Lsf] " Peter Zijlstra
2012-04-11 19:22                   ` Jan Kara
2012-04-11 19:22                     ` Jan Kara
     [not found]                     ` <20120411192231.GF16008-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2012-04-12 20:37                       ` Vivek Goyal
2012-04-12 20:37                         ` Vivek Goyal
2012-04-12 20:37                         ` Vivek Goyal
2012-04-12 20:51                         ` Tejun Heo
2012-04-12 20:51                           ` Tejun Heo
     [not found]                           ` <20120412205148.GA24056-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-14 14:36                             ` Fengguang Wu
2012-04-14 14:36                               ` Fengguang Wu
2012-04-16 14:57                               ` Vivek Goyal
2012-04-16 14:57                                 ` Vivek Goyal
2012-04-24 11:33                                 ` Fengguang Wu
2012-04-24 11:33                                   ` Fengguang Wu
2012-04-24 14:56                                   ` Jan Kara
2012-04-24 14:56                                   ` Jan Kara
2012-04-24 14:56                                     ` Jan Kara
2012-04-24 14:56                                     ` Jan Kara
2012-04-24 15:58                                     ` Vivek Goyal
2012-04-24 15:58                                       ` Vivek Goyal
     [not found]                                       ` <20120424155843.GG26708-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-25  2:42                                         ` Fengguang Wu
2012-04-25  2:42                                       ` Fengguang Wu
2012-04-25  2:42                                         ` Fengguang Wu
2012-04-25  2:42                                         ` Fengguang Wu
     [not found]                                     ` <20120424145655.GA1474-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2012-04-24 15:58                                       ` Vivek Goyal
2012-04-25  3:16                                       ` Fengguang Wu
2012-04-25  3:16                                         ` Fengguang Wu
2012-04-25  9:01                                         ` Jan Kara
2012-04-25  9:01                                           ` Jan Kara
2012-04-25  9:01                                           ` Jan Kara
     [not found]                                           ` <20120425090156.GB12568-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2012-04-25 12:05                                             ` Fengguang Wu
2012-04-25 12:05                                               ` Fengguang Wu
2012-04-25  9:01                                         ` Jan Kara
     [not found]                                 ` <20120416145744.GA15437-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-24 11:33                                   ` Fengguang Wu
2012-04-16 14:57                               ` Vivek Goyal
2012-04-15 11:37                         ` [Lsf] " Peter Zijlstra
2012-04-15 11:37                           ` Peter Zijlstra
2012-04-15 11:37                           ` Peter Zijlstra
     [not found]                         ` <20120412203719.GL2207-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-12 20:51                           ` Tejun Heo
2012-04-15 11:37                           ` [Lsf] " Peter Zijlstra
2012-04-17 22:01                       ` Tejun Heo
2012-04-17 22:01                     ` Tejun Heo
2012-04-17 22:01                       ` Tejun Heo
2012-04-17 22:01                       ` Tejun Heo
2012-04-18  6:30                       ` Jan Kara
2012-04-18  6:30                         ` Jan Kara
     [not found]                       ` <20120417220106.GF19975-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-18  6:30                         ` Jan Kara
2012-04-14 12:25                   ` [Lsf] " Peter Zijlstra
2012-04-14 12:25                     ` Peter Zijlstra
2012-04-14 12:25                     ` Peter Zijlstra
2012-04-16 12:54                     ` Vivek Goyal
2012-04-16 12:54                       ` Vivek Goyal
2012-04-16 12:54                       ` Vivek Goyal
2012-04-16 13:07                       ` Fengguang Wu
2012-04-16 13:07                         ` Fengguang Wu
2012-04-16 14:19                         ` Fengguang Wu
2012-04-16 14:19                         ` Fengguang Wu
2012-04-16 14:19                           ` Fengguang Wu
2012-04-16 15:52                         ` Vivek Goyal
2012-04-16 15:52                         ` Vivek Goyal
2012-04-16 15:52                           ` Vivek Goyal
     [not found]                           ` <20120416155207.GB15437-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-17  2:14                             ` Fengguang Wu
2012-04-17  2:14                               ` Fengguang Wu
2012-04-17  2:14                               ` Fengguang Wu
     [not found]                       ` <20120416125432.GB12776-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-16 13:07                         ` Fengguang Wu
     [not found] ` <20120403183655.GA23106-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
2012-04-04 14:51   ` Vivek Goyal
2012-04-04 17:51   ` Fengguang Wu
2012-04-04 17:51     ` Fengguang Wu
2012-04-04 17:51     ` Fengguang Wu
2012-04-04 18:35     ` Vivek Goyal
2012-04-04 18:35       ` Vivek Goyal
     [not found]       ` <20120404183528.GJ12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-04 21:42         ` Fengguang Wu
2012-04-04 21:42           ` Fengguang Wu
2012-04-04 21:42           ` Fengguang Wu
2012-04-05 15:10           ` Vivek Goyal
2012-04-05 15:10             ` Vivek Goyal
     [not found]             ` <20120405151026.GB23999-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-06  0:32               ` Fengguang Wu
2012-04-06  0:32             ` Fengguang Wu
2012-04-06  0:32               ` Fengguang Wu
2012-04-05 15:10           ` Vivek Goyal
2012-04-04 18:35     ` Vivek Goyal
2012-04-04 19:33     ` Tejun Heo
2012-04-04 19:33       ` Tejun Heo
2012-04-04 19:33       ` Tejun Heo
2012-04-06  9:59       ` Fengguang Wu
2012-04-06  9:59         ` Fengguang Wu
2012-04-06  9:59         ` Fengguang Wu
2012-04-17 22:38         ` Tejun Heo
2012-04-17 22:38         ` Tejun Heo
2012-04-17 22:38           ` Tejun Heo
2012-04-17 22:38           ` Tejun Heo
2012-04-19 14:23           ` Fengguang Wu
2012-04-19 14:23             ` Fengguang Wu
2012-04-19 14:23             ` Fengguang Wu
2012-04-19 18:31             ` Vivek Goyal
2012-04-19 18:31               ` Vivek Goyal
     [not found]               ` <20120419183118.GM10216-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-20 12:45                 ` Fengguang Wu
2012-04-20 12:45               ` Fengguang Wu
2012-04-20 12:45                 ` Fengguang Wu
2012-04-20 19:29                 ` Vivek Goyal
2012-04-20 19:29                   ` Vivek Goyal
2012-04-20 21:33                   ` Tejun Heo
2012-04-20 21:33                     ` Tejun Heo
2012-04-22 14:26                     ` Fengguang Wu
2012-04-22 14:26                       ` Fengguang Wu
     [not found]                     ` <20120420213301.GA29134-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-22 14:26                       ` Fengguang Wu
2012-04-23 12:30                       ` Vivek Goyal
2012-04-23 12:30                     ` Vivek Goyal
2012-04-23 12:30                       ` Vivek Goyal
2012-04-23 16:04                       ` Tejun Heo
2012-04-23 16:04                         ` Tejun Heo
     [not found]                       ` <20120423123011.GA8103-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-23 16:04                         ` Tejun Heo
     [not found]                   ` <20120420192930.GR22419-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-20 21:33                     ` Tejun Heo
2012-04-20 19:29                 ` Vivek Goyal
2012-04-19 18:31             ` Vivek Goyal
2012-04-19 20:26             ` Jan Kara
2012-04-19 20:26               ` Jan Kara
2012-04-19 20:26               ` Jan Kara
2012-04-20 13:34               ` Fengguang Wu
2012-04-20 13:34                 ` Fengguang Wu
2012-04-20 19:08                 ` Tejun Heo
2012-04-20 19:08                 ` Tejun Heo
2012-04-20 19:08                   ` Tejun Heo
     [not found]                   ` <20120420190844.GH32324-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-22 14:46                     ` Fengguang Wu
2012-04-22 14:46                   ` Fengguang Wu
2012-04-22 14:46                     ` Fengguang Wu
2012-04-22 14:46                     ` Fengguang Wu
2012-04-23 16:56                     ` Tejun Heo
2012-04-23 16:56                       ` Tejun Heo
2012-04-23 16:56                       ` Tejun Heo
     [not found]                       ` <20120423165626.GB5406-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-24  7:58                         ` Fengguang Wu
2012-04-24  7:58                       ` Fengguang Wu
2012-04-24  7:58                         ` Fengguang Wu
2012-04-25 15:47                         ` Tejun Heo
2012-04-25 15:47                         ` Tejun Heo
2012-04-25 15:47                           ` Tejun Heo
2012-04-23  9:14                 ` Jan Kara
2012-04-23  9:14                   ` Jan Kara
2012-04-23  9:14                   ` Jan Kara
2012-04-23 10:24                   ` Fengguang Wu
2012-04-23 10:24                     ` Fengguang Wu
2012-04-23 12:42                     ` Jan Kara
2012-04-23 12:42                       ` Jan Kara
2012-04-23 14:31                       ` Fengguang Wu
2012-04-23 14:31                         ` Fengguang Wu
     [not found]                       ` <20120423124240.GE6512-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2012-04-23 14:31                         ` Fengguang Wu
2012-04-23 12:42                     ` Jan Kara
     [not found]                   ` <20120423091432.GC6512-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2012-04-23 10:24                     ` Fengguang Wu
     [not found]               ` <20120419202635.GA4795-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2012-04-20 13:34                 ` Fengguang Wu
     [not found]           ` <20120417223854.GG19975-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-19 14:23             ` Fengguang Wu
2012-04-18  6:57         ` Jan Kara
2012-04-18  6:57           ` Jan Kara
2012-04-18  7:58           ` Fengguang Wu
2012-04-18  7:58             ` Fengguang Wu
2012-04-18  7:58             ` Fengguang Wu
     [not found]           ` <20120418065720.GA21485-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2012-04-18  7:58             ` Fengguang Wu
2012-04-18  6:57         ` Jan Kara
     [not found]       ` <20120404193355.GD29686-RcKxWJ4Cfj1J2suj2OqeGauc2jM2gXBXkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
2012-04-04 20:18         ` Vivek Goyal
2012-04-04 20:18           ` Vivek Goyal
2012-04-04 20:18           ` Vivek Goyal
2012-04-05 16:31           ` Tejun Heo
2012-04-05 16:31             ` Tejun Heo
2012-04-05 17:09             ` Vivek Goyal
2012-04-05 17:09               ` Vivek Goyal
     [not found]             ` <20120405163113.GD12854-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-05 17:09               ` Vivek Goyal
     [not found]           ` <20120404201816.GL12676-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-05 16:31             ` Tejun Heo
2012-04-06  9:59         ` Fengguang Wu
2012-04-03 18:36 Tejun Heo
