All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [Lsf] Preliminary Agenda and Activities for LSF
       [not found] <1301373398.2590.20.camel@mulgrave.site>
@ 2011-03-29  5:14 ` Amir Goldstein
  2011-03-29 11:16 ` Ric Wheeler
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 166+ messages in thread
From: Amir Goldstein @ 2011-03-29  5:14 UTC (permalink / raw)
  To: James Bottomley; +Cc: lsf, linux-fsdevel

On Tue, Mar 29, 2011 at 6:36 AM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> Hi All,
>
> Since LSF is less than a week away, the programme committee put together
> a just in time preliminary agenda for LSF.  As you can see there is
> still plenty of empty space, which you can make suggestions (to this
> list with appropriate general list cc's) for filling:

Hi James,

I would like to give a session on Ext4 snapshots overview and
development status update.
Perhaps on the 2nd day?

I would also like to run a session on common API's and common challenges
for snapshotting file systems (ext4, ocfs2, nilfs2 and btrfs are the
ones I know of).

Amir.

>
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html
>
> If you don't make suggestions, the programme committee will feel
> empowered to make arbitrary assignments based on your topic and attendee
> email requests ...
>
> We're still not quite sure what rooms we will have at the Kabuki, but
> we'll add them to the spreadsheet when we know (they should be close to
> each other).
>
> The spreadsheet above also gives contact information for all the
> attendees and the programme committee.
>
> Yours,
>
> James Bottomley
> on behalf of LSF/MM Programme Committee
>
>
> _______________________________________________
> Lsf mailing list
> Lsf@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/lsf
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
       [not found] <1301373398.2590.20.camel@mulgrave.site>
  2011-03-29  5:14 ` [Lsf] Preliminary Agenda and Activities for LSF Amir Goldstein
@ 2011-03-29 11:16 ` Ric Wheeler
  2011-03-29 11:22   ` Matthew Wilcox
                     ` (4 more replies)
  2011-03-29 15:35 ` [LSF][MM] page allocation & direct reclaim latency Rik van Riel
  2011-03-29 17:35 ` [Lsf] Preliminary Agenda and Activities for LSF Chad Talbott
  3 siblings, 5 replies; 166+ messages in thread
From: Ric Wheeler @ 2011-03-29 11:16 UTC (permalink / raw)
  To: James Bottomley; +Cc: lsf, linux-fsdevel, linux-scsi, device-mapper development

On 03/29/2011 12:36 AM, James Bottomley wrote:
> Hi All,
>
> Since LSF is less than a week away, the programme committee put together
> a just in time preliminary agenda for LSF.  As you can see there is
> still plenty of empty space, which you can make suggestions (to this
> list with appropriate general list cc's) for filling:
>
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html
>
> If you don't make suggestions, the programme committee will feel
> empowered to make arbitrary assignments based on your topic and attendee
> email requests ...
>
> We're still not quite sure what rooms we will have at the Kabuki, but
> we'll add them to the spreadsheet when we know (they should be close to
> each other).
>
> The spreadsheet above also gives contact information for all the
> attendees and the programme committee.
>
> Yours,
>
> James Bottomley
> on behalf of LSF/MM Programme Committee
>

Here are a few topic ideas:

(1)  The first topic that might span IO & FS tracks (or just pull in device 
mapper people to an FS track) could be adding new commands that would allow 
users to grow/shrink/etc file systems in a generic way.  The thought I had was 
that we have a reasonable model that we could reuse for these new commands like 
mount and mount.fs or fsck and fsck.fs. With btrfs coming down the road, it 
could be nice to identify exactly what common operations users want to do and 
agree on how to implement them. Alasdair pointed out in the upstream thread that 
we had a prototype here in fsadm.

(2) Very high speed, low latency SSD devices and testing. Have we settled on the 
need for these devices to all have block level drivers? For S-ATA or SAS 
devices, are there known performance issues that require enhancements in 
somewhere in the stack?

(3) The union mount versus overlayfs debate - pros and cons. What each do well, 
what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes 
in Al's VFS session?)

Thanks!

Ric


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:16 ` Ric Wheeler
@ 2011-03-29 11:22   ` Matthew Wilcox
  2011-03-29 12:17     ` Jens Axboe
  2011-03-29 17:20     ` Shyam_Iyer
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 166+ messages in thread
From: Matthew Wilcox @ 2011-03-29 11:22 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: James Bottomley, lsf, linux-fsdevel, linux-scsi,
	device-mapper development

On Tue, Mar 29, 2011 at 07:16:32AM -0400, Ric Wheeler wrote:
> (2) Very high speed, low latency SSD devices and testing. Have we settled 
> on the need for these devices to all have block level drivers? For S-ATA 
> or SAS devices, are there known performance issues that require 
> enhancements in somewhere in the stack?

I can throw together a quick presentation on this topic.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:22   ` Matthew Wilcox
@ 2011-03-29 12:17     ` Jens Axboe
  2011-03-29 13:09         ` Martin K. Petersen
  0 siblings, 1 reply; 166+ messages in thread
From: Jens Axboe @ 2011-03-29 12:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf, linux-fsdevel, device-mapper development, Ric Wheeler, linux-scsi

On 2011-03-29 13:22, Matthew Wilcox wrote:
> On Tue, Mar 29, 2011 at 07:16:32AM -0400, Ric Wheeler wrote:
>> (2) Very high speed, low latency SSD devices and testing. Have we settled 
>> on the need for these devices to all have block level drivers? For S-ATA 
>> or SAS devices, are there known performance issues that require 
>> enhancements in somewhere in the stack?
> 
> I can throw together a quick presentation on this topic.

I'll join that too.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 12:17     ` Jens Axboe
@ 2011-03-29 13:09         ` Martin K. Petersen
  0 siblings, 0 replies; 166+ messages in thread
From: Martin K. Petersen @ 2011-03-29 13:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matthew Wilcox, lsf, linux-fsdevel, device-mapper development,
	Ric Wheeler, linux-scsi

>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:

>> I can throw together a quick presentation on this topic.

Jens> I'll join that too.

Stack tuning aside, maybe Matthew can speak a bit about NVMe and I'll
cover what's going on with the SCSI over PCIe efforts...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
@ 2011-03-29 13:09         ` Martin K. Petersen
  0 siblings, 0 replies; 166+ messages in thread
From: Martin K. Petersen @ 2011-03-29 13:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matthew Wilcox, lsf, linux-fsdevel, device-mapper development,
	Ric Wheeler, linux-scsi

>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:

>> I can throw together a quick presentation on this topic.

Jens> I'll join that too.

Stack tuning aside, maybe Matthew can speak a bit about NVMe and I'll
cover what's going on with the SCSI over PCIe efforts...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 13:09         ` Martin K. Petersen
  (?)
@ 2011-03-29 13:12         ` Ric Wheeler
  -1 siblings, 0 replies; 166+ messages in thread
From: Ric Wheeler @ 2011-03-29 13:12 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jens Axboe, linux-scsi, lsf, device-mapper development,
	linux-fsdevel, Ric Wheeler

On 03/29/2011 09:09 AM, Martin K. Petersen wrote:
>
> Jens>  I'll join that too.
>
> Stack tuning aside, maybe Matthew can speak a bit about NVMe and I'll
> cover what's going on with the SCSI over PCIe efforts...

That sounds interesting to me...

Ric


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 13:09         ` Martin K. Petersen
  (?)
  (?)
@ 2011-03-29 13:38         ` James Bottomley
  -1 siblings, 0 replies; 166+ messages in thread
From: James Bottomley @ 2011-03-29 13:38 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jens Axboe, Matthew Wilcox, lsf, linux-fsdevel,
	device-mapper development, Ric Wheeler, linux-scsi

On Tue, 2011-03-29 at 09:09 -0400, Martin K. Petersen wrote:
> >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:
> 
> >> I can throw together a quick presentation on this topic.
> 
> Jens> I'll join that too.
> 
> Stack tuning aside, maybe Matthew can speak a bit about NVMe and I'll
> cover what's going on with the SCSI over PCIe efforts...

OK, I put you down for a joint sessions with FS and IO after the tea
break on Tuesday.

James




^ permalink raw reply	[flat|nested] 166+ messages in thread

* [LSF][MM] page allocation & direct reclaim latency
       [not found] <1301373398.2590.20.camel@mulgrave.site>
  2011-03-29  5:14 ` [Lsf] Preliminary Agenda and Activities for LSF Amir Goldstein
  2011-03-29 11:16 ` Ric Wheeler
@ 2011-03-29 15:35 ` Rik van Riel
  2011-03-29 19:05   ` [Lsf] " Andrea Arcangeli
  2011-03-29 17:35 ` [Lsf] Preliminary Agenda and Activities for LSF Chad Talbott
  3 siblings, 1 reply; 166+ messages in thread
From: Rik van Riel @ 2011-03-29 15:35 UTC (permalink / raw)
  To: lsf; +Cc: linux-mm

On 03/29/2011 12:36 AM, James Bottomley wrote:
> Hi All,
>
> Since LSF is less than a week away, the programme committee put together
> a just in time preliminary agenda for LSF.  As you can see there is
> still plenty of empty space, which you can make suggestions

There have been a few patches upstream by people for who
page allocation latency is a concern.

It may be worthwhile to have a short discussion on what
we can do to keep page allocation (and direct reclaim?)
latencies down to a minimum, reducing the slowdown that
direct reclaim introduces on some workloads.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* RE: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:16 ` Ric Wheeler
@ 2011-03-29 17:20     ` Shyam_Iyer
  2011-03-29 17:20     ` Shyam_Iyer
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-29 17:20 UTC (permalink / raw)
  To: rwheeler, James.Bottomley; +Cc: lsf, linux-fsdevel, linux-scsi, dm-devel



> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, March 29, 2011 7:17 AM
> To: James Bottomley
> Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> scsi@vger.kernel.org; device-mapper development
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On 03/29/2011 12:36 AM, James Bottomley wrote:
> > Hi All,
> >
> > Since LSF is less than a week away, the programme committee put
> together
> > a just in time preliminary agenda for LSF.  As you can see there is
> > still plenty of empty space, which you can make suggestions (to this
> > list with appropriate general list cc's) for filling:
> >
> >
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> >
> > If you don't make suggestions, the programme committee will feel
> > empowered to make arbitrary assignments based on your topic and
> attendee
> > email requests ...
> >
> > We're still not quite sure what rooms we will have at the Kabuki, but
> > we'll add them to the spreadsheet when we know (they should be close
> to
> > each other).
> >
> > The spreadsheet above also gives contact information for all the
> > attendees and the programme committee.
> >
> > Yours,
> >
> > James Bottomley
> > on behalf of LSF/MM Programme Committee
> >
> 
> Here are a few topic ideas:
> 
> (1)  The first topic that might span IO & FS tracks (or just pull in
> device
> mapper people to an FS track) could be adding new commands that would
> allow
> users to grow/shrink/etc file systems in a generic way.  The thought I
> had was
> that we have a reasonable model that we could reuse for these new
> commands like
> mount and mount.fs or fsck and fsck.fs. With btrfs coming down the
> road, it
> could be nice to identify exactly what common operations users want to
> do and
> agree on how to implement them. Alasdair pointed out in the upstream
> thread that
> we had a prototype here in fsadm.
> 
> (2) Very high speed, low latency SSD devices and testing. Have we
> settled on the
> need for these devices to all have block level drivers? For S-ATA or
> SAS
> devices, are there known performance issues that require enhancements
> in
> somewhere in the stack?
> 
> (3) The union mount versus overlayfs debate - pros and cons. What each
> do well,
> what needs doing. Do we want/need both upstream? (Maybe this can get 10
> minutes
> in Al's VFS session?)
> 
> Thanks!
> 
> Ric

A few others that I think may span across I/O, Block fs..layers.

1) Dm-thinp target vs File system thin profile vs block map based thin/trim profile. Facilitate I/O throttling for thin/trimmable storage. Online and Offline profil.
2) Interfaces for SCSI, Ethernet/*transport configuration parameters floating around in sysfs, procfs. Architecting guidelines for accepting patches for hybrid devices.
3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for all and they have to help each other
4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your subsystem and there are many non-cooperating B/W control constructs in each subsystem.

-Shyam

^ permalink raw reply	[flat|nested] 166+ messages in thread

* RE: [Lsf] Preliminary Agenda and Activities for LSF
@ 2011-03-29 17:20     ` Shyam_Iyer
  0 siblings, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-29 17:20 UTC (permalink / raw)
  To: rwheeler, James.Bottomley; +Cc: lsf, linux-fsdevel, linux-scsi, dm-devel



> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, March 29, 2011 7:17 AM
> To: James Bottomley
> Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> scsi@vger.kernel.org; device-mapper development
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On 03/29/2011 12:36 AM, James Bottomley wrote:
> > Hi All,
> >
> > Since LSF is less than a week away, the programme committee put
> together
> > a just in time preliminary agenda for LSF.  As you can see there is
> > still plenty of empty space, which you can make suggestions (to this
> > list with appropriate general list cc's) for filling:
> >
> >
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> >
> > If you don't make suggestions, the programme committee will feel
> > empowered to make arbitrary assignments based on your topic and
> attendee
> > email requests ...
> >
> > We're still not quite sure what rooms we will have at the Kabuki, but
> > we'll add them to the spreadsheet when we know (they should be close
> to
> > each other).
> >
> > The spreadsheet above also gives contact information for all the
> > attendees and the programme committee.
> >
> > Yours,
> >
> > James Bottomley
> > on behalf of LSF/MM Programme Committee
> >
> 
> Here are a few topic ideas:
> 
> (1)  The first topic that might span IO & FS tracks (or just pull in
> device
> mapper people to an FS track) could be adding new commands that would
> allow
> users to grow/shrink/etc file systems in a generic way.  The thought I
> had was
> that we have a reasonable model that we could reuse for these new
> commands like
> mount and mount.fs or fsck and fsck.fs. With btrfs coming down the
> road, it
> could be nice to identify exactly what common operations users want to
> do and
> agree on how to implement them. Alasdair pointed out in the upstream
> thread that
> we had a prototype here in fsadm.
> 
> (2) Very high speed, low latency SSD devices and testing. Have we
> settled on the
> need for these devices to all have block level drivers? For S-ATA or
> SAS
> devices, are there known performance issues that require enhancements
> in
> somewhere in the stack?
> 
> (3) The union mount versus overlayfs debate - pros and cons. What each
> do well,
> what needs doing. Do we want/need both upstream? (Maybe this can get 10
> minutes
> in Al's VFS session?)
> 
> Thanks!
> 
> Ric

A few others that I think may span across I/O, Block fs..layers.

1) Dm-thinp target vs File system thin profile vs block map based thin/trim profile. Facilitate I/O throttling for thin/trimmable storage. Online and Offline profil.
2) Interfaces for SCSI, Ethernet/*transport configuration parameters floating around in sysfs, procfs. Architecting guidelines for accepting patches for hybrid devices.
3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for all and they have to help each other
4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your subsystem and there are many non-cooperating B/W control constructs in each subsystem.

-Shyam

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 17:20     ` Shyam_Iyer
  (?)
@ 2011-03-29 17:33     ` Vivek Goyal
  2011-03-29 18:10         ` Shyam_Iyer
  -1 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-03-29 17:33 UTC (permalink / raw)
  To: Shyam_Iyer
  Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel, linux-scsi

On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com wrote:
> 
> 
> > -----Original Message-----
> > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> > owner@vger.kernel.org] On Behalf Of Ric Wheeler
> > Sent: Tuesday, March 29, 2011 7:17 AM
> > To: James Bottomley
> > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> > scsi@vger.kernel.org; device-mapper development
> > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > 
> > On 03/29/2011 12:36 AM, James Bottomley wrote:
> > > Hi All,
> > >
> > > Since LSF is less than a week away, the programme committee put
> > together
> > > a just in time preliminary agenda for LSF.  As you can see there is
> > > still plenty of empty space, which you can make suggestions (to this
> > > list with appropriate general list cc's) for filling:
> > >
> > >
> > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> > >
> > > If you don't make suggestions, the programme committee will feel
> > > empowered to make arbitrary assignments based on your topic and
> > attendee
> > > email requests ...
> > >
> > > We're still not quite sure what rooms we will have at the Kabuki, but
> > > we'll add them to the spreadsheet when we know (they should be close
> > to
> > > each other).
> > >
> > > The spreadsheet above also gives contact information for all the
> > > attendees and the programme committee.
> > >
> > > Yours,
> > >
> > > James Bottomley
> > > on behalf of LSF/MM Programme Committee
> > >
> > 
> > Here are a few topic ideas:
> > 
> > (1)  The first topic that might span IO & FS tracks (or just pull in
> > device
> > mapper people to an FS track) could be adding new commands that would
> > allow
> > users to grow/shrink/etc file systems in a generic way.  The thought I
> > had was
> > that we have a reasonable model that we could reuse for these new
> > commands like
> > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the
> > road, it
> > could be nice to identify exactly what common operations users want to
> > do and
> > agree on how to implement them. Alasdair pointed out in the upstream
> > thread that
> > we had a prototype here in fsadm.
> > 
> > (2) Very high speed, low latency SSD devices and testing. Have we
> > settled on the
> > need for these devices to all have block level drivers? For S-ATA or
> > SAS
> > devices, are there known performance issues that require enhancements
> > in
> > somewhere in the stack?
> > 
> > (3) The union mount versus overlayfs debate - pros and cons. What each
> > do well,
> > what needs doing. Do we want/need both upstream? (Maybe this can get 10
> > minutes
> > in Al's VFS session?)
> > 
> > Thanks!
> > 
> > Ric
> 
> A few others that I think may span across I/O, Block fs..layers.
> 
> 1) Dm-thinp target vs File system thin profile vs block map based thin/trim profile.

> Facilitate I/O throttling for thin/trimmable storage. Online and Offline profil.

Is above any different from block IO throttling we have got for block
devices?

> 2) Interfaces for SCSI, Ethernet/*transport configuration parameters floating around in sysfs, procfs. Architecting guidelines for accepting patches for hybrid devices.
> 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for all and they have to help each other
> 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your subsystem and there are many non-cooperating B/W control constructs in each subsystem.

Above is pretty generic. Do you have specific needs/ideas/concerns?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
       [not found] <1301373398.2590.20.camel@mulgrave.site>
                   ` (2 preceding siblings ...)
  2011-03-29 15:35 ` [LSF][MM] page allocation & direct reclaim latency Rik van Riel
@ 2011-03-29 17:35 ` Chad Talbott
  2011-03-29 19:09   ` Vivek Goyal
  2011-03-30  4:18   ` Dave Chinner
  3 siblings, 2 replies; 166+ messages in thread
From: Chad Talbott @ 2011-03-29 17:35 UTC (permalink / raw)
  To: James Bottomley; +Cc: lsf, linux-fsdevel, Curt Wohlgemuth

I'd like to propose a discussion topic:

IO-less Dirty Throttling Considered Harmful...

to isolation and cgroup IO schedulers in general.  The disk scheduler
is knocked out of the picture unless it can see the IO generated by
each group above it.  The world of memcg-aware writeback stacked on
top of block-cgroups is a complicated one.  Throttling in
balance_dirty_pages() will likely be a non-starter for current users
of group-aware CFQ.

I'd like a discussion that covers the system-wide view of:

memory
  -> memcg groups
    -> block cgroups
      -> multiple block devices

Chad

On Mon, Mar 28, 2011 at 9:36 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> Hi All,
>
> Since LSF is less than a week away, the programme committee put together
> a just in time preliminary agenda for LSF.  As you can see there is
> still plenty of empty space, which you can make suggestions (to this
> list with appropriate general list cc's) for filling:
>
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html
>
> If you don't make suggestions, the programme committee will feel
> empowered to make arbitrary assignments based on your topic and attendee
> email requests ...
>
> We're still not quite sure what rooms we will have at the Kabuki, but
> we'll add them to the spreadsheet when we know (they should be close to
> each other).
>
> The spreadsheet above also gives contact information for all the
> attendees and the programme committee.
>
> Yours,
>
> James Bottomley
> on behalf of LSF/MM Programme Committee
>
>
> _______________________________________________
> Lsf mailing list
> Lsf@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/lsf
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* RE: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 17:33     ` Vivek Goyal
@ 2011-03-29 18:10         ` Shyam_Iyer
  0 siblings, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-29 18:10 UTC (permalink / raw)
  To: vgoyal
  Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel, linux-scsi



> -----Original Message-----
> From: Vivek Goyal [mailto:vgoyal@redhat.com]
> Sent: Tuesday, March 29, 2011 1:34 PM
> To: Iyer, Shyam
> Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com;
> lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm-
> devel@redhat.com; linux-scsi@vger.kernel.org
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com wrote:
> >
> >
> > > -----Original Message-----
> > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> > > owner@vger.kernel.org] On Behalf Of Ric Wheeler
> > > Sent: Tuesday, March 29, 2011 7:17 AM
> > > To: James Bottomley
> > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> > > scsi@vger.kernel.org; device-mapper development
> > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > >
> > > On 03/29/2011 12:36 AM, James Bottomley wrote:
> > > > Hi All,
> > > >
> > > > Since LSF is less than a week away, the programme committee put
> > > together
> > > > a just in time preliminary agenda for LSF.  As you can see there
> is
> > > > still plenty of empty space, which you can make suggestions (to
> this
> > > > list with appropriate general list cc's) for filling:
> > > >
> > > >
> > >
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> > > >
> > > > If you don't make suggestions, the programme committee will feel
> > > > empowered to make arbitrary assignments based on your topic and
> > > attendee
> > > > email requests ...
> > > >
> > > > We're still not quite sure what rooms we will have at the Kabuki,
> but
> > > > we'll add them to the spreadsheet when we know (they should be
> close
> > > to
> > > > each other).
> > > >
> > > > The spreadsheet above also gives contact information for all the
> > > > attendees and the programme committee.
> > > >
> > > > Yours,
> > > >
> > > > James Bottomley
> > > > on behalf of LSF/MM Programme Committee
> > > >
> > >
> > > Here are a few topic ideas:
> > >
> > > (1)  The first topic that might span IO & FS tracks (or just pull
> in
> > > device
> > > mapper people to an FS track) could be adding new commands that
> would
> > > allow
> > > users to grow/shrink/etc file systems in a generic way.  The
> thought I
> > > had was
> > > that we have a reasonable model that we could reuse for these new
> > > commands like
> > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the
> > > road, it
> > > could be nice to identify exactly what common operations users want
> to
> > > do and
> > > agree on how to implement them. Alasdair pointed out in the
> upstream
> > > thread that
> > > we had a prototype here in fsadm.
> > >
> > > (2) Very high speed, low latency SSD devices and testing. Have we
> > > settled on the
> > > need for these devices to all have block level drivers? For S-ATA
> or
> > > SAS
> > > devices, are there known performance issues that require
> enhancements
> > > in
> > > somewhere in the stack?
> > >
> > > (3) The union mount versus overlayfs debate - pros and cons. What
> each
> > > do well,
> > > what needs doing. Do we want/need both upstream? (Maybe this can
> get 10
> > > minutes
> > > in Al's VFS session?)
> > >
> > > Thanks!
> > >
> > > Ric
> >
> > A few others that I think may span across I/O, Block fs..layers.
> >
> > 1) Dm-thinp target vs File system thin profile vs block map based
> thin/trim profile.
> 
> > Facilitate I/O throttling for thin/trimmable storage. Online and
> Offline profil.
> 
> Is above any different from block IO throttling we have got for block
> devices?
> 
Yes.. so the throttling would be capacity  based.. when the storage array wants us to throttle the I/O. Depending on the event we may keep getting space allocation write protect check conditions for writes until a user intervenes to stop I/O.


> > 2) Interfaces for SCSI, Ethernet/*transport configuration parameters
> floating around in sysfs, procfs. Architecting guidelines for accepting
> patches for hybrid devices.
> > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for
> all and they have to help each other

For instance if you took a DM snapshot and the storage sent a check condition to the original dm device I am not sure if the DM snapshot would get one too..

If you had a scenario of taking H/W snapshot of an entire pool and decide to delete the individual DM snapshots the H/W snapshot would be inconsistent.

The blocks being managed by a DM-device would have moved (SCSI referrals). I believe Hannes is working on the referrals piece.. 

> > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your
> subsystem and there are many non-cooperating B/W control constructs in
> each subsystem.
> 
> Above is pretty generic. Do you have specific needs/ideas/concerns?
> 
> Thanks
> Vivek
Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O b/w via cgroups. Such bandwidth manipulations are network switch driven and cgroups never take care of these events from the Ethernet driver.

The TC classes route the network I/O to multiqueue groups and so theoretically you could have block queues 1:1 with the number of network multiqueues..

-Shyam

^ permalink raw reply	[flat|nested] 166+ messages in thread

* RE: [Lsf] Preliminary Agenda and Activities for LSF
@ 2011-03-29 18:10         ` Shyam_Iyer
  0 siblings, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-29 18:10 UTC (permalink / raw)
  To: vgoyal
  Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel, linux-scsi



> -----Original Message-----
> From: Vivek Goyal [mailto:vgoyal@redhat.com]
> Sent: Tuesday, March 29, 2011 1:34 PM
> To: Iyer, Shyam
> Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com;
> lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm-
> devel@redhat.com; linux-scsi@vger.kernel.org
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com wrote:
> >
> >
> > > -----Original Message-----
> > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> > > owner@vger.kernel.org] On Behalf Of Ric Wheeler
> > > Sent: Tuesday, March 29, 2011 7:17 AM
> > > To: James Bottomley
> > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> > > scsi@vger.kernel.org; device-mapper development
> > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > >
> > > On 03/29/2011 12:36 AM, James Bottomley wrote:
> > > > Hi All,
> > > >
> > > > Since LSF is less than a week away, the programme committee put
> > > together
> > > > a just in time preliminary agenda for LSF.  As you can see there
> is
> > > > still plenty of empty space, which you can make suggestions (to
> this
> > > > list with appropriate general list cc's) for filling:
> > > >
> > > >
> > >
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> > > >
> > > > If you don't make suggestions, the programme committee will feel
> > > > empowered to make arbitrary assignments based on your topic and
> > > attendee
> > > > email requests ...
> > > >
> > > > We're still not quite sure what rooms we will have at the Kabuki,
> but
> > > > we'll add them to the spreadsheet when we know (they should be
> close
> > > to
> > > > each other).
> > > >
> > > > The spreadsheet above also gives contact information for all the
> > > > attendees and the programme committee.
> > > >
> > > > Yours,
> > > >
> > > > James Bottomley
> > > > on behalf of LSF/MM Programme Committee
> > > >
> > >
> > > Here are a few topic ideas:
> > >
> > > (1)  The first topic that might span IO & FS tracks (or just pull
> in
> > > device
> > > mapper people to an FS track) could be adding new commands that
> would
> > > allow
> > > users to grow/shrink/etc file systems in a generic way.  The
> thought I
> > > had was
> > > that we have a reasonable model that we could reuse for these new
> > > commands like
> > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the
> > > road, it
> > > could be nice to identify exactly what common operations users want
> to
> > > do and
> > > agree on how to implement them. Alasdair pointed out in the
> upstream
> > > thread that
> > > we had a prototype here in fsadm.
> > >
> > > (2) Very high speed, low latency SSD devices and testing. Have we
> > > settled on the
> > > need for these devices to all have block level drivers? For S-ATA
> or
> > > SAS
> > > devices, are there known performance issues that require
> enhancements
> > > in
> > > somewhere in the stack?
> > >
> > > (3) The union mount versus overlayfs debate - pros and cons. What
> each
> > > do well,
> > > what needs doing. Do we want/need both upstream? (Maybe this can
> get 10
> > > minutes
> > > in Al's VFS session?)
> > >
> > > Thanks!
> > >
> > > Ric
> >
> > A few others that I think may span across I/O, Block fs..layers.
> >
> > 1) Dm-thinp target vs File system thin profile vs block map based
> thin/trim profile.
> 
> > Facilitate I/O throttling for thin/trimmable storage. Online and
> Offline profil.
> 
> Is above any different from block IO throttling we have got for block
> devices?
> 
Yes.. so the throttling would be capacity  based.. when the storage array wants us to throttle the I/O. Depending on the event we may keep getting space allocation write protect check conditions for writes until a user intervenes to stop I/O.


> > 2) Interfaces for SCSI, Ethernet/*transport configuration parameters
> floating around in sysfs, procfs. Architecting guidelines for accepting
> patches for hybrid devices.
> > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for
> all and they have to help each other

For instance if you took a DM snapshot and the storage sent a check condition to the original dm device I am not sure if the DM snapshot would get one too..

If you had a scenario of taking H/W snapshot of an entire pool and decide to delete the individual DM snapshots the H/W snapshot would be inconsistent.

The blocks being managed by a DM-device would have moved (SCSI referrals). I believe Hannes is working on the referrals piece.. 

> > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your
> subsystem and there are many non-cooperating B/W control constructs in
> each subsystem.
> 
> Above is pretty generic. Do you have specific needs/ideas/concerns?
> 
> Thanks
> Vivek
Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O b/w via cgroups. Such bandwidth manipulations are network switch driven and cgroups never take care of these events from the Ethernet driver.

The TC classes route the network I/O to multiqueue groups and so theoretically you could have block queues 1:1 with the number of network multiqueues..

-Shyam

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 18:10         ` Shyam_Iyer
  (?)
@ 2011-03-29 18:45         ` Vivek Goyal
  2011-03-29 19:13             ` Shyam_Iyer
  -1 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-03-29 18:45 UTC (permalink / raw)
  To: Shyam_Iyer
  Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel, linux-scsi

On Tue, Mar 29, 2011 at 11:10:18AM -0700, Shyam_Iyer@Dell.com wrote:
> 
> 
> > -----Original Message-----
> > From: Vivek Goyal [mailto:vgoyal@redhat.com]
> > Sent: Tuesday, March 29, 2011 1:34 PM
> > To: Iyer, Shyam
> > Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com;
> > lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm-
> > devel@redhat.com; linux-scsi@vger.kernel.org
> > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > 
> > On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> > > > owner@vger.kernel.org] On Behalf Of Ric Wheeler
> > > > Sent: Tuesday, March 29, 2011 7:17 AM
> > > > To: James Bottomley
> > > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> > > > scsi@vger.kernel.org; device-mapper development
> > > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > > >
> > > > On 03/29/2011 12:36 AM, James Bottomley wrote:
> > > > > Hi All,
> > > > >
> > > > > Since LSF is less than a week away, the programme committee put
> > > > together
> > > > > a just in time preliminary agenda for LSF.  As you can see there
> > is
> > > > > still plenty of empty space, which you can make suggestions (to
> > this
> > > > > list with appropriate general list cc's) for filling:
> > > > >
> > > > >
> > > >
> > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> > > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> > > > >
> > > > > If you don't make suggestions, the programme committee will feel
> > > > > empowered to make arbitrary assignments based on your topic and
> > > > attendee
> > > > > email requests ...
> > > > >
> > > > > We're still not quite sure what rooms we will have at the Kabuki,
> > but
> > > > > we'll add them to the spreadsheet when we know (they should be
> > close
> > > > to
> > > > > each other).
> > > > >
> > > > > The spreadsheet above also gives contact information for all the
> > > > > attendees and the programme committee.
> > > > >
> > > > > Yours,
> > > > >
> > > > > James Bottomley
> > > > > on behalf of LSF/MM Programme Committee
> > > > >
> > > >
> > > > Here are a few topic ideas:
> > > >
> > > > (1)  The first topic that might span IO & FS tracks (or just pull
> > in
> > > > device
> > > > mapper people to an FS track) could be adding new commands that
> > would
> > > > allow
> > > > users to grow/shrink/etc file systems in a generic way.  The
> > thought I
> > > > had was
> > > > that we have a reasonable model that we could reuse for these new
> > > > commands like
> > > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the
> > > > road, it
> > > > could be nice to identify exactly what common operations users want
> > to
> > > > do and
> > > > agree on how to implement them. Alasdair pointed out in the
> > upstream
> > > > thread that
> > > > we had a prototype here in fsadm.
> > > >
> > > > (2) Very high speed, low latency SSD devices and testing. Have we
> > > > settled on the
> > > > need for these devices to all have block level drivers? For S-ATA
> > or
> > > > SAS
> > > > devices, are there known performance issues that require
> > enhancements
> > > > in
> > > > somewhere in the stack?
> > > >
> > > > (3) The union mount versus overlayfs debate - pros and cons. What
> > each
> > > > do well,
> > > > what needs doing. Do we want/need both upstream? (Maybe this can
> > get 10
> > > > minutes
> > > > in Al's VFS session?)
> > > >
> > > > Thanks!
> > > >
> > > > Ric
> > >
> > > A few others that I think may span across I/O, Block fs..layers.
> > >
> > > 1) Dm-thinp target vs File system thin profile vs block map based
> > thin/trim profile.
> > 
> > > Facilitate I/O throttling for thin/trimmable storage. Online and
> > Offline profil.
> > 
> > Is above any different from block IO throttling we have got for block
> > devices?
> > 
> Yes.. so the throttling would be capacity  based.. when the storage array wants us to throttle the I/O. Depending on the event we may keep getting space allocation write protect check conditions for writes until a user intervenes to stop I/O.
> 

Sounds like some user space daemon listening for these events and then
modifying cgroup throttling limits dynamically?

> 
> > > 2) Interfaces for SCSI, Ethernet/*transport configuration parameters
> > floating around in sysfs, procfs. Architecting guidelines for accepting
> > patches for hybrid devices.
> > > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for
> > all and they have to help each other
> 
> For instance if you took a DM snapshot and the storage sent a check condition to the original dm device I am not sure if the DM snapshot would get one too..
> 
> If you had a scenario of taking H/W snapshot of an entire pool and decide to delete the individual DM snapshots the H/W snapshot would be inconsistent.
> 
> The blocks being managed by a DM-device would have moved (SCSI referrals). I believe Hannes is working on the referrals piece.. 
> 
> > > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your
> > subsystem and there are many non-cooperating B/W control constructs in
> > each subsystem.
> > 
> > Above is pretty generic. Do you have specific needs/ideas/concerns?
> > 
> > Thanks
> > Vivek
> Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O b/w via cgroups. Such bandwidth manipulations are network switch driven and cgroups never take care of these events from the Ethernet driver.

So if IO is going over network and actual bandwidth control is taking
place by throttling ethernet traffic then one does not have to specify
block cgroup throttling policy and hence no need for cgroups to be worried
about ethernet driver events?

I think I am missing something here.

Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-29 15:35 ` [LSF][MM] page allocation & direct reclaim latency Rik van Riel
@ 2011-03-29 19:05   ` Andrea Arcangeli
  2011-03-29 20:35     ` Ying Han
                       ` (2 more replies)
  0 siblings, 3 replies; 166+ messages in thread
From: Andrea Arcangeli @ 2011-03-29 19:05 UTC (permalink / raw)
  To: Rik van Riel; +Cc: lsf, linux-mm, Hugh Dickins

Hi Rik, Hugh and everyone,

On Tue, Mar 29, 2011 at 11:35:09AM -0400, Rik van Riel wrote:
> On 03/29/2011 12:36 AM, James Bottomley wrote:
> > Hi All,
> >
> > Since LSF is less than a week away, the programme committee put together
> > a just in time preliminary agenda for LSF.  As you can see there is
> > still plenty of empty space, which you can make suggestions
> 
> There have been a few patches upstream by people for who
> page allocation latency is a concern.
> 
> It may be worthwhile to have a short discussion on what
> we can do to keep page allocation (and direct reclaim?)
> latencies down to a minimum, reducing the slowdown that
> direct reclaim introduces on some workloads.

I don't see the patches you refer to, but checking schedule we've a
slot with Mel&Minchan about "Reclaim, compaction and LRU
ordering". Compaction only applies to high order allocations and it
changes nothing to PAGE_SIZE allocations, but it surely has lower
latency than the older lumpy reclaim logic so overall it should be a
net improvement compared to what we had before.

Should the latency issues be discussed in that track?

The MM schedule has still a free slot 14-14:30 on Monday, I wonder if
there's interest on a "NUMA automatic migration and scheduling
awareness" topic or if it's still too vapourware for a real topic and
we should keep it for offtrack discussions, and maybe we should
reserve it for something more tangible with patches already floating
around. Comments welcome.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 17:35 ` [Lsf] Preliminary Agenda and Activities for LSF Chad Talbott
@ 2011-03-29 19:09   ` Vivek Goyal
  2011-03-29 20:14     ` Chad Talbott
  2011-03-29 20:35     ` Jan Kara
  2011-03-30  4:18   ` Dave Chinner
  1 sibling, 2 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-03-29 19:09 UTC (permalink / raw)
  To: Chad Talbott; +Cc: James Bottomley, lsf, linux-fsdevel

On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote:
> I'd like to propose a discussion topic:
> 
> IO-less Dirty Throttling Considered Harmful...
> 

I see that writeback has extended session at 10.00. I am assuming
IO less throttling will be discussed there. Is it possible to 
discuss its effect on block cgroups there? I am not sure enough
time is there because it ties in memory cgroup also.

Or there is a session at 12.30 "memcg dirty limits and writeback", it
can probably be discussed there too.

> to isolation and cgroup IO schedulers in general.  The disk scheduler
> is knocked out of the picture unless it can see the IO generated by
> each group above it.  The world of memcg-aware writeback stacked on
> top of block-cgroups is a complicated one.  Throttling in
> balance_dirty_pages() will likely be a non-starter for current users
> of group-aware CFQ.

Can't a single flusher thread keep all the groups busy/full on slow
SATA device.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 18:45         ` Vivek Goyal
@ 2011-03-29 19:13             ` Shyam_Iyer
  0 siblings, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-29 19:13 UTC (permalink / raw)
  To: vgoyal; +Cc: lsf, linux-scsi, dm-devel, linux-fsdevel, rwheeler



> -----Original Message-----
> From: Vivek Goyal [mailto:vgoyal@redhat.com]
> Sent: Tuesday, March 29, 2011 2:45 PM
> To: Iyer, Shyam
> Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com;
> lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm-
> devel@redhat.com; linux-scsi@vger.kernel.org
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29, 2011 at 11:10:18AM -0700, Shyam_Iyer@Dell.com wrote:
> >
> >
> > > -----Original Message-----
> > > From: Vivek Goyal [mailto:vgoyal@redhat.com]
> > > Sent: Tuesday, March 29, 2011 1:34 PM
> > > To: Iyer, Shyam
> > > Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com;
> > > lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm-
> > > devel@redhat.com; linux-scsi@vger.kernel.org
> > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > >
> > > On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com
> wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> > > > > owner@vger.kernel.org] On Behalf Of Ric Wheeler
> > > > > Sent: Tuesday, March 29, 2011 7:17 AM
> > > > > To: James Bottomley
> > > > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> > > > > scsi@vger.kernel.org; device-mapper development
> > > > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > > > >
> > > > > On 03/29/2011 12:36 AM, James Bottomley wrote:
> > > > > > Hi All,
> > > > > >
> > > > > > Since LSF is less than a week away, the programme committee
> put
> > > > > together
> > > > > > a just in time preliminary agenda for LSF.  As you can see
> there
> > > is
> > > > > > still plenty of empty space, which you can make suggestions
> (to
> > > this
> > > > > > list with appropriate general list cc's) for filling:
> > > > > >
> > > > > >
> > > > >
> > >
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> > > > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> > > > > >
> > > > > > If you don't make suggestions, the programme committee will
> feel
> > > > > > empowered to make arbitrary assignments based on your topic
> and
> > > > > attendee
> > > > > > email requests ...
> > > > > >
> > > > > > We're still not quite sure what rooms we will have at the
> Kabuki,
> > > but
> > > > > > we'll add them to the spreadsheet when we know (they should
> be
> > > close
> > > > > to
> > > > > > each other).
> > > > > >
> > > > > > The spreadsheet above also gives contact information for all
> the
> > > > > > attendees and the programme committee.
> > > > > >
> > > > > > Yours,
> > > > > >
> > > > > > James Bottomley
> > > > > > on behalf of LSF/MM Programme Committee
> > > > > >
> > > > >
> > > > > Here are a few topic ideas:
> > > > >
> > > > > (1)  The first topic that might span IO & FS tracks (or just
> pull
> > > in
> > > > > device
> > > > > mapper people to an FS track) could be adding new commands that
> > > would
> > > > > allow
> > > > > users to grow/shrink/etc file systems in a generic way.  The
> > > thought I
> > > > > had was
> > > > > that we have a reasonable model that we could reuse for these
> new
> > > > > commands like
> > > > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down
> the
> > > > > road, it
> > > > > could be nice to identify exactly what common operations users
> want
> > > to
> > > > > do and
> > > > > agree on how to implement them. Alasdair pointed out in the
> > > upstream
> > > > > thread that
> > > > > we had a prototype here in fsadm.
> > > > >
> > > > > (2) Very high speed, low latency SSD devices and testing. Have
> we
> > > > > settled on the
> > > > > need for these devices to all have block level drivers? For S-
> ATA
> > > or
> > > > > SAS
> > > > > devices, are there known performance issues that require
> > > enhancements
> > > > > in
> > > > > somewhere in the stack?
> > > > >
> > > > > (3) The union mount versus overlayfs debate - pros and cons.
> What
> > > each
> > > > > do well,
> > > > > what needs doing. Do we want/need both upstream? (Maybe this
> can
> > > get 10
> > > > > minutes
> > > > > in Al's VFS session?)
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Ric
> > > >
> > > > A few others that I think may span across I/O, Block fs..layers.
> > > >
> > > > 1) Dm-thinp target vs File system thin profile vs block map based
> > > thin/trim profile.
> > >
> > > > Facilitate I/O throttling for thin/trimmable storage. Online and
> > > Offline profil.
> > >
> > > Is above any different from block IO throttling we have got for
> block
> > > devices?
> > >
> > Yes.. so the throttling would be capacity  based.. when the storage
> array wants us to throttle the I/O. Depending on the event we may keep
> getting space allocation write protect check conditions for writes
> until a user intervenes to stop I/O.
> >
> 
> Sounds like some user space daemon listening for these events and then
> modifying cgroup throttling limits dynamically?

But we have dm-targets in the horizon like dm-thinp setting soft limits on capacity.. we could extend the concept to H/W imposed soft/hard limits.

The user space could throttle the I/O but it had have to go about finding all processes running I/O on the LUN.. In some cases it could be an I/O process running within a VM.. 

That would require a passthrough interface to inform it.. I doubt if we would be able to accomplish that any sooner with the multiple operating systems involved. Or requiring each application to register with the userland process. Doable but cumbersome and buggy..

The dm-thinp target can help in this scenario by setting a blanket storage limit. We could go about extending the limit dynamically based on hints/commands from the userland daemon listening to such events.

This approach will probably not take care of scenarios where VM storage is over say NFS or clustered filesystem..
> 
> >
> > > > 2) Interfaces for SCSI, Ethernet/*transport configuration
> parameters
> > > floating around in sysfs, procfs. Architecting guidelines for
> accepting
> > > patches for hybrid devices.
> > > > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room
> for
> > > all and they have to help each other
> >
> > For instance if you took a DM snapshot and the storage sent a check
> condition to the original dm device I am not sure if the DM snapshot
> would get one too..
> >
> > If you had a scenario of taking H/W snapshot of an entire pool and
> decide to delete the individual DM snapshots the H/W snapshot would be
> inconsistent.
> >
> > The blocks being managed by a DM-device would have moved (SCSI
> referrals). I believe Hannes is working on the referrals piece..
> >
> > > > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick
> your
> > > subsystem and there are many non-cooperating B/W control constructs
> in
> > > each subsystem.
> > >
> > > Above is pretty generic. Do you have specific needs/ideas/concerns?
> > >
> > > Thanks
> > > Vivek
> > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O
> b/w via cgroups. Such bandwidth manipulations are network switch driven
> and cgroups never take care of these events from the Ethernet driver.
> 
> So if IO is going over network and actual bandwidth control is taking
> place by throttling ethernet traffic then one does not have to specify
> block cgroup throttling policy and hence no need for cgroups to be
> worried
> about ethernet driver events?
> 
> I think I am missing something here.
> 
> Vivek
Well.. here is the catch.. example scenario..

- Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1  multipathed together. Let us say round-robin policy.

- The cgroup profile is to limit I/O bandwidth to 40% of the multipathed I/O bandwidth. But the switch may have limited the I/O bandwidth to 40% for the corresponding vlan associated with one of the eth interface say eth1

The computation that the bandwidth configured is 40% of the available bandwidth is false in this case.  What we need to do is possibly push more I/O through eth0 as it is allowed to run at 100% of bandwidth by the switch. 

Now this is a dynamic decision and multipathing layer should take care of it.. but it would need a hint..

Policies are usually decided at different levels, SLAs and sometimes logistics determine these decisions etc. Sometimes the bandwidth lowering by the switch is traffic dependent but user level policies remain in tact. Typical case of network administrator not talking to the system administrator.

-Shyam

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
@ 2011-03-29 19:13             ` Shyam_Iyer
  0 siblings, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-29 19:13 UTC (permalink / raw)
  To: vgoyal; +Cc: lsf, linux-scsi, dm-devel, linux-fsdevel, rwheeler



> -----Original Message-----
> From: Vivek Goyal [mailto:vgoyal@redhat.com]
> Sent: Tuesday, March 29, 2011 2:45 PM
> To: Iyer, Shyam
> Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com;
> lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm-
> devel@redhat.com; linux-scsi@vger.kernel.org
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29, 2011 at 11:10:18AM -0700, Shyam_Iyer@Dell.com wrote:
> >
> >
> > > -----Original Message-----
> > > From: Vivek Goyal [mailto:vgoyal@redhat.com]
> > > Sent: Tuesday, March 29, 2011 1:34 PM
> > > To: Iyer, Shyam
> > > Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com;
> > > lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm-
> > > devel@redhat.com; linux-scsi@vger.kernel.org
> > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > >
> > > On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com
> wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> > > > > owner@vger.kernel.org] On Behalf Of Ric Wheeler
> > > > > Sent: Tuesday, March 29, 2011 7:17 AM
> > > > > To: James Bottomley
> > > > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> > > > > scsi@vger.kernel.org; device-mapper development
> > > > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > > > >
> > > > > On 03/29/2011 12:36 AM, James Bottomley wrote:
> > > > > > Hi All,
> > > > > >
> > > > > > Since LSF is less than a week away, the programme committee
> put
> > > > > together
> > > > > > a just in time preliminary agenda for LSF.  As you can see
> there
> > > is
> > > > > > still plenty of empty space, which you can make suggestions
> (to
> > > this
> > > > > > list with appropriate general list cc's) for filling:
> > > > > >
> > > > > >
> > > > >
> > >
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> > > > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> > > > > >
> > > > > > If you don't make suggestions, the programme committee will
> feel
> > > > > > empowered to make arbitrary assignments based on your topic
> and
> > > > > attendee
> > > > > > email requests ...
> > > > > >
> > > > > > We're still not quite sure what rooms we will have at the
> Kabuki,
> > > but
> > > > > > we'll add them to the spreadsheet when we know (they should
> be
> > > close
> > > > > to
> > > > > > each other).
> > > > > >
> > > > > > The spreadsheet above also gives contact information for all
> the
> > > > > > attendees and the programme committee.
> > > > > >
> > > > > > Yours,
> > > > > >
> > > > > > James Bottomley
> > > > > > on behalf of LSF/MM Programme Committee
> > > > > >
> > > > >
> > > > > Here are a few topic ideas:
> > > > >
> > > > > (1)  The first topic that might span IO & FS tracks (or just
> pull
> > > in
> > > > > device
> > > > > mapper people to an FS track) could be adding new commands that
> > > would
> > > > > allow
> > > > > users to grow/shrink/etc file systems in a generic way.  The
> > > thought I
> > > > > had was
> > > > > that we have a reasonable model that we could reuse for these
> new
> > > > > commands like
> > > > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down
> the
> > > > > road, it
> > > > > could be nice to identify exactly what common operations users
> want
> > > to
> > > > > do and
> > > > > agree on how to implement them. Alasdair pointed out in the
> > > upstream
> > > > > thread that
> > > > > we had a prototype here in fsadm.
> > > > >
> > > > > (2) Very high speed, low latency SSD devices and testing. Have
> we
> > > > > settled on the
> > > > > need for these devices to all have block level drivers? For S-
> ATA
> > > or
> > > > > SAS
> > > > > devices, are there known performance issues that require
> > > enhancements
> > > > > in
> > > > > somewhere in the stack?
> > > > >
> > > > > (3) The union mount versus overlayfs debate - pros and cons.
> What
> > > each
> > > > > do well,
> > > > > what needs doing. Do we want/need both upstream? (Maybe this
> can
> > > get 10
> > > > > minutes
> > > > > in Al's VFS session?)
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Ric
> > > >
> > > > A few others that I think may span across I/O, Block fs..layers.
> > > >
> > > > 1) Dm-thinp target vs File system thin profile vs block map based
> > > thin/trim profile.
> > >
> > > > Facilitate I/O throttling for thin/trimmable storage. Online and
> > > Offline profil.
> > >
> > > Is above any different from block IO throttling we have got for
> block
> > > devices?
> > >
> > Yes.. so the throttling would be capacity  based.. when the storage
> array wants us to throttle the I/O. Depending on the event we may keep
> getting space allocation write protect check conditions for writes
> until a user intervenes to stop I/O.
> >
> 
> Sounds like some user space daemon listening for these events and then
> modifying cgroup throttling limits dynamically?

But we have dm-targets in the horizon like dm-thinp setting soft limits on capacity.. we could extend the concept to H/W imposed soft/hard limits.

The user space could throttle the I/O but it had have to go about finding all processes running I/O on the LUN.. In some cases it could be an I/O process running within a VM.. 

That would require a passthrough interface to inform it.. I doubt if we would be able to accomplish that any sooner with the multiple operating systems involved. Or requiring each application to register with the userland process. Doable but cumbersome and buggy..

The dm-thinp target can help in this scenario by setting a blanket storage limit. We could go about extending the limit dynamically based on hints/commands from the userland daemon listening to such events.

This approach will probably not take care of scenarios where VM storage is over say NFS or clustered filesystem..
> 
> >
> > > > 2) Interfaces for SCSI, Ethernet/*transport configuration
> parameters
> > > floating around in sysfs, procfs. Architecting guidelines for
> accepting
> > > patches for hybrid devices.
> > > > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room
> for
> > > all and they have to help each other
> >
> > For instance if you took a DM snapshot and the storage sent a check
> condition to the original dm device I am not sure if the DM snapshot
> would get one too..
> >
> > If you had a scenario of taking H/W snapshot of an entire pool and
> decide to delete the individual DM snapshots the H/W snapshot would be
> inconsistent.
> >
> > The blocks being managed by a DM-device would have moved (SCSI
> referrals). I believe Hannes is working on the referrals piece..
> >
> > > > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick
> your
> > > subsystem and there are many non-cooperating B/W control constructs
> in
> > > each subsystem.
> > >
> > > Above is pretty generic. Do you have specific needs/ideas/concerns?
> > >
> > > Thanks
> > > Vivek
> > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O
> b/w via cgroups. Such bandwidth manipulations are network switch driven
> and cgroups never take care of these events from the Ethernet driver.
> 
> So if IO is going over network and actual bandwidth control is taking
> place by throttling ethernet traffic then one does not have to specify
> block cgroup throttling policy and hence no need for cgroups to be
> worried
> about ethernet driver events?
> 
> I think I am missing something here.
> 
> Vivek
Well.. here is the catch.. example scenario..

- Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1  multipathed together. Let us say round-robin policy.

- The cgroup profile is to limit I/O bandwidth to 40% of the multipathed I/O bandwidth. But the switch may have limited the I/O bandwidth to 40% for the corresponding vlan associated with one of the eth interface say eth1

The computation that the bandwidth configured is 40% of the available bandwidth is false in this case.  What we need to do is possibly push more I/O through eth0 as it is allowed to run at 100% of bandwidth by the switch. 

Now this is a dynamic decision and multipathing layer should take care of it.. but it would need a hint..

Policies are usually decided at different levels, SLAs and sometimes logistics determine these decisions etc. Sometimes the bandwidth lowering by the switch is traffic dependent but user level policies remain in tact. Typical case of network administrator not talking to the system administrator.

-Shyam

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:16 ` Ric Wheeler
  2011-03-29 11:22   ` Matthew Wilcox
  2011-03-29 17:20     ` Shyam_Iyer
@ 2011-03-29 19:47   ` Nicholas A. Bellinger
  2011-03-29 20:29   ` Jan Kara
  2011-03-30  0:33   ` Mingming Cao
  4 siblings, 0 replies; 166+ messages in thread
From: Nicholas A. Bellinger @ 2011-03-29 19:47 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: James Bottomley, lsf, linux-fsdevel, linux-scsi,
	device-mapper development

On Tue, 2011-03-29 at 07:16 -0400, Ric Wheeler wrote:
> On 03/29/2011 12:36 AM, James Bottomley wrote:
> > Hi All,
> >
> > Since LSF is less than a week away, the programme committee put together
> > a just in time preliminary agenda for LSF.  As you can see there is
> > still plenty of empty space, which you can make suggestions (to this
> > list with appropriate general list cc's) for filling:
> >
> > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> >
> > If you don't make suggestions, the programme committee will feel
> > empowered to make arbitrary assignments based on your topic and attendee
> > email requests ...
> >
> > We're still not quite sure what rooms we will have at the Kabuki, but
> > we'll add them to the spreadsheet when we know (they should be close to
> > each other).
> >
> > The spreadsheet above also gives contact information for all the
> > attendees and the programme committee.
> >
> > Yours,
> >
> > James Bottomley
> > on behalf of LSF/MM Programme Committee
> >
> 
> Here are a few topic ideas:
> 
> (1)  The first topic that might span IO & FS tracks (or just pull in device 
> mapper people to an FS track) could be adding new commands that would allow 
> users to grow/shrink/etc file systems in a generic way.  The thought I had was 
> that we have a reasonable model that we could reuse for these new commands like 
> mount and mount.fs or fsck and fsck.fs. With btrfs coming down the road, it 
> could be nice to identify exactly what common operations users want to do and 
> agree on how to implement them. Alasdair pointed out in the upstream thread that 
> we had a prototype here in fsadm.
> 
> (2) Very high speed, low latency SSD devices and testing. Have we settled on the 
> need for these devices to all have block level drivers? For S-ATA or SAS 
> devices, are there known performance issues that require enhancements in 
> somewhere in the stack?
> 
> (3) The union mount versus overlayfs debate - pros and cons. What each do well, 
> what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes 
> in Al's VFS session?)
> 

Hi Ric, James and LSF-PC chairs,

Beyond my original LSF topic proposal for the next-generation QEMU/KVM
Virtio-SCSI target driver here:

http://marc.info/?l=linux-scsi&m=129706545408966&w=2

The following target mode related topics would be useful for the current
attendees with interest in /drivers/target/ code if there is extra room
available for local attendance within the IO/storage track.

(4) Enabling mixed Target/Initiator mode in existing mainline SCSI LLDs
that support HW target mode, and come to an consensus determination for
how best to make the SCSI LLD / target fabric driver split when enabling
mainline target infrastructure support into existing SCSI LLDs.  This
code is currently in flight for qla2xxx / tcm_qla2xxx for .40  (Hannes,
Christoph, Mike, Qlogic and other LLD maintainers)

(5) Driving target configfs group creation from kernel-space via a
userspace passthrough using some form of portable / acceptable mainline
interface.  This is a topic that has been raised on the scsi list for
the ibmvscsis target driver for .40, and is going to be useful for other
in-flight HW target driver as well. (Tomo-san, Hannes, Mike, James,
Joel)

Thank you!

--nab


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 19:13             ` Shyam_Iyer
  (?)
@ 2011-03-29 19:57             ` Vivek Goyal
  -1 siblings, 0 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-03-29 19:57 UTC (permalink / raw)
  To: Shyam_Iyer
  Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel, linux-scsi

On Tue, Mar 29, 2011 at 12:13:41PM -0700, Shyam_Iyer@Dell.com wrote:

[..]
> > 
> > Sounds like some user space daemon listening for these events and then
> > modifying cgroup throttling limits dynamically?
> 
> But we have dm-targets in the horizon like dm-thinp setting soft limits on capacity.. we could extend the concept to H/W imposed soft/hard limits.
> 
> The user space could throttle the I/O but it had have to go about finding all processes running I/O on the LUN.. In some cases it could be an I/O process running within a VM.. 

Well, if there is one cgroup (root cgroup), then daemon does not have to
find anything. This is one global space and there is provision to set
per device limit. So daemon can just go and adjust device limits
dynamically and that gets applicable for all processes.

The problem will happen if there are more cgroups created and limits are
per cgroup, per device. (For creating service differentiation). I would
say in that case daemon needs to be more sophisticated and reduce the
limit in each group by same % as required by thinly provisioned target.

That way a higher rate group will still get higher IO rate on a thinly
provisioned device which is imposing its own throttling. Otherwise we
again run into issues where there is no service differentiation between
faster group or slower group.

IOW, if we are throttling thinly povisioned devices, I think throttling
these using a user space daemon might be better as it will reuse the
kernel throttling infrastructure as well as throttling will be cgroup
aware.
 
> 
> That would require a passthrough interface to inform it.. I doubt if we would be able to accomplish that any sooner with the multiple operating systems involved. Or requiring each application to register with the userland process. Doable but cumbersome and buggy..
> 
> The dm-thinp target can help in this scenario by setting a blanket storage limit. We could go about extending the limit dynamically based on hints/commands from the userland daemon listening to such events.
> 
> This approach will probably not take care of scenarios where VM storage is over say NFS or clustered filesystem..

Even current blkio throttling does not work over NFS. This is one of the
issues I wanted to discuss at LSF.

[..]
> Well.. here is the catch.. example scenario..
> 
> - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1  multipathed together. Let us say round-robin policy.
> 
> - The cgroup profile is to limit I/O bandwidth to 40% of the multipathed I/O bandwidth. But the switch may have limited the I/O bandwidth to 40% for the corresponding vlan associated with one of the eth interface say eth1
> 
> The computation that the bandwidth configured is 40% of the available bandwidth is false in this case.  What we need to do is possibly push more I/O through eth0 as it is allowed to run at 100% of bandwidth by the switch. 
> 
> Now this is a dynamic decision and multipathing layer should take care of it.. but it would need a hint..
> 

So we have multipathed two paths in a round robin manner and one path is
faster and other is slower. I am not sure what multipath does in those
scenarios but trying to send more IO on faster path sounds like right
thing to do.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: Preliminary Agenda and Activities for LSF
  2011-03-29 19:13             ` Shyam_Iyer
  (?)
  (?)
@ 2011-03-29 19:59             ` Mike Snitzer
  2011-03-29 20:12                 ` Shyam_Iyer
  -1 siblings, 1 reply; 166+ messages in thread
From: Mike Snitzer @ 2011-03-29 19:59 UTC (permalink / raw)
  To: Shyam_Iyer
  Cc: vgoyal, lsf, linux-scsi, linux-fsdevel, rwheeler,
	device-mapper development

On Tue, Mar 29 2011 at  3:13pm -0400,
Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:

> > > > Above is pretty generic. Do you have specific needs/ideas/concerns?
> > > >
> > > > Thanks
> > > > Vivek
> > > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O
> > b/w via cgroups. Such bandwidth manipulations are network switch driven
> > and cgroups never take care of these events from the Ethernet driver.
> > 
> > So if IO is going over network and actual bandwidth control is taking
> > place by throttling ethernet traffic then one does not have to specify
> > block cgroup throttling policy and hence no need for cgroups to be
> > worried
> > about ethernet driver events?
> > 
> > I think I am missing something here.
> > 
> > Vivek
> Well.. here is the catch.. example scenario..
> 
> - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1  multipathed together. Let us say round-robin policy.
> 
> - The cgroup profile is to limit I/O bandwidth to 40% of the multipathed I/O bandwidth. But the switch may have limited the I/O bandwidth to 40% for the corresponding vlan associated with one of the eth interface say eth1
> 
> The computation that the bandwidth configured is 40% of the available bandwidth is false in this case.  What we need to do is possibly push more I/O through eth0 as it is allowed to run at 100% of bandwidth by the switch. 
> 
> Now this is a dynamic decision and multipathing layer should take care of it.. but it would need a hint..

No hint should be needed.  Just use one of the newer multipath path
selectors that are dynamic by design: "queue-length" or "service-time".

This scenario is exactly what those path selectors are meant to address.

Mike

^ permalink raw reply	[flat|nested] 166+ messages in thread

* RE: Preliminary Agenda and Activities for LSF
  2011-03-29 19:59             ` Mike Snitzer
@ 2011-03-29 20:12                 ` Shyam_Iyer
  0 siblings, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-29 20:12 UTC (permalink / raw)
  To: snitzer; +Cc: vgoyal, lsf, linux-scsi, linux-fsdevel, rwheeler, dm-devel



> -----Original Message-----
> From: Mike Snitzer [mailto:snitzer@redhat.com]
> Sent: Tuesday, March 29, 2011 4:00 PM
> To: Iyer, Shyam
> Cc: vgoyal@redhat.com; lsf@lists.linux-foundation.org; linux-
> scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org;
> rwheeler@redhat.com; device-mapper development
> Subject: Re: Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29 2011 at  3:13pm -0400,
> Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:
> 
> > > > > Above is pretty generic. Do you have specific
> needs/ideas/concerns?
> > > > >
> > > > > Thanks
> > > > > Vivek
> > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit
> I/O
> > > b/w via cgroups. Such bandwidth manipulations are network switch
> driven
> > > and cgroups never take care of these events from the Ethernet
> driver.
> > >
> > > So if IO is going over network and actual bandwidth control is
> taking
> > > place by throttling ethernet traffic then one does not have to
> specify
> > > block cgroup throttling policy and hence no need for cgroups to be
> > > worried
> > > about ethernet driver events?
> > >
> > > I think I am missing something here.
> > >
> > > Vivek
> > Well.. here is the catch.. example scenario..
> >
> > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1
> multipathed together. Let us say round-robin policy.
> >
> > - The cgroup profile is to limit I/O bandwidth to 40% of the
> multipathed I/O bandwidth. But the switch may have limited the I/O
> bandwidth to 40% for the corresponding vlan associated with one of the
> eth interface say eth1
> >
> > The computation that the bandwidth configured is 40% of the available
> bandwidth is false in this case.  What we need to do is possibly push
> more I/O through eth0 as it is allowed to run at 100% of bandwidth by
> the switch.
> >
> > Now this is a dynamic decision and multipathing layer should take
> care of it.. but it would need a hint..
> 
> No hint should be needed.  Just use one of the newer multipath path
> selectors that are dynamic by design: "queue-length" or "service-time".
> 
> This scenario is exactly what those path selectors are meant to
> address.
> 
> Mike

Since iSCSI multipaths are essentially sessions one could configure more than one session through the same ethX interface. The sessions need not be going to the same LUN and hence not governed by the same multipath selector but the bandwidth policy group would be for a group of resources.

-Shyam





^ permalink raw reply	[flat|nested] 166+ messages in thread

* RE: Preliminary Agenda and Activities for LSF
@ 2011-03-29 20:12                 ` Shyam_Iyer
  0 siblings, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-29 20:12 UTC (permalink / raw)
  To: snitzer; +Cc: vgoyal, lsf, linux-scsi, linux-fsdevel, rwheeler, dm-devel



> -----Original Message-----
> From: Mike Snitzer [mailto:snitzer@redhat.com]
> Sent: Tuesday, March 29, 2011 4:00 PM
> To: Iyer, Shyam
> Cc: vgoyal@redhat.com; lsf@lists.linux-foundation.org; linux-
> scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org;
> rwheeler@redhat.com; device-mapper development
> Subject: Re: Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29 2011 at  3:13pm -0400,
> Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:
> 
> > > > > Above is pretty generic. Do you have specific
> needs/ideas/concerns?
> > > > >
> > > > > Thanks
> > > > > Vivek
> > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit
> I/O
> > > b/w via cgroups. Such bandwidth manipulations are network switch
> driven
> > > and cgroups never take care of these events from the Ethernet
> driver.
> > >
> > > So if IO is going over network and actual bandwidth control is
> taking
> > > place by throttling ethernet traffic then one does not have to
> specify
> > > block cgroup throttling policy and hence no need for cgroups to be
> > > worried
> > > about ethernet driver events?
> > >
> > > I think I am missing something here.
> > >
> > > Vivek
> > Well.. here is the catch.. example scenario..
> >
> > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1
> multipathed together. Let us say round-robin policy.
> >
> > - The cgroup profile is to limit I/O bandwidth to 40% of the
> multipathed I/O bandwidth. But the switch may have limited the I/O
> bandwidth to 40% for the corresponding vlan associated with one of the
> eth interface say eth1
> >
> > The computation that the bandwidth configured is 40% of the available
> bandwidth is false in this case.  What we need to do is possibly push
> more I/O through eth0 as it is allowed to run at 100% of bandwidth by
> the switch.
> >
> > Now this is a dynamic decision and multipathing layer should take
> care of it.. but it would need a hint..
> 
> No hint should be needed.  Just use one of the newer multipath path
> selectors that are dynamic by design: "queue-length" or "service-time".
> 
> This scenario is exactly what those path selectors are meant to
> address.
> 
> Mike

Since iSCSI multipaths are essentially sessions one could configure more than one session through the same ethX interface. The sessions need not be going to the same LUN and hence not governed by the same multipath selector but the bandwidth policy group would be for a group of resources.

-Shyam





^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 19:09   ` Vivek Goyal
@ 2011-03-29 20:14     ` Chad Talbott
  2011-03-29 20:35     ` Jan Kara
  1 sibling, 0 replies; 166+ messages in thread
From: Chad Talbott @ 2011-03-29 20:14 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: James Bottomley, lsf, linux-fsdevel

On Tue, Mar 29, 2011 at 12:09 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote:
>> I'd like to propose a discussion topic:
>>
>> IO-less Dirty Throttling Considered Harmful...
>>
>
> I see that writeback has extended session at 10.00. I am assuming
> IO less throttling will be discussed there. Is it possible to
> discuss its effect on block cgroups there? I am not sure enough
> time is there because it ties in memory cgroup also.
>
> Or there is a session at 12.30 "memcg dirty limits and writeback", it
> can probably be discussed there too.

I just want to make sure that the topic is discussed and I don't want
to eat into someone else's time.  I'll be sure to bring it up if it's
not granted a dedicated session.

>> to isolation and cgroup IO schedulers in general.  The disk scheduler
>> is knocked out of the picture unless it can see the IO generated by
>> each group above it.  The world of memcg-aware writeback stacked on
>> top of block-cgroups is a complicated one.  Throttling in
>> balance_dirty_pages() will likely be a non-starter for current users
>> of group-aware CFQ.
>
> Can't a single flusher thread keep all the groups busy/full on slow
> SATA device.

A single flusher thread *could* keep all the groups busy and full, but
the current implementation does nothing explicit to make that happen.
I'd like to make sure that this case is considered, independent of a
particular implementation.

Chad
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: Preliminary Agenda and Activities for LSF
  2011-03-29 20:12                 ` Shyam_Iyer
  (?)
@ 2011-03-29 20:23                 ` Mike Snitzer
  2011-03-29 23:09                     ` Shyam_Iyer
  -1 siblings, 1 reply; 166+ messages in thread
From: Mike Snitzer @ 2011-03-29 20:23 UTC (permalink / raw)
  To: Shyam_Iyer
  Cc: linux-scsi, lsf, linux-fsdevel, rwheeler, vgoyal,
	device-mapper development

On Tue, Mar 29 2011 at  4:12pm -0400,
Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:

> 
> 
> > -----Original Message-----
> > From: Mike Snitzer [mailto:snitzer@redhat.com]
> > Sent: Tuesday, March 29, 2011 4:00 PM
> > To: Iyer, Shyam
> > Cc: vgoyal@redhat.com; lsf@lists.linux-foundation.org; linux-
> > scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org;
> > rwheeler@redhat.com; device-mapper development
> > Subject: Re: Preliminary Agenda and Activities for LSF
> > 
> > On Tue, Mar 29 2011 at  3:13pm -0400,
> > Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:
> > 
> > > > > > Above is pretty generic. Do you have specific
> > needs/ideas/concerns?
> > > > > >
> > > > > > Thanks
> > > > > > Vivek
> > > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit
> > I/O
> > > > b/w via cgroups. Such bandwidth manipulations are network switch
> > driven
> > > > and cgroups never take care of these events from the Ethernet
> > driver.
> > > >
> > > > So if IO is going over network and actual bandwidth control is
> > taking
> > > > place by throttling ethernet traffic then one does not have to
> > specify
> > > > block cgroup throttling policy and hence no need for cgroups to be
> > > > worried
> > > > about ethernet driver events?
> > > >
> > > > I think I am missing something here.
> > > >
> > > > Vivek
> > > Well.. here is the catch.. example scenario..
> > >
> > > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1
> > multipathed together. Let us say round-robin policy.
> > >
> > > - The cgroup profile is to limit I/O bandwidth to 40% of the
> > multipathed I/O bandwidth. But the switch may have limited the I/O
> > bandwidth to 40% for the corresponding vlan associated with one of the
> > eth interface say eth1
> > >
> > > The computation that the bandwidth configured is 40% of the available
> > bandwidth is false in this case.  What we need to do is possibly push
> > more I/O through eth0 as it is allowed to run at 100% of bandwidth by
> > the switch.
> > >
> > > Now this is a dynamic decision and multipathing layer should take
> > care of it.. but it would need a hint..
> > 
> > No hint should be needed.  Just use one of the newer multipath path
> > selectors that are dynamic by design: "queue-length" or "service-time".
> > 
> > This scenario is exactly what those path selectors are meant to
> > address.
> > 
> > Mike
> 
> Since iSCSI multipaths are essentially sessions one could configure
> more than one session through the same ethX interface. The sessions
> need not be going to the same LUN and hence not governed by the same
> multipath selector but the bandwidth policy group would be for a group
> of resources.

Then the sessions don't correspond to the same backend LUN (and by
definition aren't part of the same mpath device).  You're really all
over the map with your talking points.

I'm having a hard time following you.

Mike

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:16 ` Ric Wheeler
                     ` (2 preceding siblings ...)
  2011-03-29 19:47   ` Nicholas A. Bellinger
@ 2011-03-29 20:29   ` Jan Kara
  2011-03-29 20:31     ` Ric Wheeler
  2011-03-30  0:33   ` Mingming Cao
  4 siblings, 1 reply; 166+ messages in thread
From: Jan Kara @ 2011-03-29 20:29 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: James Bottomley, lsf, linux-fsdevel, device-mapper development,
	linux-scsi

On Tue 29-03-11 07:16:32, Ric Wheeler wrote:
> On 03/29/2011 12:36 AM, James Bottomley wrote:
> (3) The union mount versus overlayfs debate - pros and cons. What each do well, 
> what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes 
> in Al's VFS session?)
  It might be interesting but neither Miklos nor Val seems to be attending
so I'm not sure how deep discussion we can have :).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 20:29   ` Jan Kara
@ 2011-03-29 20:31     ` Ric Wheeler
  0 siblings, 0 replies; 166+ messages in thread
From: Ric Wheeler @ 2011-03-29 20:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ric Wheeler, James Bottomley, lsf, device-mapper development,
	linux-fsdevel, linux-scsi

On 03/29/2011 04:29 PM, Jan Kara wrote:
> On Tue 29-03-11 07:16:32, Ric Wheeler wrote:
>> On 03/29/2011 12:36 AM, James Bottomley wrote:
>> (3) The union mount versus overlayfs debate - pros and cons. What each do well,
>> what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes
>> in Al's VFS session?)
>    It might be interesting but neither Miklos nor Val seems to be attending
> so I'm not sure how deep discussion we can have :).
>
> 								Honza

Very true - probably best to keep that discussion focused upstream (but that 
seems to have quieted down as well)...

Ric


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-29 19:05   ` [Lsf] " Andrea Arcangeli
@ 2011-03-29 20:35     ` Ying Han
  2011-03-29 20:39       ` Ying Han
  2011-03-29 20:45       ` Andrea Arcangeli
  2011-03-29 21:22     ` Rik van Riel
  2011-03-29 22:13     ` Minchan Kim
  2 siblings, 2 replies; 166+ messages in thread
From: Ying Han @ 2011-03-29 20:35 UTC (permalink / raw)
  To: Andrea Arcangeli, Rik van Riel; +Cc: lsf, linux-mm, Hugh Dickins

On Tue, Mar 29, 2011 at 12:05 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi Rik, Hugh and everyone,
>
> On Tue, Mar 29, 2011 at 11:35:09AM -0400, Rik van Riel wrote:
>> On 03/29/2011 12:36 AM, James Bottomley wrote:
>> > Hi All,
>> >
>> > Since LSF is less than a week away, the programme committee put together
>> > a just in time preliminary agenda for LSF.  As you can see there is
>> > still plenty of empty space, which you can make suggestions
>>
>> There have been a few patches upstream by people for who
>> page allocation latency is a concern.
>>
>> It may be worthwhile to have a short discussion on what
>> we can do to keep page allocation (and direct reclaim?)
>> latencies down to a minimum, reducing the slowdown that
>> direct reclaim introduces on some workloads.
>
> I don't see the patches you refer to, but checking schedule we've a
> slot with Mel&Minchan about "Reclaim, compaction and LRU
> ordering". Compaction only applies to high order allocations and it
> changes nothing to PAGE_SIZE allocations, but it surely has lower
> latency than the older lumpy reclaim logic so overall it should be a
> net improvement compared to what we had before.
>
> Should the latency issues be discussed in that track?
>
> The MM schedule has still a free slot 14-14:30 on Monday, I wonder if
> there's interest on a "NUMA automatic migration and scheduling
> awareness" topic or if it's still too vapourware for a real topic and
> we should keep it for offtrack discussions, and maybe we should
> reserve it for something more tangible with patches already floating
> around. Comments welcome.


In page reclaim, I would like to discuss on the magic "8" *
high_wmark() in balance_pgdat(). I recently found the discussion on
thread "too big min_free_kbytes", where I didn't find where we proved
it is still a problem or not. This might not need reserve time slot,
but something I want to learn more on.

--Ying


>
> Thanks,
> Andrea
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 19:09   ` Vivek Goyal
  2011-03-29 20:14     ` Chad Talbott
@ 2011-03-29 20:35     ` Jan Kara
  2011-03-29 21:08       ` Greg Thelen
  1 sibling, 1 reply; 166+ messages in thread
From: Jan Kara @ 2011-03-29 20:35 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Chad Talbott, James Bottomley, lsf, linux-fsdevel, gthelen

On Tue 29-03-11 15:09:21, Vivek Goyal wrote:
> On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote:
> > I'd like to propose a discussion topic:
> > 
> > IO-less Dirty Throttling Considered Harmful...
> > 
> 
> I see that writeback has extended session at 10.00. I am assuming
> IO less throttling will be discussed there. Is it possible to 
> discuss its effect on block cgroups there? I am not sure enough
> time is there because it ties in memory cgroup also.
> 
> Or there is a session at 12.30 "memcg dirty limits and writeback", it
> can probably be discussed there too.
  Yes, I'd like to have this discussion in this session if Greg agrees.
We've been discussing about how to combine IO-less throttling and memcg
awareness of the writeback and Greg was designing some framework to do
this... Greg?

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-29 20:35     ` Ying Han
@ 2011-03-29 20:39       ` Ying Han
  2011-03-29 20:45       ` Andrea Arcangeli
  1 sibling, 0 replies; 166+ messages in thread
From: Ying Han @ 2011-03-29 20:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Rik van Riel; +Cc: lsf, linux-mm, Hugh Dickins

On Tue, Mar 29, 2011 at 1:35 PM, Ying Han <yinghan@google.com> wrote:
> On Tue, Mar 29, 2011 at 12:05 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>> Hi Rik, Hugh and everyone,
>>
>> On Tue, Mar 29, 2011 at 11:35:09AM -0400, Rik van Riel wrote:
>>> On 03/29/2011 12:36 AM, James Bottomley wrote:
>>> > Hi All,
>>> >
>>> > Since LSF is less than a week away, the programme committee put together
>>> > a just in time preliminary agenda for LSF.  As you can see there is
>>> > still plenty of empty space, which you can make suggestions
>>>
>>> There have been a few patches upstream by people for who
>>> page allocation latency is a concern.
>>>
>>> It may be worthwhile to have a short discussion on what
>>> we can do to keep page allocation (and direct reclaim?)
>>> latencies down to a minimum, reducing the slowdown that
>>> direct reclaim introduces on some workloads.
>>
>> I don't see the patches you refer to, but checking schedule we've a
>> slot with Mel&Minchan about "Reclaim, compaction and LRU
>> ordering". Compaction only applies to high order allocations and it
>> changes nothing to PAGE_SIZE allocations, but it surely has lower
>> latency than the older lumpy reclaim logic so overall it should be a
>> net improvement compared to what we had before.
>>
>> Should the latency issues be discussed in that track?
>>
>> The MM schedule has still a free slot 14-14:30 on Monday, I wonder if
>> there's interest on a "NUMA automatic migration and scheduling
>> awareness" topic or if it's still too vapourware for a real topic and
>> we should keep it for offtrack discussions, and maybe we should
>> reserve it for something more tangible with patches already floating
>> around. Comments welcome.
>
>
> In page reclaim, I would like to discuss on the magic "8" *
> high_wmark() in balance_pgdat(). I recently found the discussion on
> thread "too big min_free_kbytes", where I didn't find where we proved
> it is still a problem or not. This might not need reserve time slot,
> but something I want to learn more on.

well, forgot to mention. I also noticed that has been changed in mmotm
by a "balance_gap". In general, I would like to understand why we can
not stick on high_wmark for kswapd regardless of zones.

Thanks

--Ying

>
> --Ying
>
>
>>
>> Thanks,
>> Andrea
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-29 20:35     ` Ying Han
  2011-03-29 20:39       ` Ying Han
@ 2011-03-29 20:45       ` Andrea Arcangeli
  2011-03-29 20:53         ` Ying Han
  1 sibling, 1 reply; 166+ messages in thread
From: Andrea Arcangeli @ 2011-03-29 20:45 UTC (permalink / raw)
  To: Ying Han; +Cc: Rik van Riel, lsf, linux-mm, Hugh Dickins

On Tue, Mar 29, 2011 at 01:35:24PM -0700, Ying Han wrote:
> In page reclaim, I would like to discuss on the magic "8" *
> high_wmark() in balance_pgdat(). I recently found the discussion on
> thread "too big min_free_kbytes", where I didn't find where we proved
> it is still a problem or not. This might not need reserve time slot,
> but something I want to learn more on.

That is merged in 2.6.39-rc1. It's hopefully working good enough. We
still use high+balance_gap but the balance_gap isn't high*8 anymore. I
still think the balance_gap may as well be zero but the gap now is
small enough (not 600M on 4G machine anymore) that it's ok and this
was a safer change.

This is an LRU ordering issue to try to keep the lru balance across
the zones and not just rotate a lot a single one. I think it can be
covered in the LRU ordering topic too. But we could also expand it to
a different slot if we expect too many issues to showup in that
slot... Hugh what's your opinion?

The subtopics that comes to mind for that topic so far would be:

- reclaim latency
- compaction issues (Mel)
- lru ordering altered by compaction/migrate/khugepaged or other
  features requiring lru page isolation (Minchan)
- lru rotation balance across zones in kswapd (balance_gap) (Ying)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-29 20:45       ` Andrea Arcangeli
@ 2011-03-29 20:53         ` Ying Han
  0 siblings, 0 replies; 166+ messages in thread
From: Ying Han @ 2011-03-29 20:53 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Rik van Riel, lsf, linux-mm, Hugh Dickins

On Tue, Mar 29, 2011 at 1:45 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Tue, Mar 29, 2011 at 01:35:24PM -0700, Ying Han wrote:
>> In page reclaim, I would like to discuss on the magic "8" *
>> high_wmark() in balance_pgdat(). I recently found the discussion on
>> thread "too big min_free_kbytes", where I didn't find where we proved
>> it is still a problem or not. This might not need reserve time slot,
>> but something I want to learn more on.
>
> That is merged in 2.6.39-rc1. It's hopefully working good enough. We
> still use high+balance_gap but the balance_gap isn't high*8 anymore. I
> still think the balance_gap may as well be zero but the gap now is
> small enough (not 600M on 4G machine anymore) that it's ok and this
> was a safer change.
>
> This is an LRU ordering issue to try to keep the lru balance across
> the zones and not just rotate a lot a single one. I think it can be
> covered in the LRU ordering topic too. But we could also expand it to
> a different slot if we expect too many issues to showup in that
> slot... Hugh what's your opinion?

Yes, that is what I got from the thread discussion and thank you for
confirming that. Guess my question is :

Do we need to do balance across zones by giving the fact that each
zone does its own balancing?
What is the problem we saw without doing the cross-zone balancing?

I don't have data to back-up either way, and that is something I am
interested too :)

--Ying


>
> The subtopics that comes to mind for that topic so far would be:
>
> - reclaim latency
> - compaction issues (Mel)
> - lru ordering altered by compaction/migrate/khugepaged or other
>  features requiring lru page isolation (Minchan)
> - lru rotation balance across zones in kswapd (balance_gap) (Ying)
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 20:35     ` Jan Kara
@ 2011-03-29 21:08       ` Greg Thelen
  0 siblings, 0 replies; 166+ messages in thread
From: Greg Thelen @ 2011-03-29 21:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: Vivek Goyal, Chad Talbott, James Bottomley, lsf, linux-fsdevel

On Tue, Mar 29, 2011 at 1:35 PM, Jan Kara <jack@suse.cz> wrote:
> On Tue 29-03-11 15:09:21, Vivek Goyal wrote:
>> On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote:
>> > I'd like to propose a discussion topic:
>> >
>> > IO-less Dirty Throttling Considered Harmful...
>> >
>>
>> I see that writeback has extended session at 10.00. I am assuming
>> IO less throttling will be discussed there. Is it possible to
>> discuss its effect on block cgroups there? I am not sure enough
>> time is there because it ties in memory cgroup also.
>>
>> Or there is a session at 12.30 "memcg dirty limits and writeback", it
>> can probably be discussed there too.
>  Yes, I'd like to have this discussion in this session if Greg agrees.

It's fine with me if the morning session considers IO-less dirty
throttling with block cgroup service differentiation, but defers memcg
aspects to 12:30.

> We've been discussing about how to combine IO-less throttling and memcg
> awareness of the writeback and Greg was designing some framework to do
> this... Greg?

My initial patches are between memcg and the current IO-full
throttling code.  However, the framework ideally will also allow for
IO-less dirty throttling with memcg.  I have not wrapped my head
around how this should  work with block cgroup isolation.  I am hoping
others can help out with the block aspects.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-29 19:05   ` [Lsf] " Andrea Arcangeli
  2011-03-29 20:35     ` Ying Han
@ 2011-03-29 21:22     ` Rik van Riel
  2011-03-29 22:38       ` Andrea Arcangeli
  2011-03-29 22:13     ` Minchan Kim
  2 siblings, 1 reply; 166+ messages in thread
From: Rik van Riel @ 2011-03-29 21:22 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: lsf, linux-mm, Hugh Dickins

On 03/29/2011 03:05 PM, Andrea Arcangeli wrote:

> Should the latency issues be discussed in that track?

Sounds good.  I don't think we'll spend more than 5-10 minutes
on the latency thing, probably less than that.

> The MM schedule has still a free slot 14-14:30 on Monday, I wonder if
> there's interest on a "NUMA automatic migration and scheduling
> awareness" topic or if it's still too vapourware for a real topic and
> we should keep it for offtrack discussions,

I believe that problem is complex enough to warrant a 30
minute discussion.  Even if we do not come up with solutions,
it would be a good start if we could all agree on the problem.

Things this complex often end up getting shot down later, not
because people do not agree on the solution, but because people
do not agree on the PROBLEM (and the patches in question only
solve a subset of the problem).

I would be willing to lead the NUMA scheduling and memory
allocation discussion.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-29 19:05   ` [Lsf] " Andrea Arcangeli
  2011-03-29 20:35     ` Ying Han
  2011-03-29 21:22     ` Rik van Riel
@ 2011-03-29 22:13     ` Minchan Kim
  2011-03-29 23:12       ` Andrea Arcangeli
  2011-03-30 16:17       ` Mel Gorman
  2 siblings, 2 replies; 166+ messages in thread
From: Minchan Kim @ 2011-03-29 22:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, lsf, linux-mm, Hugh Dickins, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Wed, Mar 30, 2011 at 4:05 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi Rik, Hugh and everyone,
>
> On Tue, Mar 29, 2011 at 11:35:09AM -0400, Rik van Riel wrote:
>> On 03/29/2011 12:36 AM, James Bottomley wrote:
>> > Hi All,
>> >
>> > Since LSF is less than a week away, the programme committee put together
>> > a just in time preliminary agenda for LSF.  As you can see there is
>> > still plenty of empty space, which you can make suggestions
>>
>> There have been a few patches upstream by people for who
>> page allocation latency is a concern.
>>
>> It may be worthwhile to have a short discussion on what
>> we can do to keep page allocation (and direct reclaim?)
>> latencies down to a minimum, reducing the slowdown that
>> direct reclaim introduces on some workloads.
>
> I don't see the patches you refer to, but checking schedule we've a
> slot with Mel&Minchan about "Reclaim, compaction and LRU
> ordering". Compaction only applies to high order allocations and it
> changes nothing to PAGE_SIZE allocations, but it surely has lower
> latency than the older lumpy reclaim logic so overall it should be a
> net improvement compared to what we had before.
>
> Should the latency issues be discussed in that track?

It's okay to me. LRU ordering issue wouldn't take much time.
But I am not sure Mel would have a long time. :)

About reclaim latency, I sent a patch in the old days.
http://marc.info/?l=linux-mm&m=129187231129887&w=4

And some guys on embedded had a concern about latency.
They want OOM rather than eviction of working set and undeterministic
latency of reclaim.

As another issue of related to latency, there is a OOM.
To accelerate task's exit, we raise a priority of the victim process
but it had a problem so Kosaki decided reverting the patch. It's
totally related to latency issue but it would

In addition, Kame and I sent a patch to prevent forkbomb. Kame's
apprach is to track the history of mm and mine is to use sysrq to kill
recently created tasks. The approaches have pros and cons.
But anyone seem to not has a interest about forkbomb protection.
So I want to listen other's opinion we really need it

I am not sure this could become a topic of LSF/MM
If it is proper, I would like to talk above issues in "Reclaim,
compaction and LRU ordering" slot.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-29 21:22     ` Rik van Riel
@ 2011-03-29 22:38       ` Andrea Arcangeli
  0 siblings, 0 replies; 166+ messages in thread
From: Andrea Arcangeli @ 2011-03-29 22:38 UTC (permalink / raw)
  To: Rik van Riel; +Cc: lsf, linux-mm, Hugh Dickins

On Tue, Mar 29, 2011 at 05:22:20PM -0400, Rik van Riel wrote:
> I believe that problem is complex enough to warrant a 30
> minute discussion.  Even if we do not come up with solutions,
> it would be a good start if we could all agree on the problem.
> 
> Things this complex often end up getting shot down later, not
> because people do not agree on the solution, but because people
> do not agree on the PROBLEM (and the patches in question only
> solve a subset of the problem).
> 
> I would be willing to lead the NUMA scheduling and memory
> allocation discussion.

Well, for now I added it to schedule.

The problem I think exists as without bindings and NUMA hinting, the
current automatic behavior deviates significantly from the tuned-NUMA
binding performance as also shown by the migrate-on-fault patches.

Now THP pages can't even be migrated before being splitted, and
migrating 2M on fault isn't optimal even after we teach migrate how to
migrate 2M pages without splitting [a separate issue]. Migrate on
fault to me looks a great improvement but it doesn't look the most
optimal design we can have as the page fault can be avoided with a
background migration from kernel thread, without requiring page faults.

Hugh if you think of some other topic being more urgent feel free to
update. One other topic that comes to mind right now that could be
good candidate for the floating slot would be Hugh's OOM topic. I
think it'd be nice to somehow squeeze that into the schedule too if
Hugh has interest to lead it.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* RE: Preliminary Agenda and Activities for LSF
  2011-03-29 20:23                 ` Mike Snitzer
@ 2011-03-29 23:09                     ` Shyam_Iyer
  0 siblings, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-29 23:09 UTC (permalink / raw)
  To: snitzer; +Cc: linux-scsi, lsf, linux-fsdevel, rwheeler, vgoyal, dm-devel



> -----Original Message-----
> From: Mike Snitzer [mailto:snitzer@redhat.com]
> Sent: Tuesday, March 29, 2011 4:24 PM
> To: Iyer, Shyam
> Cc: linux-scsi@vger.kernel.org; lsf@lists.linux-foundation.org; linux-
> fsdevel@vger.kernel.org; rwheeler@redhat.com; vgoyal@redhat.com;
> device-mapper development
> Subject: Re: Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29 2011 at  4:12pm -0400,
> Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:
> 
> >
> >
> > > -----Original Message-----
> > > From: Mike Snitzer [mailto:snitzer@redhat.com]
> > > Sent: Tuesday, March 29, 2011 4:00 PM
> > > To: Iyer, Shyam
> > > Cc: vgoyal@redhat.com; lsf@lists.linux-foundation.org; linux-
> > > scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org;
> > > rwheeler@redhat.com; device-mapper development
> > > Subject: Re: Preliminary Agenda and Activities for LSF
> > >
> > > On Tue, Mar 29 2011 at  3:13pm -0400,
> > > Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:
> > >
> > > > > > > Above is pretty generic. Do you have specific
> > > needs/ideas/concerns?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vivek
> > > > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to
> limit
> > > I/O
> > > > > b/w via cgroups. Such bandwidth manipulations are network
> switch
> > > driven
> > > > > and cgroups never take care of these events from the Ethernet
> > > driver.
> > > > >
> > > > > So if IO is going over network and actual bandwidth control is
> > > taking
> > > > > place by throttling ethernet traffic then one does not have to
> > > specify
> > > > > block cgroup throttling policy and hence no need for cgroups to
> be
> > > > > worried
> > > > > about ethernet driver events?
> > > > >
> > > > > I think I am missing something here.
> > > > >
> > > > > Vivek
> > > > Well.. here is the catch.. example scenario..
> > > >
> > > > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1
> > > multipathed together. Let us say round-robin policy.
> > > >
> > > > - The cgroup profile is to limit I/O bandwidth to 40% of the
> > > multipathed I/O bandwidth. But the switch may have limited the I/O
> > > bandwidth to 40% for the corresponding vlan associated with one of
> the
> > > eth interface say eth1
> > > >
> > > > The computation that the bandwidth configured is 40% of the
> available
> > > bandwidth is false in this case.  What we need to do is possibly
> push
> > > more I/O through eth0 as it is allowed to run at 100% of bandwidth
> by
> > > the switch.
> > > >
> > > > Now this is a dynamic decision and multipathing layer should take
> > > care of it.. but it would need a hint..
> > >
> > > No hint should be needed.  Just use one of the newer multipath path
> > > selectors that are dynamic by design: "queue-length" or "service-
> time".
> > >
> > > This scenario is exactly what those path selectors are meant to
> > > address.
> > >
> > > Mike
> >
> > Since iSCSI multipaths are essentially sessions one could configure
> > more than one session through the same ethX interface. The sessions
> > need not be going to the same LUN and hence not governed by the same
> > multipath selector but the bandwidth policy group would be for a
> group
> > of resources.
> 
> Then the sessions don't correspond to the same backend LUN (and by
> definition aren't part of the same mpath device).  You're really all
> over the map with your talking points.
> 
> I'm having a hard time following you.
> 
> Mike

Let me back up here.. this has to be thought in not only the traditional Ethernet sense but also in a Data Centre Bridged environment. I shouldn't have wandered into the multipath constructs..

I think the statement on not going to the same LUN was a little erroneous. I meant different /dev/sdXs.. and hence different block I/O queues.

Each I/O queue could be thought of as a bandwidth queue class being serviced through a corresponding network adapter's queue(assuming a multiqueue capable adapter)

Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group corresponding to a weightage of 20% of the I/O bandwidth the user has configured this weight thinking that this will correspond to say 200Mb of bandwidth.

Let us say the network bandwidth on the corresponding network queues corresponding was reduced by the DCB capable switch...
We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed.

In such a scenario the option is to move I/O to a different bandwidth priority queue in the network adapter. This could be moving I/O to a new network queue in eth0 or another queue in eth1 .. 

This requires mapping the block queue to the new network queue.

One way of solving this is what is getting into the open-iscsi world i.e. creating a session tagged to the relevant DCB priority and thus the session gets mapped to the relevant tc queue which ultimately maps to one of the network adapters multiqueue..

But when multipath fails over to the different session path then the DCB bandwidth priority will not move with it..

Ok one could argue that is a user mistake to have configured bandwidth priorities differently but it may so happen that the bandwidth priority was just dynamically changed by the switch for the particular queue.

Although I gave an example of a DCB environment but we could definitely look at doing a 1:n map of block queues to network adapter queues for non-DCB environments too..


-Shyam


^ permalink raw reply	[flat|nested] 166+ messages in thread

* RE: Preliminary Agenda and Activities for LSF
@ 2011-03-29 23:09                     ` Shyam_Iyer
  0 siblings, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-29 23:09 UTC (permalink / raw)
  To: snitzer; +Cc: linux-scsi, lsf, linux-fsdevel, rwheeler, vgoyal, dm-devel



> -----Original Message-----
> From: Mike Snitzer [mailto:snitzer@redhat.com]
> Sent: Tuesday, March 29, 2011 4:24 PM
> To: Iyer, Shyam
> Cc: linux-scsi@vger.kernel.org; lsf@lists.linux-foundation.org; linux-
> fsdevel@vger.kernel.org; rwheeler@redhat.com; vgoyal@redhat.com;
> device-mapper development
> Subject: Re: Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29 2011 at  4:12pm -0400,
> Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:
> 
> >
> >
> > > -----Original Message-----
> > > From: Mike Snitzer [mailto:snitzer@redhat.com]
> > > Sent: Tuesday, March 29, 2011 4:00 PM
> > > To: Iyer, Shyam
> > > Cc: vgoyal@redhat.com; lsf@lists.linux-foundation.org; linux-
> > > scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org;
> > > rwheeler@redhat.com; device-mapper development
> > > Subject: Re: Preliminary Agenda and Activities for LSF
> > >
> > > On Tue, Mar 29 2011 at  3:13pm -0400,
> > > Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:
> > >
> > > > > > > Above is pretty generic. Do you have specific
> > > needs/ideas/concerns?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vivek
> > > > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to
> limit
> > > I/O
> > > > > b/w via cgroups. Such bandwidth manipulations are network
> switch
> > > driven
> > > > > and cgroups never take care of these events from the Ethernet
> > > driver.
> > > > >
> > > > > So if IO is going over network and actual bandwidth control is
> > > taking
> > > > > place by throttling ethernet traffic then one does not have to
> > > specify
> > > > > block cgroup throttling policy and hence no need for cgroups to
> be
> > > > > worried
> > > > > about ethernet driver events?
> > > > >
> > > > > I think I am missing something here.
> > > > >
> > > > > Vivek
> > > > Well.. here is the catch.. example scenario..
> > > >
> > > > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1
> > > multipathed together. Let us say round-robin policy.
> > > >
> > > > - The cgroup profile is to limit I/O bandwidth to 40% of the
> > > multipathed I/O bandwidth. But the switch may have limited the I/O
> > > bandwidth to 40% for the corresponding vlan associated with one of
> the
> > > eth interface say eth1
> > > >
> > > > The computation that the bandwidth configured is 40% of the
> available
> > > bandwidth is false in this case.  What we need to do is possibly
> push
> > > more I/O through eth0 as it is allowed to run at 100% of bandwidth
> by
> > > the switch.
> > > >
> > > > Now this is a dynamic decision and multipathing layer should take
> > > care of it.. but it would need a hint..
> > >
> > > No hint should be needed.  Just use one of the newer multipath path
> > > selectors that are dynamic by design: "queue-length" or "service-
> time".
> > >
> > > This scenario is exactly what those path selectors are meant to
> > > address.
> > >
> > > Mike
> >
> > Since iSCSI multipaths are essentially sessions one could configure
> > more than one session through the same ethX interface. The sessions
> > need not be going to the same LUN and hence not governed by the same
> > multipath selector but the bandwidth policy group would be for a
> group
> > of resources.
> 
> Then the sessions don't correspond to the same backend LUN (and by
> definition aren't part of the same mpath device).  You're really all
> over the map with your talking points.
> 
> I'm having a hard time following you.
> 
> Mike

Let me back up here.. this has to be thought in not only the traditional Ethernet sense but also in a Data Centre Bridged environment. I shouldn't have wandered into the multipath constructs..

I think the statement on not going to the same LUN was a little erroneous. I meant different /dev/sdXs.. and hence different block I/O queues.

Each I/O queue could be thought of as a bandwidth queue class being serviced through a corresponding network adapter's queue(assuming a multiqueue capable adapter)

Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group corresponding to a weightage of 20% of the I/O bandwidth the user has configured this weight thinking that this will correspond to say 200Mb of bandwidth.

Let us say the network bandwidth on the corresponding network queues corresponding was reduced by the DCB capable switch...
We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed.

In such a scenario the option is to move I/O to a different bandwidth priority queue in the network adapter. This could be moving I/O to a new network queue in eth0 or another queue in eth1 .. 

This requires mapping the block queue to the new network queue.

One way of solving this is what is getting into the open-iscsi world i.e. creating a session tagged to the relevant DCB priority and thus the session gets mapped to the relevant tc queue which ultimately maps to one of the network adapters multiqueue..

But when multipath fails over to the different session path then the DCB bandwidth priority will not move with it..

Ok one could argue that is a user mistake to have configured bandwidth priorities differently but it may so happen that the bandwidth priority was just dynamically changed by the switch for the particular queue.

Although I gave an example of a DCB environment but we could definitely look at doing a 1:n map of block queues to network adapter queues for non-DCB environments too..


-Shyam


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-29 22:13     ` Minchan Kim
@ 2011-03-29 23:12       ` Andrea Arcangeli
  2011-03-30 16:17       ` Mel Gorman
  1 sibling, 0 replies; 166+ messages in thread
From: Andrea Arcangeli @ 2011-03-29 23:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, lsf, linux-mm, Hugh Dickins, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Wed, Mar 30, 2011 at 07:13:42AM +0900, Minchan Kim wrote:
> It's okay to me. LRU ordering issue wouldn't take much time.
> But I am not sure Mel would have a long time. :)
> 
> About reclaim latency, I sent a patch in the old days.
> http://marc.info/?l=linux-mm&m=129187231129887&w=4
> 
> And some guys on embedded had a concern about latency.
> They want OOM rather than eviction of working set and undeterministic
> latency of reclaim.
> 
> As another issue of related to latency, there is a OOM.
> To accelerate task's exit, we raise a priority of the victim process
> but it had a problem so Kosaki decided reverting the patch. It's
> totally related to latency issue but it would
> 
> In addition, Kame and I sent a patch to prevent forkbomb. Kame's
> apprach is to track the history of mm and mine is to use sysrq to kill
> recently created tasks. The approaches have pros and cons.
> But anyone seem to not has a interest about forkbomb protection.
> So I want to listen other's opinion we really need it

The sysrq won't help on large servers, virtual clouds or android
devices where there's no keyboard attached and not sysrqd running. So
I'd prefer not requiring sysrq for that, even if you can run a sysrq,
few people would know how to activate an obscure sysrq feature meant
to be more selective than SYSRQ+I (or SYSRQ+B..) for forkbombs, if it
works without human intervention I think it's more valuable.

> I am not sure this could become a topic of LSF/MM

Now the forkbomb detection would fit nicely as a subtopic into Hugh's
OOM topic. I found one more spare MM slot on 5 April 12:30-13, so for
now I filled it with OOM and forkbomb.

> If it is proper, I would like to talk above issues in "Reclaim,
> compaction and LRU ordering" slot.

Sounds good to me. I added "allocation latency" to your slot.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:16 ` Ric Wheeler
                     ` (3 preceding siblings ...)
  2011-03-29 20:29   ` Jan Kara
@ 2011-03-30  0:33   ` Mingming Cao
  2011-03-30  2:17     ` Dave Chinner
  4 siblings, 1 reply; 166+ messages in thread
From: Mingming Cao @ 2011-03-30  0:33 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: James Bottomley, lsf, linux-fsdevel, linux-scsi,
	device-mapper development

On Tue, 2011-03-29 at 07:16 -0400, Ric Wheeler wrote:
> On 03/29/2011 12:36 AM, James Bottomley wrote:
> > Hi All,
> >
> > Since LSF is less than a week away, the programme committee put together
> > a just in time preliminary agenda for LSF.  As you can see there is
> > still plenty of empty space, which you can make suggestions (to this
> > list with appropriate general list cc's) for filling:
> >
> > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> >
> > If you don't make suggestions, the programme committee will feel
> > empowered to make arbitrary assignments based on your topic and attendee
> > email requests ...
> >
> > We're still not quite sure what rooms we will have at the Kabuki, but
> > we'll add them to the spreadsheet when we know (they should be close to
> > each other).
> >
> > The spreadsheet above also gives contact information for all the
> > attendees and the programme committee.
> >
> > Yours,
> >
> > James Bottomley
> > on behalf of LSF/MM Programme Committee
> >
> 
> Here are a few topic ideas:
> 
> (1)  The first topic that might span IO & FS tracks (or just pull in device 
> mapper people to an FS track) could be adding new commands that would allow 
> users to grow/shrink/etc file systems in a generic way.  The thought I had was 
> that we have a reasonable model that we could reuse for these new commands like 
> mount and mount.fs or fsck and fsck.fs. With btrfs coming down the road, it 
> could be nice to identify exactly what common operations users want to do and 
> agree on how to implement them. Alasdair pointed out in the upstream thread that 
> we had a prototype here in fsadm.
> 
> (2) Very high speed, low latency SSD devices and testing. Have we settled on the 
> need for these devices to all have block level drivers? For S-ATA or SAS 
> devices, are there known performance issues that require enhancements in 
> somewhere in the stack?
> 
> (3) The union mount versus overlayfs debate - pros and cons. What each do well, 
> what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes 
> in Al's VFS session?)
> 

Ric,

May I propose some discussion about concurrent direct IO support for
ext4?

Direct IO write are serialized by the single i_mutex lock.  This lock
contention becomes significant when running database or direct IO heavy
workload on guest, where  the host pass a file image to guest as a block
device. All the parallel IOs in guests are being serialized by the
i_mutex lock on the host disk image file. This greatly penalize the data
base application performance in KVM. 

I am looking for some discussion about removing the i_mutex lock in the
direct IO write code path for ext4, when multiple threads are
direct write to different offset of the same file. This would require
some way to track the in-fly DIO IO range, either done at ext4 level or
above th vfs layer. 


Thanks,



^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30  0:33   ` Mingming Cao
@ 2011-03-30  2:17     ` Dave Chinner
  2011-03-30 11:13       ` Theodore Tso
  2011-03-30 21:49       ` Mingming Cao
  0 siblings, 2 replies; 166+ messages in thread
From: Dave Chinner @ 2011-03-30  2:17 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Ric Wheeler, James Bottomley, lsf, linux-fsdevel, linux-scsi,
	device-mapper development

On Tue, Mar 29, 2011 at 05:33:30PM -0700, Mingming Cao wrote:
> Ric,
> 
> May I propose some discussion about concurrent direct IO support for
> ext4?

Just look at the way XFS does it and copy that?  i.e. it has a
filesytem level IO lock and an inode lock both with shared/exclusive
semantics. These lie below the i_mutex (i.e. locking order is
i_mutex, i_iolock, i_ilock), and effectively result in the i_mutex
only being used for VFS level synchronisation and as such is rarely
used inside XFS itself.

Inode attribute operations are protected by the inode lock, while IO
operations and truncation synchronisation is provided by the IO
lock.

So for buffered IO, the IO lock is used in shared mode for reads
and exclusive mode for writes. This gives normal POSIX buffered IO
semantics and holding the IO lock exclusive allows sycnhronisation
against new IO of any kind for truncate.

For direct IO, the IO lock is always taken in shared mode, so we can
have concurrent read and write operations taking place at once
regardless of the offset into the file.

> I am looking for some discussion about removing the i_mutex lock in the
> direct IO write code path for ext4, when multiple threads are
> direct write to different offset of the same file. This would require
> some way to track the in-fly DIO IO range, either done at ext4 level or
> above th vfs layer. 

Direct IO semantics have always been that the application is allowed
to overlap IO to the same range if it wants to. The result is
undefined (just like issuing overlapping reads and writes to a disk
at the same time) so it's the application's responsibility to avoid
overlapping IO if it is a problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 17:35 ` [Lsf] Preliminary Agenda and Activities for LSF Chad Talbott
  2011-03-29 19:09   ` Vivek Goyal
@ 2011-03-30  4:18   ` Dave Chinner
  2011-03-30 15:37     ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) Vivek Goyal
  1 sibling, 1 reply; 166+ messages in thread
From: Dave Chinner @ 2011-03-30  4:18 UTC (permalink / raw)
  To: Chad Talbott; +Cc: James Bottomley, lsf, linux-fsdevel, Curt Wohlgemuth

On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote:
> I'd like to propose a discussion topic:
> 
> IO-less Dirty Throttling Considered Harmful...
> 
> to isolation and cgroup IO schedulers in general.

Why is that, exactly? The current writeback infrastructure isn't
cgroup aware at all, so isn't that the problem you need to solve
first?  i.e. how to delegate page cache writeback from
one context to anotheri and account for it correctly?

Once you solve that problem, triggering cgroup specific writeback
from the throttling code is the same regardless of whether we
are doing IO directly from the throttling code or via a separate
flusher thread. Hence I don't really understand why you think
IO-less throttling is really a problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 23:09                     ` Shyam_Iyer
  (?)
@ 2011-03-30  5:58                     ` Hannes Reinecke
  2011-03-30 14:02                       ` James Bottomley
  -1 siblings, 1 reply; 166+ messages in thread
From: Hannes Reinecke @ 2011-03-30  5:58 UTC (permalink / raw)
  To: Shyam_Iyer; +Cc: snitzer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On 03/30/2011 01:09 AM, Shyam_Iyer@dell.com wrote:
>
> Let me back up here.. this has to be thought in not only the traditional Ethernet
 > sense but also in a Data Centre Bridged environment. I shouldn't 
have wandered
 > into the multipath constructs..
>
> I think the statement on not going to the same LUN was a little erroneous. I meant
 > different /dev/sdXs.. and hence different block I/O queues.
>
> Each I/O queue could be thought of as a bandwidth queue class being serviced through
 > a corresponding network adapter's queue(assuming a multiqueue 
capable adapter)
>
> Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group
 > corresponding to a weightage of 20% of the I/O bandwidth the user 
has configured
 > this weight thinking that this will correspond to say 200Mb of 
bandwidth.
>
> Let us say the network bandwidth on the corresponding network queues corresponding
 > was reduced by the DCB capable switch...
> We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed.
>
> In such a scenario the option is to move I/O to a different bandwidth priority queue
 > in the network adapter. This could be moving I/O to a new network 
queue in eth0 or
 > another queue in eth1 ..
>
> This requires mapping the block queue to the new network queue.
>
> One way of solving this is what is getting into the open-iscsi world i.e. creating
 > a session tagged to the relevant DCB priority and thus the 
session gets mapped
 > to the relevant tc queue which ultimately maps to one of the 
network adapters multiqueue..
>
> But when multipath fails over to the different session path then the DCB bandwidth
 > priority will not move with it..
>
> Ok one could argue that is a user mistake to have configured bandwidth priorities
 > differently but it may so happen that the bandwidth priority was 
just dynamically
 > changed by the switch for the particular queue.
>
> Although I gave an example of a DCB environment but we could definitely look at
 > doing a 1:n map of block queues to network adapter queues for 
non-DCB environments too..
>
That sounds quite convoluted enough to warrant it's own slot :-)

No, seriously. I think it would be good to have a separate slot 
discussing DCB (be it FCoE or iSCSI) and cgroups.
And how to best align these things.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30  2:17     ` Dave Chinner
@ 2011-03-30 11:13       ` Theodore Tso
  2011-03-30 11:28         ` Ric Wheeler
  2011-03-30 21:49       ` Mingming Cao
  1 sibling, 1 reply; 166+ messages in thread
From: Theodore Tso @ 2011-03-30 11:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mingming Cao, Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi, device-mapper development


On Mar 29, 2011, at 10:17 PM, Dave Chinner wrote:

> Direct IO semantics have always been that the application is allowed
> to overlap IO to the same range if it wants to. The result is
> undefined (just like issuing overlapping reads and writes to a disk
> at the same time) so it's the application's responsibility to avoid
> overlapping IO if it is a problem.

Even if the overlapping read/writes are taking place in different processes?

DIO has never been standardized, and was originally implemented as gentleman's agreements between various database manufacturers and proprietary unix vendors.  The lack of formal specifications of what applications are guaranteed to receive is unfortunate....

-- Ted


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 11:13       ` Theodore Tso
@ 2011-03-30 11:28         ` Ric Wheeler
  2011-03-30 14:07           ` Chris Mason
  2011-04-01 15:19           ` Ted Ts'o
  0 siblings, 2 replies; 166+ messages in thread
From: Ric Wheeler @ 2011-03-30 11:28 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Dave Chinner, lsf, linux-scsi, James Bottomley,
	device-mapper development, linux-fsdevel, Ric Wheeler

On 03/30/2011 07:13 AM, Theodore Tso wrote:
> On Mar 29, 2011, at 10:17 PM, Dave Chinner wrote:
>
>> Direct IO semantics have always been that the application is allowed
>> to overlap IO to the same range if it wants to. The result is
>> undefined (just like issuing overlapping reads and writes to a disk
>> at the same time) so it's the application's responsibility to avoid
>> overlapping IO if it is a problem.
> Even if the overlapping read/writes are taking place in different processes?
>
> DIO has never been standardized, and was originally implemented as gentleman's agreements between various database manufacturers and proprietary unix vendors.  The lack of formal specifications of what applications are guaranteed to receive is unfortunate....
>
> -- Ted

What possible semantics could you have?

If you ever write concurrently from multiple processes without locking, you 
clearly are at the mercy of the scheduler and the underlying storage which could 
fragment a single write into multiple IO's sent to the backend device.

I would agree with Dave, let's not make it overly complicated or try to give 
people "atomic" unbounded size writes just because they set the O_DIRECT flag :)

Ric


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30  5:58                     ` [Lsf] " Hannes Reinecke
@ 2011-03-30 14:02                       ` James Bottomley
  2011-03-30 14:10                         ` Hannes Reinecke
  0 siblings, 1 reply; 166+ messages in thread
From: James Bottomley @ 2011-03-30 14:02 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
> On 03/30/2011 01:09 AM, Shyam_Iyer@dell.com wrote:
> >
> > Let me back up here.. this has to be thought in not only the traditional Ethernet
>  > sense but also in a Data Centre Bridged environment. I shouldn't 
> have wandered
>  > into the multipath constructs..
> >
> > I think the statement on not going to the same LUN was a little erroneous. I meant
>  > different /dev/sdXs.. and hence different block I/O queues.
> >
> > Each I/O queue could be thought of as a bandwidth queue class being serviced through
>  > a corresponding network adapter's queue(assuming a multiqueue 
> capable adapter)
> >
> > Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group
>  > corresponding to a weightage of 20% of the I/O bandwidth the user 
> has configured
>  > this weight thinking that this will correspond to say 200Mb of 
> bandwidth.
> >
> > Let us say the network bandwidth on the corresponding network queues corresponding
>  > was reduced by the DCB capable switch...
> > We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed.
> >
> > In such a scenario the option is to move I/O to a different bandwidth priority queue
>  > in the network adapter. This could be moving I/O to a new network 
> queue in eth0 or
>  > another queue in eth1 ..
> >
> > This requires mapping the block queue to the new network queue.
> >
> > One way of solving this is what is getting into the open-iscsi world i.e. creating
>  > a session tagged to the relevant DCB priority and thus the 
> session gets mapped
>  > to the relevant tc queue which ultimately maps to one of the 
> network adapters multiqueue..
> >
> > But when multipath fails over to the different session path then the DCB bandwidth
>  > priority will not move with it..
> >
> > Ok one could argue that is a user mistake to have configured bandwidth priorities
>  > differently but it may so happen that the bandwidth priority was 
> just dynamically
>  > changed by the switch for the particular queue.
> >
> > Although I gave an example of a DCB environment but we could definitely look at
>  > doing a 1:n map of block queues to network adapter queues for 
> non-DCB environments too..
> >
> That sounds quite convoluted enough to warrant it's own slot :-)
> 
> No, seriously. I think it would be good to have a separate slot 
> discussing DCB (be it FCoE or iSCSI) and cgroups.
> And how to best align these things.

OK, I'll go for that ... Data Centre Bridging; experiences, technologies
and needs ... something like that.  What about virtualisation and open
vSwitch?

James



^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 11:28         ` Ric Wheeler
@ 2011-03-30 14:07           ` Chris Mason
  2011-04-01 15:19           ` Ted Ts'o
  1 sibling, 0 replies; 166+ messages in thread
From: Chris Mason @ 2011-03-30 14:07 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Dave Chinner, lsf, linux-scsi, James Bottomley,
	device-mapper development, linux-fsdevel, Ric Wheeler

Excerpts from Ric Wheeler's message of 2011-03-30 07:28:34 -0400:
> On 03/30/2011 07:13 AM, Theodore Tso wrote:
> > On Mar 29, 2011, at 10:17 PM, Dave Chinner wrote:
> >
> >> Direct IO semantics have always been that the application is allowed
> >> to overlap IO to the same range if it wants to. The result is
> >> undefined (just like issuing overlapping reads and writes to a disk
> >> at the same time) so it's the application's responsibility to avoid
> >> overlapping IO if it is a problem.
> > Even if the overlapping read/writes are taking place in different processes?
> >
> > DIO has never been standardized, and was originally implemented as gentleman's agreements between various database manufacturers and proprietary unix vendors.  The lack of formal specifications of what applications are guaranteed to receive is unfortunate....
> >
> > -- Ted
> 
> What possible semantics could you have?
> 
> If you ever write concurrently from multiple processes without locking, you 
> clearly are at the mercy of the scheduler and the underlying storage which could 
> fragment a single write into multiple IO's sent to the backend device.
> 
> I would agree with Dave, let's not make it overly complicated or try to give 
> people "atomic" unbounded size writes just because they set the O_DIRECT flag :)

We've talked about this with the oracle database people at least, any
concurrent O_DIRECT ios to the same area would be considered a db bug.
As long as it doesn't make the kernel crash or hang, we can return
one of these: http://www.youtube.com/watch?v=rX7wtNOkuHo

IBM might have a different answer, but I don't see how you can have good
results from mixing concurrent IOs.

-chris

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 14:02                       ` James Bottomley
@ 2011-03-30 14:10                         ` Hannes Reinecke
  2011-03-30 14:26                           ` James Bottomley
  0 siblings, 1 reply; 166+ messages in thread
From: Hannes Reinecke @ 2011-03-30 14:10 UTC (permalink / raw)
  To: James Bottomley
  Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On 03/30/2011 04:02 PM, James Bottomley wrote:
> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
>> On 03/30/2011 01:09 AM, Shyam_Iyer@dell.com wrote:
>>>
>>> Let me back up here.. this has to be thought in not only the traditional Ethernet
>>   >  sense but also in a Data Centre Bridged environment. I shouldn't
>> have wandered
>>   >  into the multipath constructs..
>>>
>>> I think the statement on not going to the same LUN was a little erroneous. I meant
>>   >  different /dev/sdXs.. and hence different block I/O queues.
>>>
>>> Each I/O queue could be thought of as a bandwidth queue class being serviced through
>>   >  a corresponding network adapter's queue(assuming a multiqueue
>> capable adapter)
>>>
>>> Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group
>>   >  corresponding to a weightage of 20% of the I/O bandwidth the user
>> has configured
>>   >  this weight thinking that this will correspond to say 200Mb of
>> bandwidth.
>>>
>>> Let us say the network bandwidth on the corresponding network queues corresponding
>>   >  was reduced by the DCB capable switch...
>>> We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed.
>>>
>>> In such a scenario the option is to move I/O to a different bandwidth priority queue
>>   >  in the network adapter. This could be moving I/O to a new network
>> queue in eth0 or
>>   >  another queue in eth1 ..
>>>
>>> This requires mapping the block queue to the new network queue.
>>>
>>> One way of solving this is what is getting into the open-iscsi world i.e. creating
>>   >  a session tagged to the relevant DCB priority and thus the
>> session gets mapped
>>   >  to the relevant tc queue which ultimately maps to one of the
>> network adapters multiqueue..
>>>
>>> But when multipath fails over to the different session path then the DCB bandwidth
>>   >  priority will not move with it..
>>>
>>> Ok one could argue that is a user mistake to have configured bandwidth priorities
>>   >  differently but it may so happen that the bandwidth priority was
>> just dynamically
>>   >  changed by the switch for the particular queue.
>>>
>>> Although I gave an example of a DCB environment but we could definitely look at
>>   >  doing a 1:n map of block queues to network adapter queues for
>> non-DCB environments too..
>>>
>> That sounds quite convoluted enough to warrant it's own slot :-)
>>
>> No, seriously. I think it would be good to have a separate slot
>> discussing DCB (be it FCoE or iSCSI) and cgroups.
>> And how to best align these things.
>
> OK, I'll go for that ... Data Centre Bridging; experiences, technologies
> and needs ... something like that.  What about virtualisation and open
> vSwitch?
>
Hmm. Not qualified enough to talk about the latter; I was more 
envisioning the storage-related aspects here (multiqueue mapping, 
QoS classes etc). With virtualisation and open vSwitch we're more in
the network side of things; doubt open vSwitch can do DCB.
And even if it could, virtio certainly can't :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 14:10                         ` Hannes Reinecke
@ 2011-03-30 14:26                           ` James Bottomley
  2011-03-30 14:55                             ` Hannes Reinecke
  0 siblings, 1 reply; 166+ messages in thread
From: James Bottomley @ 2011-03-30 14:26 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote:
> On 03/30/2011 04:02 PM, James Bottomley wrote:
> > On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
> >> No, seriously. I think it would be good to have a separate slot
> >> discussing DCB (be it FCoE or iSCSI) and cgroups.
> >> And how to best align these things.
> >
> > OK, I'll go for that ... Data Centre Bridging; experiences, technologies
> > and needs ... something like that.  What about virtualisation and open
> > vSwitch?
> >
> Hmm. Not qualified enough to talk about the latter; I was more 
> envisioning the storage-related aspects here (multiqueue mapping, 
> QoS classes etc). With virtualisation and open vSwitch we're more in
> the network side of things; doubt open vSwitch can do DCB.
> And even if it could, virtio certainly can't :-)

Technically, the topic DCB is about Data Centre Ethernet enhancements
and converged networks ... that's why it's naturally allied to virtual
switching.

I was thinking we might put up a panel of vendors to get us all an
education on the topic ...

James



^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 14:26                           ` James Bottomley
@ 2011-03-30 14:55                             ` Hannes Reinecke
  2011-03-30 15:33                               ` James Bottomley
  0 siblings, 1 reply; 166+ messages in thread
From: Hannes Reinecke @ 2011-03-30 14:55 UTC (permalink / raw)
  To: James Bottomley
  Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On 03/30/2011 04:26 PM, James Bottomley wrote:
> On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote:
>> On 03/30/2011 04:02 PM, James Bottomley wrote:
>>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
>>>> No, seriously. I think it would be good to have a separate slot
>>>> discussing DCB (be it FCoE or iSCSI) and cgroups.
>>>> And how to best align these things.
>>>
>>> OK, I'll go for that ... Data Centre Bridging; experiences, technologies
>>> and needs ... something like that.  What about virtualisation and open
>>> vSwitch?
>>>
>> Hmm. Not qualified enough to talk about the latter; I was more
>> envisioning the storage-related aspects here (multiqueue mapping,
>> QoS classes etc). With virtualisation and open vSwitch we're more in
>> the network side of things; doubt open vSwitch can do DCB.
>> And even if it could, virtio certainly can't :-)
>
> Technically, the topic DCB is about Data Centre Ethernet enhancements
> and converged networks ... that's why it's naturally allied to virtual
> switching.
>
> I was thinking we might put up a panel of vendors to get us all an
> education on the topic ...
>
Oh, but gladly.
Didn't know we had some at the LSF.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 14:55                             ` Hannes Reinecke
@ 2011-03-30 15:33                               ` James Bottomley
  2011-03-30 15:46                                   ` Shyam_Iyer
  2011-03-30 20:32                                 ` Giridhar Malavali
  0 siblings, 2 replies; 166+ messages in thread
From: James Bottomley @ 2011-03-30 15:33 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On Wed, 2011-03-30 at 16:55 +0200, Hannes Reinecke wrote:
> On 03/30/2011 04:26 PM, James Bottomley wrote:
> > On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote:
> >> On 03/30/2011 04:02 PM, James Bottomley wrote:
> >>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
> >>>> No, seriously. I think it would be good to have a separate slot
> >>>> discussing DCB (be it FCoE or iSCSI) and cgroups.
> >>>> And how to best align these things.
> >>>
> >>> OK, I'll go for that ... Data Centre Bridging; experiences, technologies
> >>> and needs ... something like that.  What about virtualisation and open
> >>> vSwitch?
> >>>
> >> Hmm. Not qualified enough to talk about the latter; I was more
> >> envisioning the storage-related aspects here (multiqueue mapping,
> >> QoS classes etc). With virtualisation and open vSwitch we're more in
> >> the network side of things; doubt open vSwitch can do DCB.
> >> And even if it could, virtio certainly can't :-)
> >
> > Technically, the topic DCB is about Data Centre Ethernet enhancements
> > and converged networks ... that's why it's naturally allied to virtual
> > switching.
> >
> > I was thinking we might put up a panel of vendors to get us all an
> > education on the topic ...
> >
> Oh, but gladly.
> Didn't know we had some at the LSF.

OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and
Emulex (James Smart) but any other attending vendors who want to pitch
in, send me an email and I'll add you.

James



^ permalink raw reply	[flat|nested] 166+ messages in thread

* IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-03-30  4:18   ` Dave Chinner
@ 2011-03-30 15:37     ` Vivek Goyal
  2011-03-30 22:20       ` Dave Chinner
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-03-30 15:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chad Talbott, James Bottomley, lsf, linux-fsdevel

On Wed, Mar 30, 2011 at 03:18:02PM +1100, Dave Chinner wrote:
> On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote:
> > I'd like to propose a discussion topic:
> > 
> > IO-less Dirty Throttling Considered Harmful...
> > 
> > to isolation and cgroup IO schedulers in general.
> 
> Why is that, exactly? The current writeback infrastructure isn't
> cgroup aware at all, so isn't that the problem you need to solve
> first?  i.e. how to delegate page cache writeback from
> one context to anotheri and account for it correctly?
> 
> Once you solve that problem, triggering cgroup specific writeback
> from the throttling code is the same regardless of whether we
> are doing IO directly from the throttling code or via a separate
> flusher thread. Hence I don't really understand why you think
> IO-less throttling is really a problem.

Dave,

We are planning to track the IO context of original submitter of IO
by storing that information in page_cgroup. So that is not the problem.

The problem google guys are trying to raise is that can a single flusher
thread keep all the groups on bdi busy in such a way so that higher
prio group can get more IO done. It should not happen that flusher
thread gets blocked somewhere (trying to get request descriptors on
request queue) or it tries to dispatch too much IO from an inode which
primarily contains pages from low prio cgroup and high prio cgroup
task does not get enough pages dispatched to device hence not getting
any prio over low prio group.

Currently we can do some IO in the context of writting process also
hence faster group can try to dispatch its own pages to bdi for writeout.
With IO less throttling, that notion will disappear.

So the concern they raised that is single flusher thread per device
is enough to keep faster cgroup full at the bdi and hence get the
service differentiation.

My take on this is that on slow SATA device it might be as long as
we make sure that flusher thread does not block on individual groups
and also try to select inodes intelligently (cgroup aware manner).

If it really becomes an issue on faster devices, will a flusher thread
per cgroup per bdi make sense.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* RE: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 15:33                               ` James Bottomley
@ 2011-03-30 15:46                                   ` Shyam_Iyer
  2011-03-30 20:32                                 ` Giridhar Malavali
  1 sibling, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-30 15:46 UTC (permalink / raw)
  To: James.Bottomley, hare; +Cc: linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler



> -----Original Message-----
> From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> Sent: Wednesday, March 30, 2011 11:34 AM
> To: Hannes Reinecke
> Cc: Iyer, Shyam; linux-scsi@vger.kernel.org; lsf@lists.linux-
> foundation.org; dm-devel@redhat.com; linux-fsdevel@vger.kernel.org;
> rwheeler@redhat.com
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On Wed, 2011-03-30 at 16:55 +0200, Hannes Reinecke wrote:
> > On 03/30/2011 04:26 PM, James Bottomley wrote:
> > > On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote:
> > >> On 03/30/2011 04:02 PM, James Bottomley wrote:
> > >>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
> > >>>> No, seriously. I think it would be good to have a separate slot
> > >>>> discussing DCB (be it FCoE or iSCSI) and cgroups.
> > >>>> And how to best align these things.
> > >>>
> > >>> OK, I'll go for that ... Data Centre Bridging; experiences,
> technologies
> > >>> and needs ... something like that.  What about virtualisation and
> open
> > >>> vSwitch?
> > >>>
> > >> Hmm. Not qualified enough to talk about the latter; I was more
> > >> envisioning the storage-related aspects here (multiqueue mapping,
> > >> QoS classes etc). With virtualisation and open vSwitch we're more
> in
> > >> the network side of things; doubt open vSwitch can do DCB.
> > >> And even if it could, virtio certainly can't :-)
> > >
> > > Technically, the topic DCB is about Data Centre Ethernet
> enhancements
> > > and converged networks ... that's why it's naturally allied to
> virtual
> > > switching.
> > >
> > > I was thinking we might put up a panel of vendors to get us all an
> > > education on the topic ...
> > >
> > Oh, but gladly.
> > Didn't know we had some at the LSF.
> 
> OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and
> Emulex (James Smart) but any other attending vendors who want to pitch
> in, send me an email and I'll add you.
> 
> James
> 
Excellent.
I would probably volunteer Giridhar(Qlogic) as well looking at the list of attendees as some of the CNA implementations vary.. 

-Shyam


^ permalink raw reply	[flat|nested] 166+ messages in thread

* RE: [Lsf] Preliminary Agenda and Activities for LSF
@ 2011-03-30 15:46                                   ` Shyam_Iyer
  0 siblings, 0 replies; 166+ messages in thread
From: Shyam_Iyer @ 2011-03-30 15:46 UTC (permalink / raw)
  To: James.Bottomley, hare; +Cc: linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler



> -----Original Message-----
> From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> Sent: Wednesday, March 30, 2011 11:34 AM
> To: Hannes Reinecke
> Cc: Iyer, Shyam; linux-scsi@vger.kernel.org; lsf@lists.linux-
> foundation.org; dm-devel@redhat.com; linux-fsdevel@vger.kernel.org;
> rwheeler@redhat.com
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On Wed, 2011-03-30 at 16:55 +0200, Hannes Reinecke wrote:
> > On 03/30/2011 04:26 PM, James Bottomley wrote:
> > > On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote:
> > >> On 03/30/2011 04:02 PM, James Bottomley wrote:
> > >>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
> > >>>> No, seriously. I think it would be good to have a separate slot
> > >>>> discussing DCB (be it FCoE or iSCSI) and cgroups.
> > >>>> And how to best align these things.
> > >>>
> > >>> OK, I'll go for that ... Data Centre Bridging; experiences,
> technologies
> > >>> and needs ... something like that.  What about virtualisation and
> open
> > >>> vSwitch?
> > >>>
> > >> Hmm. Not qualified enough to talk about the latter; I was more
> > >> envisioning the storage-related aspects here (multiqueue mapping,
> > >> QoS classes etc). With virtualisation and open vSwitch we're more
> in
> > >> the network side of things; doubt open vSwitch can do DCB.
> > >> And even if it could, virtio certainly can't :-)
> > >
> > > Technically, the topic DCB is about Data Centre Ethernet
> enhancements
> > > and converged networks ... that's why it's naturally allied to
> virtual
> > > switching.
> > >
> > > I was thinking we might put up a panel of vendors to get us all an
> > > education on the topic ...
> > >
> > Oh, but gladly.
> > Didn't know we had some at the LSF.
> 
> OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and
> Emulex (James Smart) but any other attending vendors who want to pitch
> in, send me an email and I'll add you.
> 
> James
> 
Excellent.
I would probably volunteer Giridhar(Qlogic) as well looking at the list of attendees as some of the CNA implementations vary.. 

-Shyam


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-29 22:13     ` Minchan Kim
  2011-03-29 23:12       ` Andrea Arcangeli
@ 2011-03-30 16:17       ` Mel Gorman
  2011-03-30 16:49         ` Andrea Arcangeli
  2011-03-30 16:59         ` Dan Magenheimer
  1 sibling, 2 replies; 166+ messages in thread
From: Mel Gorman @ 2011-03-30 16:17 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrea Arcangeli, lsf, linux-mm

On Wed, Mar 30, 2011 at 07:13:42AM +0900, Minchan Kim wrote:
> On Wed, Mar 30, 2011 at 4:05 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > Hi Rik, Hugh and everyone,
> >
> > On Tue, Mar 29, 2011 at 11:35:09AM -0400, Rik van Riel wrote:
> >> On 03/29/2011 12:36 AM, James Bottomley wrote:
> >> > Hi All,
> >> >
> >> > Since LSF is less than a week away, the programme committee put together
> >> > a just in time preliminary agenda for LSF.  As you can see there is
> >> > still plenty of empty space, which you can make suggestions
> >>
> >> There have been a few patches upstream by people for who
> >> page allocation latency is a concern.
> >>
> >> It may be worthwhile to have a short discussion on what
> >> we can do to keep page allocation (and direct reclaim?)
> >> latencies down to a minimum, reducing the slowdown that
> >> direct reclaim introduces on some workloads.
> >
> > I don't see the patches you refer to, but checking schedule we've a
> > slot with Mel&Minchan about "Reclaim, compaction and LRU
> > ordering". Compaction only applies to high order allocations and it
> > changes nothing to PAGE_SIZE allocations, but it surely has lower
> > latency than the older lumpy reclaim logic so overall it should be a
> > net improvement compared to what we had before.
> >
> > Should the latency issues be discussed in that track?
> 
> It's okay to me. LRU ordering issue wouldn't take much time.
> But I am not sure Mel would have a long time. :)
> 

What might be worth discussing on LRU ordering is encountering dirty pages
at the end of the LRU. This is a long-standing issues and patches have been
merged to mitigate the problem since the last LSF/MM. For example [e11da5b4:
tracing, vmscan: add trace events for LRU list shrinking] was the beginning
of a series that added some tracing around catching when this happened
and to mitigate it somewhat (at least according to the report included in
that changelog).

This happened since the last LSF/MM so it might be worth re-discussing if the
dirty-pages-at-end-of-LRU has mitigated somewhat. The last major bug
report that I'm aware of in that area was due to compaction rather than
reclaim but that could just mean people have given up raising the issue.

A trickier subject on LRU ordering is to consider if we are recycling
pages through the LRU too aggressively and aging too quickly. There have
been some patches in this area recently but it's not really clear if we
are happy with how the LRU lists age at the moment.

> About reclaim latency, I sent a patch in the old days.
> http://marc.info/?l=linux-mm&m=129187231129887&w=4
> 

Andy Whitcroft also posted patches ages ago that were related to lumpy reclaim
which would capture high-order pages being reclaimed for the exclusive use
of the reclaimer. It was never shown to be necessary though. I'll read this
thread in a bit because I'm curious to see why it came up now.

> And some guys on embedded had a concern about latency.
> They want OOM rather than eviction of working set and undeterministic
> latency of reclaim.
> 
> As another issue of related to latency, there is a OOM.
> To accelerate task's exit, we raise a priority of the victim process
> but it had a problem so Kosaki decided reverting the patch. It's
> totally related to latency issue but it would
> 

I think we should be very wary of conflating OOM latency, reclaim latency and
allocation latency as they are very different things with different causes.

> In addition, Kame and I sent a patch to prevent forkbomb. Kame's
> apprach is to track the history of mm and mine is to use sysrq to kill
> recently created tasks. The approaches have pros and cons.
> But anyone seem to not has a interest about forkbomb protection.
> So I want to listen other's opinion we really need it
> 
> I am not sure this could become a topic of LSF/MM
> If it is proper, I would like to talk above issues in "Reclaim,
> compaction and LRU ordering" slot.
> 

I'd prefer to see OOM-related issues treated as a separate-but-related
problem if possible so;

1. LRU ordering - are we aging pages properly or recycling through the
   list too aggressively? The high_wmark*8 change made recently was
   partially about list rotations and the associated cost so it might
   be worth listing out whatever issues people are currently aware of.
2. LRU ordering - dirty pages at the end of the LRU. Are we still going
   the right direction on this or is it still a shambles?
3. Compaction latency, other issues (IRQ disabling latency was the last
   one I'm aware of)
4. OOM killing and OOM latency - Whole load of churn going on in there.

?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-30 16:17       ` Mel Gorman
@ 2011-03-30 16:49         ` Andrea Arcangeli
  2011-03-31  0:42           ` Hugh Dickins
  2011-03-31  9:30           ` Mel Gorman
  2011-03-30 16:59         ` Dan Magenheimer
  1 sibling, 2 replies; 166+ messages in thread
From: Andrea Arcangeli @ 2011-03-30 16:49 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Minchan Kim, lsf, linux-mm

Hi Mel,

On Wed, Mar 30, 2011 at 05:17:16PM +0100, Mel Gorman wrote:
> Andy Whitcroft also posted patches ages ago that were related to lumpy reclaim
> which would capture high-order pages being reclaimed for the exclusive use
> of the reclaimer. It was never shown to be necessary though. I'll read this
> thread in a bit because I'm curious to see why it came up now.

Ok ;).

About lumpy I wouldn't spend too much on lumpy, I'd rather spend on
other issues like the one you mentioned on lru ordering, and
compaction (compaction in kswapd has still an unknown solution, my
last attempt failed and we backed off to no compaction in kswapd which
is safe but doesn't help for GFP_ATOMIC order > 0).

Lumpy should go away in a few releases IIRC.

> I think we should be very wary of conflating OOM latency, reclaim latency and
> allocation latency as they are very different things with different causes.

I think it's better to stick to successful allocation latencies only
here, or at most -ENOMEM from order > 0 which normally never happens
with compaction (not the time it takes before declaring OOM and
triggering the oom killer).

> I'd prefer to see OOM-related issues treated as a separate-but-related
> problem if possible so;
> 
> 1. LRU ordering - are we aging pages properly or recycling through the
>    list too aggressively? The high_wmark*8 change made recently was
>    partially about list rotations and the associated cost so it might
>    be worth listing out whatever issues people are currently aware of.
> 2. LRU ordering - dirty pages at the end of the LRU. Are we still going
>    the right direction on this or is it still a shambles?
> 3. Compaction latency, other issues (IRQ disabling latency was the last
>    one I'm aware of)
> 4. OOM killing and OOM latency - Whole load of churn going on in there.

I prefer it too. The OOM killing is already covered in OOM topic from
Hugh, and we can add "OOM detection latency" to it.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* RE: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-30 16:17       ` Mel Gorman
  2011-03-30 16:49         ` Andrea Arcangeli
@ 2011-03-30 16:59         ` Dan Magenheimer
  1 sibling, 0 replies; 166+ messages in thread
From: Dan Magenheimer @ 2011-03-30 16:59 UTC (permalink / raw)
  To: Mel Gorman, Minchan Kim; +Cc: lsf, linux-mm

> 1. LRU ordering - are we aging pages properly or recycling through the
>    list too aggressively? The high_wmark*8 change made recently was
>    partially about list rotations and the associated cost so it might
>    be worth listing out whatever issues people are currently aware of.

Here's one: zcache (and tmem RAMster and SSmem) is essentially a level2
cache for clean page cache pages that have been reclaimed.  (Or
more precisely, the pageFRAME has been reclaimed, but the contents
has been squirreled away in zcache.)

Just like the active/inactive lists, ideally, you'd like to ensure
zcache gets filled with pages that have some probability of being used
in the future, not pages you KNOW won't be used in the future but
have left on the inactive list to rot until they are reclaimed.

There's also a sizing issue... under memory pressure, pages in
active/inactive have different advantages/disadvantages vs
pages in zcache/etc... What tuning knobs exist already?

I hacked a (non-upstreamable) patch to only "put" clean pages
that had been previously in active, to play with this a bit but
didn't pursue it.

Anyway, would like to include this in the above discussion.

Thanks,
Dan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 15:33                               ` James Bottomley
  2011-03-30 15:46                                   ` Shyam_Iyer
@ 2011-03-30 20:32                                 ` Giridhar Malavali
  2011-03-30 20:45                                   ` James Bottomley
  1 sibling, 1 reply; 166+ messages in thread
From: Giridhar Malavali @ 2011-03-30 20:32 UTC (permalink / raw)
  To: James Bottomley, Hannes Reinecke
  Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler



>>

>On Wed, 2011-03-30 at 16:55 +0200, Hannes Reinecke wrote:
>> On 03/30/2011 04:26 PM, James Bottomley wrote:
>> > On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote:
>> >> On 03/30/2011 04:02 PM, James Bottomley wrote:
>> >>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
>> >>>> No, seriously. I think it would be good to have a separate slot
>> >>>> discussing DCB (be it FCoE or iSCSI) and cgroups.
>> >>>> And how to best align these things.
>> >>>
>> >>> OK, I'll go for that ... Data Centre Bridging; experiences,
>>technologies
>> >>> and needs ... something like that.  What about virtualisation and
>>open
>> >>> vSwitch?
>> >>>
>> >> Hmm. Not qualified enough to talk about the latter; I was more
>> >> envisioning the storage-related aspects here (multiqueue mapping,
>> >> QoS classes etc). With virtualisation and open vSwitch we're more in
>> >> the network side of things; doubt open vSwitch can do DCB.
>> >> And even if it could, virtio certainly can't :-)
>> >
>> > Technically, the topic DCB is about Data Centre Ethernet enhancements
>> > and converged networks ... that's why it's naturally allied to virtual
>> > switching.
>> >
>> > I was thinking we might put up a panel of vendors to get us all an
>> > education on the topic ...
>> >
>> Oh, but gladly.
>> Didn't know we had some at the LSF.
>
>OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and
>Emulex (James Smart) but any other attending vendors who want to pitch
>in, send me an email and I'll add you.

Can u please add me for this.

-- Giridhar



>
>James
>
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


This message and any attached documents contain information from QLogic Corporation or its wholly-owned subsidiaries that may be confidential. If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message.


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 20:32                                 ` Giridhar Malavali
@ 2011-03-30 20:45                                   ` James Bottomley
  0 siblings, 0 replies; 166+ messages in thread
From: James Bottomley @ 2011-03-30 20:45 UTC (permalink / raw)
  To: Giridhar Malavali
  Cc: Hannes Reinecke, Shyam_Iyer, linux-scsi, lsf, dm-devel,
	linux-fsdevel, rwheeler

On Wed, 2011-03-30 at 13:32 -0700, Giridhar Malavali wrote:
> >OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and
> >Emulex (James Smart) but any other attending vendors who want to pitch
> >in, send me an email and I'll add you.
> 
> Can u please add me for this.

I already did.  (The agenda web actually updates about 5 minutes behind
the driving spreadsheet).

James



^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30  2:17     ` Dave Chinner
  2011-03-30 11:13       ` Theodore Tso
@ 2011-03-30 21:49       ` Mingming Cao
  2011-03-31  0:05         ` Matthew Wilcox
  2011-03-31  1:00         ` Joel Becker
  1 sibling, 2 replies; 166+ messages in thread
From: Mingming Cao @ 2011-03-30 21:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, James Bottomley, lsf, linux-fsdevel, linux-scsi,
	device-mapper development

On Wed, 2011-03-30 at 13:17 +1100, Dave Chinner wrote:
> On Tue, Mar 29, 2011 at 05:33:30PM -0700, Mingming Cao wrote:
> > Ric,
> > 
> > May I propose some discussion about concurrent direct IO support for
> > ext4?
> 
> Just look at the way XFS does it and copy that?  i.e. it has a
> filesytem level IO lock and an inode lock both with shared/exclusive
> semantics. These lie below the i_mutex (i.e. locking order is
> i_mutex, i_iolock, i_ilock), and effectively result in the i_mutex
> only being used for VFS level synchronisation and as such is rarely
> used inside XFS itself.
> 
> Inode attribute operations are protected by the inode lock, while IO
> operations and truncation synchronisation is provided by the IO
> lock.
> 

Right, inode attribute operations should be covered by the i_lock. in
ext4 the i_mutex is used to protect IO and truncation synch, along with
the i_datasem to pretect concurrent access.modification to file's
allocation. 

> So for buffered IO, the IO lock is used in shared mode for reads
> and exclusive mode for writes. This gives normal POSIX buffered IO
> semantics and holding the IO lock exclusive allows sycnhronisation
> against new IO of any kind for truncate.
> 
> For direct IO, the IO lock is always taken in shared mode, so we can
> have concurrent read and write operations taking place at once
> regardless of the offset into the file.
> 

thanks for reminding me,in xfs concurrent direct IO write to the same
offset is allowed.

> > I am looking for some discussion about removing the i_mutex lock in the
> > direct IO write code path for ext4, when multiple threads are
> > direct write to different offset of the same file. This would require
> > some way to track the in-fly DIO IO range, either done at ext4 level or
> > above th vfs layer. 
> 
> Direct IO semantics have always been that the application is allowed
> to overlap IO to the same range if it wants to. The result is
> undefined (just like issuing overlapping reads and writes to a disk
> at the same time) so it's the application's responsibility to avoid
> overlapping IO if it is a problem.
> 


I was thinking along the line to provide finer granularity lock to allow
concurrent direct IO to different offset/range, but to same offset, they
have to be serialized. If it's undefined behavior, i.e. overlapping is
allowed, then concurrent dio implementation is much easier. But not sure
if any apps currently using DIO aware of the ordering has to be done at
the application level. 

> Cheers,
> 
> Dave.



^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-03-30 15:37     ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) Vivek Goyal
@ 2011-03-30 22:20       ` Dave Chinner
  2011-03-30 22:49         ` Chad Talbott
  2011-03-31 14:16         ` Vivek Goyal
  0 siblings, 2 replies; 166+ messages in thread
From: Dave Chinner @ 2011-03-30 22:20 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Chad Talbott, James Bottomley, lsf, linux-fsdevel

On Wed, Mar 30, 2011 at 11:37:57AM -0400, Vivek Goyal wrote:
> On Wed, Mar 30, 2011 at 03:18:02PM +1100, Dave Chinner wrote:
> > On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote:
> > > I'd like to propose a discussion topic:
> > > 
> > > IO-less Dirty Throttling Considered Harmful...
> > > 
> > > to isolation and cgroup IO schedulers in general.
> > 
> > Why is that, exactly? The current writeback infrastructure isn't
> > cgroup aware at all, so isn't that the problem you need to solve
> > first?  i.e. how to delegate page cache writeback from
> > one context to anotheri and account for it correctly?
> > 
> > Once you solve that problem, triggering cgroup specific writeback
> > from the throttling code is the same regardless of whether we
> > are doing IO directly from the throttling code or via a separate
> > flusher thread. Hence I don't really understand why you think
> > IO-less throttling is really a problem.
> 
> Dave,
> 
> We are planning to track the IO context of original submitter of IO
> by storing that information in page_cgroup. So that is not the problem.
> 
> The problem google guys are trying to raise is that can a single flusher
> thread keep all the groups on bdi busy in such a way so that higher
> prio group can get more IO done.

Which has nothing to do with IO-less dirty throttling at all!

> It should not happen that flusher
> thread gets blocked somewhere (trying to get request descriptors on
> request queue)

A major design principle of the bdi-flusher threads is that they
are supposed to block when the request queue gets full - that's how
we got rid of all the congestion garbage from the writeback
stack.

There are plans to move the bdi-flusher threads to work queues, and
once that is done all your concerns about blocking and parallelism
are pretty much gone because it's trivial to have multiple writeback
works in progress at once on the same bdi with that infrastructure.

> or it tries to dispatch too much IO from an inode which
> primarily contains pages from low prio cgroup and high prio cgroup
> task does not get enough pages dispatched to device hence not getting
> any prio over low prio group.

That's a writeback scheduling issue independent of how we throttle,
and something we don't do at all right now. Our only decision on
what to write back is based on how low ago the inode was dirtied.
You need to completely rework the dirty inode tracking if you want
to efficiently prioritise writeback between different groups.

Given that filesystems don't all use the VFS dirty inode tracking
infrastructure and specific filesystems have different ideas of the
order of writeback, you've got a really difficult problem there.
e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
purposes which will completely screw any sort of prioritised
writeback. Remember the ext3 "fsync = global sync" latency problems?

> Currently we can do some IO in the context of writting process also
> hence faster group can try to dispatch its own pages to bdi for writeout.
> With IO less throttling, that notion will disappear.

We'll stil do exactly the same amount of throttling - what we write
back is still the same decision, just made in a different place with
a different trigger.

> So the concern they raised that is single flusher thread per device
> is enough to keep faster cgroup full at the bdi and hence get the
> service differentiation.

I think there's much bigger problems than that.

> My take on this is that on slow SATA device it might be as long as
> we make sure that flusher thread does not block on individual groups

I don't think you can ever guarantee that - e.g. Delayed allocation
will need metadata to be read from disk to perform the allocation
so preventing blocking is impossible. Besides, see above about using
work queues rather than threads for flushing.

> and also try to select inodes intelligently (cgroup aware manner).

Such selection algorithms would need to be able to handle hundreds
of thousands of newly dirtied inodes per second so sorting and
selecting them efficiently will be a major issue...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-03-30 22:20       ` Dave Chinner
@ 2011-03-30 22:49         ` Chad Talbott
  2011-03-31  3:00           ` Dave Chinner
  2011-03-31 14:16         ` Vivek Goyal
  1 sibling, 1 reply; 166+ messages in thread
From: Chad Talbott @ 2011-03-30 22:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel

On Wed, Mar 30, 2011 at 3:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Mar 30, 2011 at 11:37:57AM -0400, Vivek Goyal wrote:
>> We are planning to track the IO context of original submitter of IO
>> by storing that information in page_cgroup. So that is not the problem.
>>
>> The problem google guys are trying to raise is that can a single flusher
>> thread keep all the groups on bdi busy in such a way so that higher
>> prio group can get more IO done.
>
> Which has nothing to do with IO-less dirty throttling at all!

Not quite.  Pre IO-less dirty throttling, any thread which was
dirtying did the writeback itself.  Because there's no shortage of
threads to do the work, the IO scheduler sees a bunch of threads doing
writes against a given BDI and schedules them against each other.
This is how async IO isolation works for us.

>> It should not happen that flusher
>> thread gets blocked somewhere (trying to get request descriptors on
>> request queue)
>
> A major design principle of the bdi-flusher threads is that they
> are supposed to block when the request queue gets full - that's how
> we got rid of all the congestion garbage from the writeback
> stack.

With IO cgroups and async write isolation, there are multiple queues
per disk that all need to be filled to allow cgroup-aware CFQ schedule
between them.  If the per-BDI threads could be taught to fill each
per-cgroup queue before giving up on a BDI, then IO-less throttling
could work.  Also, having per-(BDI, blkio cgroup)-flusher threads
would work.  I think it's complicated enough to warrant a discussion.

> There are plans to move the bdi-flusher threads to work queues, and
> once that is done all your concerns about blocking and parallelism
> are pretty much gone because it's trivial to have multiple writeback
> works in progress at once on the same bdi with that infrastructure.

This sounds promising.

>> So the concern they raised that is single flusher thread per device
>> is enough to keep faster cgroup full at the bdi and hence get the
>> service differentiation.
>
> I think there's much bigger problems than that.

We seem to be agreeing that it's a complicated problem.  That's why I
think async write isolation needs some design-level discussion.

Chad

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 21:49       ` Mingming Cao
@ 2011-03-31  0:05         ` Matthew Wilcox
  2011-03-31  1:00         ` Joel Becker
  1 sibling, 0 replies; 166+ messages in thread
From: Matthew Wilcox @ 2011-03-31  0:05 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi, device-mapper development

On Wed, Mar 30, 2011 at 02:49:58PM -0700, Mingming Cao wrote:
> > Direct IO semantics have always been that the application is allowed
> > to overlap IO to the same range if it wants to. The result is
> > undefined (just like issuing overlapping reads and writes to a disk
> > at the same time) so it's the application's responsibility to avoid
> > overlapping IO if it is a problem.
> 
> I was thinking along the line to provide finer granularity lock to allow
> concurrent direct IO to different offset/range, but to same offset, they
> have to be serialized. If it's undefined behavior, i.e. overlapping is
> allowed, then concurrent dio implementation is much easier. But not sure
> if any apps currently using DIO aware of the ordering has to be done at
> the application level. 

Yes, they're aware of it.  And they consider it a bug if they ever do
concurrent I/O to the same sector.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-30 16:49         ` Andrea Arcangeli
@ 2011-03-31  0:42           ` Hugh Dickins
  2011-03-31 15:15             ` Andrea Arcangeli
  2011-03-31  9:30           ` Mel Gorman
  1 sibling, 1 reply; 166+ messages in thread
From: Hugh Dickins @ 2011-03-31  0:42 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Mel Gorman, Minchan Kim, lsf, linux-mm

On Wed, Mar 30, 2011 at 9:49 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Wed, Mar 30, 2011 at 05:17:16PM +0100, Mel Gorman wrote:
>> I'd prefer to see OOM-related issues treated as a separate-but-related
>> problem if possible so;
>
> I prefer it too. The OOM killing is already covered in OOM topic from
> Hugh, and we can add "OOM detection latency" to it.

Thanks for adjusting and updating the schedule, Andrea.  I'm way
behind in my mailbox and everything else, that was a real help.

But last night I did remove that OOM and fork-bomb topic you
mischievously added in my name ;-)  Yes, I did propose an OOM topic
against my name in the working list I sent you a few days ago, but by
Monday had concluded that it would be pretty silly for me to get up
and spout the few things I have to say about it, in the absence of
every one of the people most closely involved and experienced.  And on
fork-bombs I've even less to say.

Of course, none of these sessions are for those named facilitators to
lecture the assembled company for half an hour.  We can bring it back
it there's demand on the day: but right now I'd prefer to keep it as
an empty slot, to be decided when the time comes.  After all, those FS
people, they appear to thrive on empty slots!

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 21:49       ` Mingming Cao
  2011-03-31  0:05         ` Matthew Wilcox
@ 2011-03-31  1:00         ` Joel Becker
  2011-04-01 21:34           ` Mingming Cao
  1 sibling, 1 reply; 166+ messages in thread
From: Joel Becker @ 2011-03-31  1:00 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi, device-mapper development

On Wed, Mar 30, 2011 at 02:49:58PM -0700, Mingming Cao wrote:
> On Wed, 2011-03-30 at 13:17 +1100, Dave Chinner wrote:
> > For direct IO, the IO lock is always taken in shared mode, so we can
> > have concurrent read and write operations taking place at once
> > regardless of the offset into the file.
> > 
> 
> thanks for reminding me,in xfs concurrent direct IO write to the same
> offset is allowed.

	ocfs2 as well, with the same sort of strategem (including across
the cluster).

> > Direct IO semantics have always been that the application is allowed
> > to overlap IO to the same range if it wants to. The result is
> > undefined (just like issuing overlapping reads and writes to a disk
> > at the same time) so it's the application's responsibility to avoid
> > overlapping IO if it is a problem.
> > 
> 
> I was thinking along the line to provide finer granularity lock to allow
> concurrent direct IO to different offset/range, but to same offset, they
> have to be serialized. If it's undefined behavior, i.e. overlapping is
> allowed, then concurrent dio implementation is much easier. But not sure
> if any apps currently using DIO aware of the ordering has to be done at
> the application level. 

	Oh dear God no.  One of the major DIO use cases is to tell the
kernel, "I know I won't do that, so don't spend any effort protecting
me."

Joel

-- 

"I don't want to achieve immortality through my work; I want to
 achieve immortality through not dying."
        - Woody Allen

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-03-30 22:49         ` Chad Talbott
@ 2011-03-31  3:00           ` Dave Chinner
  0 siblings, 0 replies; 166+ messages in thread
From: Dave Chinner @ 2011-03-31  3:00 UTC (permalink / raw)
  To: Chad Talbott; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel

On Wed, Mar 30, 2011 at 03:49:17PM -0700, Chad Talbott wrote:
> On Wed, Mar 30, 2011 at 3:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Mar 30, 2011 at 11:37:57AM -0400, Vivek Goyal wrote:
> >> We are planning to track the IO context of original submitter of IO
> >> by storing that information in page_cgroup. So that is not the problem.
> >>
> >> The problem google guys are trying to raise is that can a single flusher
> >> thread keep all the groups on bdi busy in such a way so that higher
> >> prio group can get more IO done.
> >
> > Which has nothing to do with IO-less dirty throttling at all!
> 
> Not quite.  Pre IO-less dirty throttling, any thread which was
> dirtying did the writeback itself.  Because there's no shortage of
> threads to do the work, the IO scheduler sees a bunch of threads doing
> writes against a given BDI and schedules them against each other.
> This is how async IO isolation works for us.

And it's precisely this behaviour that makes foreground throttling a
scalability limitation, both from a list/lock contention POV and
from a IO optimisation POV.

> >> So the concern they raised that is single flusher thread per device
> >> is enough to keep faster cgroup full at the bdi and hence get the
> >> service differentiation.
> >
> > I think there's much bigger problems than that.
> 
> We seem to be agreeing that it's a complicated problem.  That's why I
> think async write isolation needs some design-level discussion.

From my perspeccctive, we've still got a significant amount of work
to get writeback into a scalable form for current generation
machines, let alone future machines.

Fixing the writeback code is a slow process because of all the
subtle interactions with different filesystems and different
workloads, whіch made more complex by the fact that many filesystems
implement their own writeback paths and have their own writeback
semantics. We need to make the right decision on what IO to issue,
not just issue lots of IO and hope it all turns out OK in the end.
If we can't get that decision matrix right for the simple case of a
global context, then we have no hope of extending it to cgroup-aware
writeback.

IOWs, we need to get writeback working in a scalable manner before
we complicate it immensely with all this cgroup and isolation
madness. Hence I think trying to make writeback cgroup-aware is
probably 6-12 months premature at this point and trying to do it now
will only serve to make it harder to get the common, simple cases
working as we desire them to...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-30 16:49         ` Andrea Arcangeli
  2011-03-31  0:42           ` Hugh Dickins
@ 2011-03-31  9:30           ` Mel Gorman
  2011-03-31 16:36             ` Andrea Arcangeli
  1 sibling, 1 reply; 166+ messages in thread
From: Mel Gorman @ 2011-03-31  9:30 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Minchan Kim, lsf, linux-mm

On Wed, Mar 30, 2011 at 06:49:06PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Wed, Mar 30, 2011 at 05:17:16PM +0100, Mel Gorman wrote:
> > Andy Whitcroft also posted patches ages ago that were related to lumpy reclaim
> > which would capture high-order pages being reclaimed for the exclusive use
> > of the reclaimer. It was never shown to be necessary though. I'll read this
> > thread in a bit because I'm curious to see why it came up now.
> 
> Ok ;).
> 
> About lumpy I wouldn't spend too much on lumpy,

I hadn't intended to but the context of the capture patches was lumpy so
it'd be the starting point for anyone looking at the old patches.  If someone
wanted to go in that direction, it would need to be adapted for compaction,
reclaim/compaction and reclaim.

> I'd rather spend on
> other issues like the one you mentioned on lru ordering, and
> compaction (compaction in kswapd has still an unknown solution, my
> last attempt failed and we backed off to no compaction in kswapd which
> is safe but doesn't help for GFP_ATOMIC order > 0).
> 

Agreed. It may also be worth a quick discussion on *how* people are currently
evauating their reclaim-related changes be it via tracepoints, systemtap,
a patched kernel or indirect measures such as faults.

> Lumpy should go away in a few releases IIRC.
> 
> > I think we should be very wary of conflating OOM latency, reclaim latency and
> > allocation latency as they are very different things with different causes.
> 
> I think it's better to stick to successful allocation latencies only
> here, or at most -ENOMEM from order > 0 which normally never happens
> with compaction (not the time it takes before declaring OOM and
> triggering the oom killer).
> 

Sounds reasonable. I could discuss briefly the scripts I use based on ftrace
that dump out highorder allocation latencies as it might be useful to others
if this is the area they are looking at.

> > I'd prefer to see OOM-related issues treated as a separate-but-related
> > problem if possible so;
> > 
> > 1. LRU ordering - are we aging pages properly or recycling through the
> >    list too aggressively? The high_wmark*8 change made recently was
> >    partially about list rotations and the associated cost so it might
> >    be worth listing out whatever issues people are currently aware of.
> > 2. LRU ordering - dirty pages at the end of the LRU. Are we still going
> >    the right direction on this or is it still a shambles?
> > 3. Compaction latency, other issues (IRQ disabling latency was the last
> >    one I'm aware of)
> > 4. OOM killing and OOM latency - Whole load of churn going on in there.
> 
> I prefer it too. The OOM killing is already covered in OOM topic from
> Hugh, and we can add "OOM detection latency" to it.
> 

Also sounds good to me.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-03-30 22:20       ` Dave Chinner
  2011-03-30 22:49         ` Chad Talbott
@ 2011-03-31 14:16         ` Vivek Goyal
  2011-03-31 14:34           ` Chris Mason
  2011-03-31 14:50           ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Greg Thelen
  1 sibling, 2 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-03-31 14:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chad Talbott, James Bottomley, lsf, linux-fsdevel

On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:

[..]
> > It should not happen that flusher
> > thread gets blocked somewhere (trying to get request descriptors on
> > request queue)
> 
> A major design principle of the bdi-flusher threads is that they
> are supposed to block when the request queue gets full - that's how
> we got rid of all the congestion garbage from the writeback
> stack.

Instead of blocking flusher threads, can they voluntarily stop submitting
more IO when they realize too much IO is in progress. We aready keep
stats of how much IO is under writeback on bdi (BDI_WRITEBACK) and
flusher tread can use that?

Jens mentioned this idea of how about getting rid of this request accounting
at request queue level and move it somewhere up say at bdi level.

> 
> There are plans to move the bdi-flusher threads to work queues, and
> once that is done all your concerns about blocking and parallelism
> are pretty much gone because it's trivial to have multiple writeback
> works in progress at once on the same bdi with that infrastructure.

Will this essentially not nullify the advantage of IO less throttling?
I thought that we did not want have multiple threads doing writeback
at the same time to avoid number of seeks and achieve better throughput.

Now with this I am assuming that multiple work can be on progress doing
writeback. May be we can limit writeback work one per group so in global
context only one work will be active.

> 
> > or it tries to dispatch too much IO from an inode which
> > primarily contains pages from low prio cgroup and high prio cgroup
> > task does not get enough pages dispatched to device hence not getting
> > any prio over low prio group.
> 
> That's a writeback scheduling issue independent of how we throttle,
> and something we don't do at all right now. Our only decision on
> what to write back is based on how low ago the inode was dirtied.
> You need to completely rework the dirty inode tracking if you want
> to efficiently prioritise writeback between different groups.
> 
> Given that filesystems don't all use the VFS dirty inode tracking
> infrastructure and specific filesystems have different ideas of the
> order of writeback, you've got a really difficult problem there.
> e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
> purposes which will completely screw any sort of prioritised
> writeback. Remember the ext3 "fsync = global sync" latency problems?

Ok, so if one issues a fsync when filesystem is mounted in "data=ordered"
mode we will flush all the writes to disk before committing meta data.

I have no knowledge of filesystem code so here comes a stupid question.
Do multiple fsyncs get completely serialized or they can progress in
parallel? IOW, if a fsync is in progress and we slow down the writeback
of that inode's pages, can other fsync still make progress without
getting stuck behind the previous fsync?

For me knowing this is also important in another context of absolute IO
throttling.

- If a fsync is in progress and gets throttled at device, what impact it
  has on other file system operations. What gets serialized behind it. 

[..]
> > and also try to select inodes intelligently (cgroup aware manner).
> 
> Such selection algorithms would need to be able to handle hundreds
> of thousands of newly dirtied inodes per second so sorting and
> selecting them efficiently will be a major issue...

There was proposal of memory cgroup maintaining a per memory cgroup per
bdi structure which will keep a list of inodes which need writeback
from that cgroup.

So any cgroup looking for a writeback will queue up this structure on
bdi and flusher threads can walk though this list and figure out
which memory cgroups and which inodes within memory cgroup need to
be written back.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-03-31 14:16         ` Vivek Goyal
@ 2011-03-31 14:34           ` Chris Mason
  2011-03-31 22:14             ` Dave Chinner
  2011-03-31 22:25             ` Vivek Goyal
  2011-03-31 14:50           ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Greg Thelen
  1 sibling, 2 replies; 166+ messages in thread
From: Chris Mason @ 2011-03-31 14:34 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dave Chinner, Chad Talbott, James Bottomley, lsf, linux-fsdevel

Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400:
> On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
> 
> [..]
> > > It should not happen that flusher
> > > thread gets blocked somewhere (trying to get request descriptors on
> > > request queue)
> > 
> > A major design principle of the bdi-flusher threads is that they
> > are supposed to block when the request queue gets full - that's how
> > we got rid of all the congestion garbage from the writeback
> > stack.
> 
> Instead of blocking flusher threads, can they voluntarily stop submitting
> more IO when they realize too much IO is in progress. We aready keep
> stats of how much IO is under writeback on bdi (BDI_WRITEBACK) and
> flusher tread can use that?

We could, but the difficult part is keeping the hardware saturated as
requests complete.  The voluntarily stopping part is pretty much the
same thing the congestion code was trying to do.

> 
> Jens mentioned this idea of how about getting rid of this request accounting
> at request queue level and move it somewhere up say at bdi level.
> 
> > 
> > There are plans to move the bdi-flusher threads to work queues, and
> > once that is done all your concerns about blocking and parallelism
> > are pretty much gone because it's trivial to have multiple writeback
> > works in progress at once on the same bdi with that infrastructure.
> 
> Will this essentially not nullify the advantage of IO less throttling?
> I thought that we did not want have multiple threads doing writeback
> at the same time to avoid number of seeks and achieve better throughput.

Work queues alone are probably not appropriate, at least for spinning
storage.  It will introduce seeks into what would have been
sequential writes.  I had to make the btrfs worker thread pools after
having a lot of trouble cramming writeback into work queues.

> 
> Now with this I am assuming that multiple work can be on progress doing
> writeback. May be we can limit writeback work one per group so in global
> context only one work will be active.
> 
> > 
> > > or it tries to dispatch too much IO from an inode which
> > > primarily contains pages from low prio cgroup and high prio cgroup
> > > task does not get enough pages dispatched to device hence not getting
> > > any prio over low prio group.
> > 
> > That's a writeback scheduling issue independent of how we throttle,
> > and something we don't do at all right now. Our only decision on
> > what to write back is based on how low ago the inode was dirtied.
> > You need to completely rework the dirty inode tracking if you want
> > to efficiently prioritise writeback between different groups.
> > 
> > Given that filesystems don't all use the VFS dirty inode tracking
> > infrastructure and specific filesystems have different ideas of the
> > order of writeback, you've got a really difficult problem there.
> > e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
> > purposes which will completely screw any sort of prioritised
> > writeback. Remember the ext3 "fsync = global sync" latency problems?
> 
> Ok, so if one issues a fsync when filesystem is mounted in "data=ordered"
> mode we will flush all the writes to disk before committing meta data.
> 
> I have no knowledge of filesystem code so here comes a stupid question.
> Do multiple fsyncs get completely serialized or they can progress in
> parallel? IOW, if a fsync is in progress and we slow down the writeback
> of that inode's pages, can other fsync still make progress without
> getting stuck behind the previous fsync?

An fsync has two basic parts

1) write the file data pages
2a) flush data=ordered in reiserfs/ext34
2b) do the real transaction commit


We can do part one in parallel across any number of writers.  For part
two, there is only one running transaction.  If the FS is smart, the
commit will only force down the transaction that last modified the
file. 50 procs running fsync may only need to trigger one commit.

btrfs and xfs do data=ordered differently.  They still avoid exposing
stale data but we don't pull the plug on the whole bathtub for every
commit.  In the btrfs case, we don't update metadata until the data is
written, so commits never have to force data writes.  xfs does something
lighter weight but with similar benefits.

ext4 with delayed allocation on and data=ordered will only end up
forcing down writes that are not under delayed allocation.  This is a
much smaller subset of the IO than ext3/reiserfs will do.

> 
> For me knowing this is also important in another context of absolute IO
> throttling.
> 
> - If a fsync is in progress and gets throttled at device, what impact it
>   has on other file system operations. What gets serialized behind it. 

It depends.  atime updates log inodes and logging needs a transaction
and transactions sometimes need to wait for the last transaction to
finish.  So its very possible you'll make anything using the FS appear
to stop.

-chris

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-03-31 14:16         ` Vivek Goyal
  2011-03-31 14:34           ` Chris Mason
@ 2011-03-31 14:50           ` Greg Thelen
  2011-03-31 22:27             ` Dave Chinner
  1 sibling, 1 reply; 166+ messages in thread
From: Greg Thelen @ 2011-03-31 14:50 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Dave Chinner, James Bottomley, lsf, linux-fsdevel

On Thu, Mar 31, 2011 at 7:16 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
>
> [..]
>> > It should not happen that flusher
>> > thread gets blocked somewhere (trying to get request descriptors on
>> > request queue)
>>
>> A major design principle of the bdi-flusher threads is that they
>> are supposed to block when the request queue gets full - that's how
>> we got rid of all the congestion garbage from the writeback
>> stack.
>
> Instead of blocking flusher threads, can they voluntarily stop submitting
> more IO when they realize too much IO is in progress. We aready keep
> stats of how much IO is under writeback on bdi (BDI_WRITEBACK) and
> flusher tread can use that?
>
> Jens mentioned this idea of how about getting rid of this request accounting
> at request queue level and move it somewhere up say at bdi level.
>
>>
>> There are plans to move the bdi-flusher threads to work queues, and
>> once that is done all your concerns about blocking and parallelism
>> are pretty much gone because it's trivial to have multiple writeback
>> works in progress at once on the same bdi with that infrastructure.
>
> Will this essentially not nullify the advantage of IO less throttling?
> I thought that we did not want have multiple threads doing writeback
> at the same time to avoid number of seeks and achieve better throughput.
>
> Now with this I am assuming that multiple work can be on progress doing
> writeback. May be we can limit writeback work one per group so in global
> context only one work will be active.
>
>>
>> > or it tries to dispatch too much IO from an inode which
>> > primarily contains pages from low prio cgroup and high prio cgroup
>> > task does not get enough pages dispatched to device hence not getting
>> > any prio over low prio group.
>>
>> That's a writeback scheduling issue independent of how we throttle,
>> and something we don't do at all right now. Our only decision on
>> what to write back is based on how low ago the inode was dirtied.
>> You need to completely rework the dirty inode tracking if you want
>> to efficiently prioritise writeback between different groups.
>>
>> Given that filesystems don't all use the VFS dirty inode tracking
>> infrastructure and specific filesystems have different ideas of the
>> order of writeback, you've got a really difficult problem there.
>> e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
>> purposes which will completely screw any sort of prioritised
>> writeback. Remember the ext3 "fsync = global sync" latency problems?
>
> Ok, so if one issues a fsync when filesystem is mounted in "data=ordered"
> mode we will flush all the writes to disk before committing meta data.
>
> I have no knowledge of filesystem code so here comes a stupid question.
> Do multiple fsyncs get completely serialized or they can progress in
> parallel? IOW, if a fsync is in progress and we slow down the writeback
> of that inode's pages, can other fsync still make progress without
> getting stuck behind the previous fsync?
>
> For me knowing this is also important in another context of absolute IO
> throttling.
>
> - If a fsync is in progress and gets throttled at device, what impact it
>  has on other file system operations. What gets serialized behind it.
>
> [..]
>> > and also try to select inodes intelligently (cgroup aware manner).
>>
>> Such selection algorithms would need to be able to handle hundreds
>> of thousands of newly dirtied inodes per second so sorting and
>> selecting them efficiently will be a major issue...
>
> There was proposal of memory cgroup maintaining a per memory cgroup per
> bdi structure which will keep a list of inodes which need writeback
> from that cgroup.

FYI, I have patches which implement this per memcg per bdi dirty inode
list.  I want to debug a few issues before posting an RFC series.  But
it is getting close.

> So any cgroup looking for a writeback will queue up this structure on
> bdi and flusher threads can walk though this list and figure out
> which memory cgroups and which inodes within memory cgroup need to
> be written back.

The way these memcg-writeback patches are currently implemented is
that when a memcg is over background dirty limits, it will queue the
memcg a on a global over_bg_limit list and wakeup bdi flusher.  There
is no context (memcg or otherwise) given to the bdi flusher.  After
the bdi flusher checks system-wide background limits, it uses the
over_bg_limit list to find (and rotate) an over limit memcg.  Using
the memcg, then the per memcg per bdi dirty inode list is walked to
find inode pages to writeback.  Once the memcg dirty memory usage
drops below the memcg-thresh, the memcg is removed from the global
over_bg_limit list.

> Thanks
> Vivek
> _______________________________________________
> Lsf mailing list
> Lsf@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/lsf
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-31  0:42           ` Hugh Dickins
@ 2011-03-31 15:15             ` Andrea Arcangeli
  0 siblings, 0 replies; 166+ messages in thread
From: Andrea Arcangeli @ 2011-03-31 15:15 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Mel Gorman, Minchan Kim, lsf, linux-mm

On Wed, Mar 30, 2011 at 05:42:15PM -0700, Hugh Dickins wrote:
> On Wed, Mar 30, 2011 at 9:49 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > On Wed, Mar 30, 2011 at 05:17:16PM +0100, Mel Gorman wrote:
> >> I'd prefer to see OOM-related issues treated as a separate-but-related
> >> problem if possible so;
> >
> > I prefer it too. The OOM killing is already covered in OOM topic from
> > Hugh, and we can add "OOM detection latency" to it.
> 
> Thanks for adjusting and updating the schedule, Andrea.  I'm way
> behind in my mailbox and everything else, that was a real help.

Glad I could help.

> But last night I did remove that OOM and fork-bomb topic you
> mischievously added in my name ;-)  Yes, I did propose an OOM topic
> against my name in the working list I sent you a few days ago, but by
> Monday had concluded that it would be pretty silly for me to get up
> and spout the few things I have to say about it, in the absence of
> every one of the people most closely involved and experienced.  And on
> fork-bombs I've even less to say.
>
> Of course, none of these sessions are for those named facilitators to
> lecture the assembled company for half an hour.  We can bring it back
> it there's demand on the day: but right now I'd prefer to keep it as
> an empty slot, to be decided when the time comes.  After all, those FS
> people, they appear to thrive on empty slots!

Ok, and agree that the MM track is pretty dense already ;).

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] [LSF][MM] page allocation & direct reclaim latency
  2011-03-31  9:30           ` Mel Gorman
@ 2011-03-31 16:36             ` Andrea Arcangeli
  0 siblings, 0 replies; 166+ messages in thread
From: Andrea Arcangeli @ 2011-03-31 16:36 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Minchan Kim, lsf, linux-mm

On Thu, Mar 31, 2011 at 10:30:53AM +0100, Mel Gorman wrote:
> Sounds reasonable. I could discuss briefly the scripts I use based on ftrace
> that dump out highorder allocation latencies as it might be useful to others
> if this is the area they are looking at.

I think it's interesting.

> Also sounds good to me.

Ok. BTW, the OOM topic has been removed from schedule for now and it
returned an empty slot, but as Hugh mentioned, it can be brought back
in as needed on the day.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-03-31 14:34           ` Chris Mason
@ 2011-03-31 22:14             ` Dave Chinner
  2011-03-31 23:43               ` Chris Mason
  2011-04-01  1:34               ` Vivek Goyal
  2011-03-31 22:25             ` Vivek Goyal
  1 sibling, 2 replies; 166+ messages in thread
From: Dave Chinner @ 2011-03-31 22:14 UTC (permalink / raw)
  To: Chris Mason
  Cc: Vivek Goyal, Chad Talbott, James Bottomley, lsf, linux-fsdevel

On Thu, Mar 31, 2011 at 10:34:03AM -0400, Chris Mason wrote:
> Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400:
> > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
> > 
> > [..]
> > > > It should not happen that flusher
> > > > thread gets blocked somewhere (trying to get request descriptors on
> > > > request queue)
> > > 
> > > A major design principle of the bdi-flusher threads is that they
> > > are supposed to block when the request queue gets full - that's how
> > > we got rid of all the congestion garbage from the writeback
> > > stack.
> > 
> > Instead of blocking flusher threads, can they voluntarily stop submitting
> > more IO when they realize too much IO is in progress. We aready keep
> > stats of how much IO is under writeback on bdi (BDI_WRITEBACK) and
> > flusher tread can use that?
> 
> We could, but the difficult part is keeping the hardware saturated as
> requests complete.  The voluntarily stopping part is pretty much the
> same thing the congestion code was trying to do.

And it was the bit that was causing most problems. IMO, we don't want to
go back to that single threaded mechanism, especially as we have
no shortage of cores and threads available...

> > > There are plans to move the bdi-flusher threads to work queues, and
> > > once that is done all your concerns about blocking and parallelism
> > > are pretty much gone because it's trivial to have multiple writeback
> > > works in progress at once on the same bdi with that infrastructure.
> > 
> > Will this essentially not nullify the advantage of IO less throttling?
> > I thought that we did not want have multiple threads doing writeback
> > at the same time to avoid number of seeks and achieve better throughput.
> 
> Work queues alone are probably not appropriate, at least for spinning
> storage.  It will introduce seeks into what would have been
> sequential writes.  I had to make the btrfs worker thread pools after
> having a lot of trouble cramming writeback into work queues.

That was before the cmwq infrastructure, right? cmwq changes the
behaviour of workqueues in such a way that they can simply be
thought of as having a thread pool of a specific size....

As a strict translation of the existing one flusher thread per bdi,
then only allowing one work at a time to be issued (i.e. workqueue
concurency of 1) would give the same behaviour without having all
the thread management issues. i.e. regardless of the writeback
parallelism mechanism we have the same issue of managing writeback
to minimise seeking. cmwq just makes the implementation far simpler,
IMO.

As to whether that causes seeks or not, that depends on how we are
driving the concurrent works/threads. If we drive a concurrent work
per dirty cgroup that needs writing back, then we achieve the
concurrency needed to make the IO scheduler appropriately throttle
the IO. For the case of no cgroups, then we still only have a single
writeback work in progress at a time and behaviour is no different
to the current setup. Hence I don't see any particular problem with
using workqueues to acheive the necessary writeback parallelism that
cgroup aware throttling requires....

> > > > or it tries to dispatch too much IO from an inode which
> > > > primarily contains pages from low prio cgroup and high prio cgroup
> > > > task does not get enough pages dispatched to device hence not getting
> > > > any prio over low prio group.
> > > 
> > > That's a writeback scheduling issue independent of how we throttle,
> > > and something we don't do at all right now. Our only decision on
> > > what to write back is based on how low ago the inode was dirtied.
> > > You need to completely rework the dirty inode tracking if you want
> > > to efficiently prioritise writeback between different groups.
> > > 
> > > Given that filesystems don't all use the VFS dirty inode tracking
> > > infrastructure and specific filesystems have different ideas of the
> > > order of writeback, you've got a really difficult problem there.
> > > e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
> > > purposes which will completely screw any sort of prioritised
> > > writeback. Remember the ext3 "fsync = global sync" latency problems?
> > 
> > Ok, so if one issues a fsync when filesystem is mounted in "data=ordered"
> > mode we will flush all the writes to disk before committing meta data.
> > 
> > I have no knowledge of filesystem code so here comes a stupid question.
> > Do multiple fsyncs get completely serialized or they can progress in
> > parallel? IOW, if a fsync is in progress and we slow down the writeback
> > of that inode's pages, can other fsync still make progress without
> > getting stuck behind the previous fsync?
> 
> An fsync has two basic parts
> 
> 1) write the file data pages
> 2a) flush data=ordered in reiserfs/ext34
> 2b) do the real transaction commit
> 
> 
> We can do part one in parallel across any number of writers.  For part
> two, there is only one running transaction.  If the FS is smart, the
> commit will only force down the transaction that last modified the
> file. 50 procs running fsync may only need to trigger one commit.

Right. However the real issue here, I think, is that the IO comes
from a thread not associated with writeback nor is in any way cgroup
aware. IOWs, getting the right context to each block being written
back will be complex and filesystem specific.

The other thing that concerns me is how metadata IO is accounted and
throttled. Doing stuff like creating lots of small files will
generate as much or more metadata IO than data IO, and none of that
will be associated with a cgroup. Indeed, in XFS metadata doesn't
even use the pagecache anymore, and it's written back by a thread
(soon to be a workqueue) deep inside XFS's journalling subsystem, so
it's pretty much impossible to associate that IO with any specific
cgroup.

What happens to that IO? Blocking it arbitrarily can have the same
effect as blocking transaction completion - it can cause the
filesystem to completely stop....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-03-31 14:34           ` Chris Mason
  2011-03-31 22:14             ` Dave Chinner
@ 2011-03-31 22:25             ` Vivek Goyal
  1 sibling, 0 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-03-31 22:25 UTC (permalink / raw)
  To: Chris Mason
  Cc: Dave Chinner, Chad Talbott, James Bottomley, lsf, linux-fsdevel

On Thu, Mar 31, 2011 at 10:34:03AM -0400, Chris Mason wrote:

[..]
> > 
> > For me knowing this is also important in another context of absolute IO
> > throttling.
> > 
> > - If a fsync is in progress and gets throttled at device, what impact it
> >   has on other file system operations. What gets serialized behind it. 
> 
> It depends.  atime updates log inodes and logging needs a transaction
> and transactions sometimes need to wait for the last transaction to
> finish.  So its very possible you'll make anything using the FS appear
> to stop.

I think I have run into this. I created a cgroup and gave ridiculously
low limit of 1bytes/sec and did a fsync. This process blocks. Later 
I did "ls" in the directory where fsync process is blocked and ls
also hangs. Following is backtrace. Looks like atime update led to
some kind of blocking in do_get_write_access().

ls            D ffffffff8160b060     0  5936   5192 0x00000000
ffff880138729c48 0000000000000086 0000000000000000 0000000100000010
0000000000000000 ffff88013fc40100 ffff88013ac7ac00 000000012e5d70f3
ffff8801353d7af8 ffff880138729fd8 000000000000f558 ffff8801353d7af8
Call Trace:
[<ffffffffa035b0dd>] do_get_write_access+0x29d/0x500 [jbd2]
[<ffffffff8108e150>] ? wake_bit_function+0x0/0x50
[<ffffffffa035b491>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
[<ffffffffa03a7328>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
[<ffffffffa0383843>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
[<ffffffffa037c618>] ? call_filldir+0x78/0xe0 [ext4]
[<ffffffffa03838bc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
[<ffffffff81041594>] ? __do_page_fault+0x1e4/0x480
[<ffffffffa0383bb0>] ext4_dirty_inode+0x40/0x60 [ext4]
[<ffffffff8119a21b>] __mark_inode_dirty+0x3b/0x160
[<ffffffff8118acad>] touch_atime+0x12d/0x170
[<ffffffff81184c00>] ? filldir+0x0/0xe0
[<ffffffff81184e96>] vfs_readdir+0xd6/0xe0
[<ffffffff81185009>] sys_getdents+0x89/0xf0
[<ffffffff814dc635>] ? page_fault+0x25/0x30
[<ffffffff8100b172>] system_call_fastpath+0x16/0x1b


The vim process doing fsync trace is as follows. This is waiting for
some IO to finish which has been throttled at the device.

vim           D ffffffff8110d1f0     0  5934   4452 0x00000000
ffff880107e2dcc8 0000000000000086 0000000100000000 0000000000000003
ffff8801351f4538 0000000000000000 ffff880107e2dc68 ffffffff810e7da2
ffff8801353d70b8 ffff880107e2dfd8 000000000000f558 ffff8801353d70b8
Call Trace:
[<ffffffff810e7da2>] ? ring_buffer_lock_reserve+0xa2/0x160
[<ffffffff81098cb9>] ? ktime_get_ts+0xa9/0xe0
[<ffffffff8110d1f0>] ? sync_page+0x0/0x50
[<ffffffff814da123>] io_schedule+0x73/0xc0
[<ffffffff8110d22d>] sync_page+0x3d/0x50
[<ffffffff814da98f>] __wait_on_bit+0x5f/0x90
[<ffffffff8110d3e3>] wait_on_page_bit+0x73/0x80
[<ffffffff8108e150>] ? wake_bit_function+0x0/0x50
[<ffffffff81123195>] ? pagevec_lookup_tag+0x25/0x40
[<ffffffff8110d7fb>] wait_on_page_writeback_range+0xfb/0x190
[<ffffffff81122341>] ? do_writepages+0x21/0x40
[<ffffffff8110d94b>] ? __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8110d9c8>] filemap_write_and_wait_range+0x78/0x90
[<ffffffff8119f5fe>] vfs_fsync_range+0x7e/0xe0
[<ffffffff8119f6cd>] vfs_fsync+0x1d/0x20
[<ffffffff8119f70e>] do_fsync+0x3e/0x60
[<ffffffff8119f760>] sys_fsync+0x10/0x20
[<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-03-31 14:50           ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Greg Thelen
@ 2011-03-31 22:27             ` Dave Chinner
  2011-04-01 17:18               ` Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Dave Chinner @ 2011-03-31 22:27 UTC (permalink / raw)
  To: Greg Thelen; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel

On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
> On Thu, Mar 31, 2011 at 7:16 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
> >> > and also try to select inodes intelligently (cgroup aware manner).
> >>
> >> Such selection algorithms would need to be able to handle hundreds
> >> of thousands of newly dirtied inodes per second so sorting and
> >> selecting them efficiently will be a major issue...
> >
> > There was proposal of memory cgroup maintaining a per memory cgroup per
> > bdi structure which will keep a list of inodes which need writeback
> > from that cgroup.
> 
> FYI, I have patches which implement this per memcg per bdi dirty inode
> list.  I want to debug a few issues before posting an RFC series.  But
> it is getting close.

That's all well and good, but we're still trying to work out how to
scale this list in a sane fashion. We just broke it out into it's
own global lock, so it's going to change soon so that the list+lock
is not a contention point on large machines. Just breaking it into a
list per cgroup doesn't solve this problem - it just adds another
container to the list.

Also, you have the problem that some filesystems don't use the bdi
dirty inode list for all the dirty inodes in the filesytem - XFS has
recent changed to only track VFS dirtied inodes in that list, intead
using it's own active item list to track all logged modifications.
IIUC, btrfs and ext3/4 do something similar as well. My current plans
are to modify the dirty inode code to allow filesystems to say tot
the VFS "don't track this dirty inode - I'm doing it myself" so that
we can reduce the VFS dirty inode list to only those inodes with
dirty pages....

> > So any cgroup looking for a writeback will queue up this structure on
> > bdi and flusher threads can walk though this list and figure out
> > which memory cgroups and which inodes within memory cgroup need to
> > be written back.
> 
> The way these memcg-writeback patches are currently implemented is
> that when a memcg is over background dirty limits, it will queue the
> memcg a on a global over_bg_limit list and wakeup bdi flusher.

No global lists and locks, please. That's one of the big problems
with the current foreground IO based throttling - it _hammers_ the
global inode writeback list locks such that one an 8p machine we can
be wasted 2-3 entire CPUs just contending on it when all 8 CPUs are
trying to throttle and write back at the same time.....

> There
> is no context (memcg or otherwise) given to the bdi flusher.  After
> the bdi flusher checks system-wide background limits, it uses the
> over_bg_limit list to find (and rotate) an over limit memcg.  Using
> the memcg, then the per memcg per bdi dirty inode list is walked to
> find inode pages to writeback.  Once the memcg dirty memory usage
> drops below the memcg-thresh, the memcg is removed from the global
> over_bg_limit list.

If you want controlled hand-off of writeback, you need to pass the
memcg that triggered the throttling directly to the bdi. You already
know what both the bdi and memcg that need writeback are. Yes, this
needs concurrency at the BDI flush level to handle, but see my
previous email in this thread for that....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-03-31 22:14             ` Dave Chinner
@ 2011-03-31 23:43               ` Chris Mason
  2011-04-01  0:55                 ` Dave Chinner
  2011-04-01  1:34               ` Vivek Goyal
  1 sibling, 1 reply; 166+ messages in thread
From: Chris Mason @ 2011-03-31 23:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vivek Goyal, Chad Talbott, James Bottomley, lsf, linux-fsdevel

Excerpts from Dave Chinner's message of 2011-03-31 18:14:25 -0400:
> On Thu, Mar 31, 2011 at 10:34:03AM -0400, Chris Mason wrote:
> > Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400:
> > > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
> > > 
> > > [..]
> > > > > It should not happen that flusher
> > > > > thread gets blocked somewhere (trying to get request descriptors on
> > > > > request queue)
> > > > 
> > > > A major design principle of the bdi-flusher threads is that they
> > > > are supposed to block when the request queue gets full - that's how
> > > > we got rid of all the congestion garbage from the writeback
> > > > stack.
> > > 
> > > Instead of blocking flusher threads, can they voluntarily stop submitting
> > > more IO when they realize too much IO is in progress. We aready keep
> > > stats of how much IO is under writeback on bdi (BDI_WRITEBACK) and
> > > flusher tread can use that?
> > 
> > We could, but the difficult part is keeping the hardware saturated as
> > requests complete.  The voluntarily stopping part is pretty much the
> > same thing the congestion code was trying to do.
> 
> And it was the bit that was causing most problems. IMO, we don't want to
> go back to that single threaded mechanism, especially as we have
> no shortage of cores and threads available...

Getting rid of the congestion code was my favorite part of the per-bdi
work.

> 
> > > > There are plans to move the bdi-flusher threads to work queues, and
> > > > once that is done all your concerns about blocking and parallelism
> > > > are pretty much gone because it's trivial to have multiple writeback
> > > > works in progress at once on the same bdi with that infrastructure.
> > > 
> > > Will this essentially not nullify the advantage of IO less throttling?
> > > I thought that we did not want have multiple threads doing writeback
> > > at the same time to avoid number of seeks and achieve better throughput.
> > 
> > Work queues alone are probably not appropriate, at least for spinning
> > storage.  It will introduce seeks into what would have been
> > sequential writes.  I had to make the btrfs worker thread pools after
> > having a lot of trouble cramming writeback into work queues.
> 
> That was before the cmwq infrastructure, right? cmwq changes the
> behaviour of workqueues in such a way that they can simply be
> thought of as having a thread pool of a specific size....
> 
> As a strict translation of the existing one flusher thread per bdi,
> then only allowing one work at a time to be issued (i.e. workqueue
> concurency of 1) would give the same behaviour without having all
> the thread management issues. i.e. regardless of the writeback
> parallelism mechanism we have the same issue of managing writeback
> to minimise seeking. cmwq just makes the implementation far simpler,
> IMO.
> 
> As to whether that causes seeks or not, that depends on how we are
> driving the concurrent works/threads. If we drive a concurrent work
> per dirty cgroup that needs writing back, then we achieve the
> concurrency needed to make the IO scheduler appropriately throttle
> the IO. For the case of no cgroups, then we still only have a single
> writeback work in progress at a time and behaviour is no different
> to the current setup. Hence I don't see any particular problem with
> using workqueues to acheive the necessary writeback parallelism that
> cgroup aware throttling requires....

Yes, as long as we aren't trying to shotgun style spread the
inodes across a bunch of threads, it should work well enough.  The trick
will just be making sure we don't end up with a lot of inode
interleaving in the delalloc allocations.

> 
> > > > > or it tries to dispatch too much IO from an inode which
> > > > > primarily contains pages from low prio cgroup and high prio cgroup
> > > > > task does not get enough pages dispatched to device hence not getting
> > > > > any prio over low prio group.
> > > > 
> > > > That's a writeback scheduling issue independent of how we throttle,
> > > > and something we don't do at all right now. Our only decision on
> > > > what to write back is based on how low ago the inode was dirtied.
> > > > You need to completely rework the dirty inode tracking if you want
> > > > to efficiently prioritise writeback between different groups.
> > > > 
> > > > Given that filesystems don't all use the VFS dirty inode tracking
> > > > infrastructure and specific filesystems have different ideas of the
> > > > order of writeback, you've got a really difficult problem there.
> > > > e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
> > > > purposes which will completely screw any sort of prioritised
> > > > writeback. Remember the ext3 "fsync = global sync" latency problems?
> > > 
> > > Ok, so if one issues a fsync when filesystem is mounted in "data=ordered"
> > > mode we will flush all the writes to disk before committing meta data.
> > > 
> > > I have no knowledge of filesystem code so here comes a stupid question.
> > > Do multiple fsyncs get completely serialized or they can progress in
> > > parallel? IOW, if a fsync is in progress and we slow down the writeback
> > > of that inode's pages, can other fsync still make progress without
> > > getting stuck behind the previous fsync?
> > 
> > An fsync has two basic parts
> > 
> > 1) write the file data pages
> > 2a) flush data=ordered in reiserfs/ext34
> > 2b) do the real transaction commit
> > 
> > 
> > We can do part one in parallel across any number of writers.  For part
> > two, there is only one running transaction.  If the FS is smart, the
> > commit will only force down the transaction that last modified the
> > file. 50 procs running fsync may only need to trigger one commit.
> 
> Right. However the real issue here, I think, is that the IO comes
> from a thread not associated with writeback nor is in any way cgroup
> aware. IOWs, getting the right context to each block being written
> back will be complex and filesystem specific.

The ext3 style data=ordered requires that we give the same amount of
bandwidth to all the data=ordered IO during commit.  Otherwise we end up
making the commit wait for some poor page in the data=ordered list and
that slows everyone down.  ick.

> 
> The other thing that concerns me is how metadata IO is accounted and
> throttled. Doing stuff like creating lots of small files will
> generate as much or more metadata IO than data IO, and none of that
> will be associated with a cgroup. Indeed, in XFS metadata doesn't
> even use the pagecache anymore, and it's written back by a thread
> (soon to be a workqueue) deep inside XFS's journalling subsystem, so
> it's pretty much impossible to associate that IO with any specific
> cgroup.
> 
> What happens to that IO? Blocking it arbitrarily can have the same
> effect as blocking transaction completion - it can cause the
> filesystem to completely stop....

ick again, it's the same problem as the data=ordered stuff exactly.

-chris

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-03-31 23:43               ` Chris Mason
@ 2011-04-01  0:55                 ` Dave Chinner
  0 siblings, 0 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-01  0:55 UTC (permalink / raw)
  To: Chris Mason
  Cc: Vivek Goyal, Chad Talbott, James Bottomley, lsf, linux-fsdevel

On Thu, Mar 31, 2011 at 07:43:27PM -0400, Chris Mason wrote:
> Excerpts from Dave Chinner's message of 2011-03-31 18:14:25 -0400:
> > On Thu, Mar 31, 2011 at 10:34:03AM -0400, Chris Mason wrote:
> > > Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400:
> > > > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
> > > > > There are plans to move the bdi-flusher threads to work queues, and
> > > > > once that is done all your concerns about blocking and parallelism
> > > > > are pretty much gone because it's trivial to have multiple writeback
> > > > > works in progress at once on the same bdi with that infrastructure.
> > > > 
> > > > Will this essentially not nullify the advantage of IO less throttling?
> > > > I thought that we did not want have multiple threads doing writeback
> > > > at the same time to avoid number of seeks and achieve better throughput.
> > > 
> > > Work queues alone are probably not appropriate, at least for spinning
> > > storage.  It will introduce seeks into what would have been
> > > sequential writes.  I had to make the btrfs worker thread pools after
> > > having a lot of trouble cramming writeback into work queues.
> > 
> > That was before the cmwq infrastructure, right? cmwq changes the
> > behaviour of workqueues in such a way that they can simply be
> > thought of as having a thread pool of a specific size....
> > 
> > As a strict translation of the existing one flusher thread per bdi,
> > then only allowing one work at a time to be issued (i.e. workqueue
> > concurency of 1) would give the same behaviour without having all
> > the thread management issues. i.e. regardless of the writeback
> > parallelism mechanism we have the same issue of managing writeback
> > to minimise seeking. cmwq just makes the implementation far simpler,
> > IMO.
> > 
> > As to whether that causes seeks or not, that depends on how we are
> > driving the concurrent works/threads. If we drive a concurrent work
> > per dirty cgroup that needs writing back, then we achieve the
> > concurrency needed to make the IO scheduler appropriately throttle
> > the IO. For the case of no cgroups, then we still only have a single
> > writeback work in progress at a time and behaviour is no different
> > to the current setup. Hence I don't see any particular problem with
> > using workqueues to acheive the necessary writeback parallelism that
> > cgroup aware throttling requires....
> 
> Yes, as long as we aren't trying to shotgun style spread the
> inodes across a bunch of threads, it should work well enough.  The trick
> will just be making sure we don't end up with a lot of inode
> interleaving in the delalloc allocations.

That's a problem for any concurrent writeback mechanism as it passes
through the filesystem. It comes down to filesystems also needing to
have either concurrency- or cgroup-aware allocation mechanisms. It's
just another piece of the puzzle, really.

In the case of XFS, cgroup awareness could be as simple as as simple
as associating each cgroup with a specific allocation group and
keeping each cgroup as isolated as possible. There is precedence for
doing this in XFS - the filestreams allocator makes these sorts of
dynamic associations on a per-directory basis.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-03-31 22:14             ` Dave Chinner
  2011-03-31 23:43               ` Chris Mason
@ 2011-04-01  1:34               ` Vivek Goyal
  2011-04-01  4:36                 ` Dave Chinner
  1 sibling, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-01  1:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Chad Talbott, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 01, 2011 at 09:14:25AM +1100, Dave Chinner wrote:

[..]
> > An fsync has two basic parts
> > 
> > 1) write the file data pages
> > 2a) flush data=ordered in reiserfs/ext34
> > 2b) do the real transaction commit
> > 
> > 
> > We can do part one in parallel across any number of writers.  For part
> > two, there is only one running transaction.  If the FS is smart, the
> > commit will only force down the transaction that last modified the
> > file. 50 procs running fsync may only need to trigger one commit.
> 
> Right. However the real issue here, I think, is that the IO comes
> from a thread not associated with writeback nor is in any way cgroup
> aware. IOWs, getting the right context to each block being written
> back will be complex and filesystem specific.
> 
> The other thing that concerns me is how metadata IO is accounted and
> throttled. Doing stuff like creating lots of small files will
> generate as much or more metadata IO than data IO, and none of that
> will be associated with a cgroup. Indeed, in XFS metadata doesn't
> even use the pagecache anymore, and it's written back by a thread
> (soon to be a workqueue) deep inside XFS's journalling subsystem, so
> it's pretty much impossible to associate that IO with any specific
> cgroup.
> 
> What happens to that IO? Blocking it arbitrarily can have the same
> effect as blocking transaction completion - it can cause the
> filesystem to completely stop....

Dave,

As of today, the cgroup/context of IO is decided from the IO submitting
thread context. So any IO submitted by kernel threads (flusher, kjournald,
workqueue threads) goes to root group IO which should remain unthrottled.
(It is not a good idea to put throttling rules for root group).

Now any meta data operation happening in the context of process will
still be subject to throttling (is there any?). If that's a concern,
can filesystem mark that bio (REQ_META?) and throttling logic can possibly
let these bio pass through.

Determining the cgroup/context from submitting process has the issue of
that any writeback IO is not throttled and we are looking for a way to
control buffered writes also. If we start determining the cgroup from
some information stored in page_cgroup, then we are more likely to
run into issues of priority inversion (filesystem in ordered mode flushing
data first before committing meta data changes). So should we throttle
buffered writes when page cache is being dirtied and not when these
writes are being written back to device.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-04-01  1:34               ` Vivek Goyal
@ 2011-04-01  4:36                 ` Dave Chinner
  2011-04-01  6:32                   ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Christoph Hellwig
  2011-04-01 14:49                   ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] " Vivek Goyal
  0 siblings, 2 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-01  4:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Chris Mason, Chad Talbott, James Bottomley, lsf, linux-fsdevel

On Thu, Mar 31, 2011 at 09:34:24PM -0400, Vivek Goyal wrote:
> On Fri, Apr 01, 2011 at 09:14:25AM +1100, Dave Chinner wrote:
> 
> [..]
> > > An fsync has two basic parts
> > > 
> > > 1) write the file data pages
> > > 2a) flush data=ordered in reiserfs/ext34
> > > 2b) do the real transaction commit
> > > 
> > > 
> > > We can do part one in parallel across any number of writers.  For part
> > > two, there is only one running transaction.  If the FS is smart, the
> > > commit will only force down the transaction that last modified the
> > > file. 50 procs running fsync may only need to trigger one commit.
> > 
> > Right. However the real issue here, I think, is that the IO comes
> > from a thread not associated with writeback nor is in any way cgroup
> > aware. IOWs, getting the right context to each block being written
> > back will be complex and filesystem specific.
> > 
> > The other thing that concerns me is how metadata IO is accounted and
> > throttled. Doing stuff like creating lots of small files will
> > generate as much or more metadata IO than data IO, and none of that
> > will be associated with a cgroup. Indeed, in XFS metadata doesn't
> > even use the pagecache anymore, and it's written back by a thread
> > (soon to be a workqueue) deep inside XFS's journalling subsystem, so
> > it's pretty much impossible to associate that IO with any specific
> > cgroup.
> > 
> > What happens to that IO? Blocking it arbitrarily can have the same
> > effect as blocking transaction completion - it can cause the
> > filesystem to completely stop....
> 
> Dave,
> 
> As of today, the cgroup/context of IO is decided from the IO submitting
> thread context. So any IO submitted by kernel threads (flusher, kjournald,
> workqueue threads) goes to root group IO which should remain unthrottled.
> (It is not a good idea to put throttling rules for root group).
> 
> Now any meta data operation happening in the context of process will
> still be subject to throttling (is there any?).

Certainly - almost all metadata _reads_ will occur in process
context, though for XFS _most_ writes occur in kernel thread context.
That being said, we can still get kernel threads hung up on metadata
read IO that has been throttled in process context.

e.g. a process is creating a new inode, which causes allocation to
occur, which triggers a read of a free space btree block, which gets
throttled.  Writeback comes along, tries to do delayed allocation,
gets hung up trying to allocate out of the same AG that is locked by
the process creating a new inode. A signle allocation can lock
multiple AGs, and so if we get enough backed-up allocations this can
cause all AGs in the filesystem to become locked. AT this point no
new allocation can complete until the throttled IO is submitted,
completed and the allocation is committed and the AG unlocked....

> If that's a concern,
> can filesystem mark that bio (REQ_META?) and throttling logic can possibly
> let these bio pass through.

We already tag most metadata IO in this way.

However, you can't just not throttle metadata IO. e.g. a process
doing a directory traversal (e.g. a find) will issue hundreds of IOs
per second so if you don't throttle them it will adversely affect
the throughput of other groups that you are trying to guarantee a
certain throughput or iops rate for. Indeed, not throttling metadata
writes will seriously harm throughput for controlled cgroups when
the log fills up and the filesystem pushes out thousands metadata
IOs in a very short period of time.

Yet if we combine that with the problem that anywhere you delay
metadata IO for arbitrarily long periods of time (read or write) via
priority based mechanisms, you risk causing a train-smash of blocked
processes all waiting for the throttled IO to complete. And that will
seriously harm throughput for controlled cgroups because they can't
make any modifications to the filesystem.

I'm not sure if there is any middle ground here - I can't see any at
this point...

> Determining the cgroup/context from submitting process has the
> issue of that any writeback IO is not throttled and we are looking
> for a way to control buffered writes also. If we start determining
> the cgroup from some information stored in page_cgroup, then we
> are more likely to run into issues of priority inversion
> (filesystem in ordered mode flushing data first before committing
> meta data changes).  So should we throttle
> buffered writes when page cache is being dirtied and not when
> these writes are being written back to device.

I'm not sure what you mean by this paragraph - AFAICT, this
is exactly the way we throttle buffered writes right now.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-01  4:36                 ` Dave Chinner
@ 2011-04-01  6:32                   ` Christoph Hellwig
  2011-04-01  7:23                     ` Dave Chinner
  2011-04-01 14:49                   ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] " Vivek Goyal
  1 sibling, 1 reply; 166+ messages in thread
From: Christoph Hellwig @ 2011-04-01  6:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 01, 2011 at 03:36:05PM +1100, Dave Chinner wrote:
> > If that's a concern,
> > can filesystem mark that bio (REQ_META?) and throttling logic can possibly
> > let these bio pass through.
> 
> We already tag most metadata IO in this way.

Actually we don't tag any I/O that way right now.  That's mostly
because REQ_META assumes it's synchronous I/O and cfg and the block
layer give id additional priority, while in XFS metadata writes
are mostly asynchronous.  We'll need a properly defined REQ_META
to use it, which currently is not the case.


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-01  6:32                   ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Christoph Hellwig
@ 2011-04-01  7:23                     ` Dave Chinner
  2011-04-01 12:56                       ` Christoph Hellwig
  0 siblings, 1 reply; 166+ messages in thread
From: Dave Chinner @ 2011-04-01  7:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 01, 2011 at 02:32:54AM -0400, Christoph Hellwig wrote:
> On Fri, Apr 01, 2011 at 03:36:05PM +1100, Dave Chinner wrote:
> > > If that's a concern,
> > > can filesystem mark that bio (REQ_META?) and throttling logic can possibly
> > > let these bio pass through.
> > 
> > We already tag most metadata IO in this way.
> 
> Actually we don't tag any I/O that way right now.  That's mostly
> because REQ_META assumes it's synchronous I/O and cfg and the block
> layer give id additional priority, while in XFS metadata writes
> are mostly asynchronous.  We'll need a properly defined REQ_META
> to use it, which currently is not the case.

Oh, I misread the code in _xfs_buf_read that fiddles with
_XBF_RUN_QUEUES. That flag is dead then, as is the XBF_LOG_BUFFER
code  which appears to have been superceded by the new XBF_ORDERED
code. Definitely needs cleaning up.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-01  7:23                     ` Dave Chinner
@ 2011-04-01 12:56                       ` Christoph Hellwig
  2011-04-21 15:07                         ` Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Christoph Hellwig @ 2011-04-01 12:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Vivek Goyal, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 01, 2011 at 06:23:48PM +1100, Dave Chinner wrote:
> Oh, I misread the code in _xfs_buf_read that fiddles with
> _XBF_RUN_QUEUES. That flag is dead then, as is the XBF_LOG_BUFFER
> code  which appears to have been superceded by the new XBF_ORDERED
> code. Definitely needs cleaning up.

Yes, that's been on my todo list for a while, but I first want a sane
defintion of REQ_META in the block layer.


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
  2011-04-01  4:36                 ` Dave Chinner
  2011-04-01  6:32                   ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Christoph Hellwig
@ 2011-04-01 14:49                   ` Vivek Goyal
  1 sibling, 0 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-01 14:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Chad Talbott, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 01, 2011 at 03:36:05PM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2011 at 09:34:24PM -0400, Vivek Goyal wrote:
> > On Fri, Apr 01, 2011 at 09:14:25AM +1100, Dave Chinner wrote:
> > 
> > [..]
> > > > An fsync has two basic parts
> > > > 
> > > > 1) write the file data pages
> > > > 2a) flush data=ordered in reiserfs/ext34
> > > > 2b) do the real transaction commit
> > > > 
> > > > 
> > > > We can do part one in parallel across any number of writers.  For part
> > > > two, there is only one running transaction.  If the FS is smart, the
> > > > commit will only force down the transaction that last modified the
> > > > file. 50 procs running fsync may only need to trigger one commit.
> > > 
> > > Right. However the real issue here, I think, is that the IO comes
> > > from a thread not associated with writeback nor is in any way cgroup
> > > aware. IOWs, getting the right context to each block being written
> > > back will be complex and filesystem specific.
> > > 
> > > The other thing that concerns me is how metadata IO is accounted and
> > > throttled. Doing stuff like creating lots of small files will
> > > generate as much or more metadata IO than data IO, and none of that
> > > will be associated with a cgroup. Indeed, in XFS metadata doesn't
> > > even use the pagecache anymore, and it's written back by a thread
> > > (soon to be a workqueue) deep inside XFS's journalling subsystem, so
> > > it's pretty much impossible to associate that IO with any specific
> > > cgroup.
> > > 
> > > What happens to that IO? Blocking it arbitrarily can have the same
> > > effect as blocking transaction completion - it can cause the
> > > filesystem to completely stop....
> > 
> > Dave,
> > 
> > As of today, the cgroup/context of IO is decided from the IO submitting
> > thread context. So any IO submitted by kernel threads (flusher, kjournald,
> > workqueue threads) goes to root group IO which should remain unthrottled.
> > (It is not a good idea to put throttling rules for root group).
> > 
> > Now any meta data operation happening in the context of process will
> > still be subject to throttling (is there any?).
> 
> Certainly - almost all metadata _reads_ will occur in process
> context, though for XFS _most_ writes occur in kernel thread context.
> That being said, we can still get kernel threads hung up on metadata
> read IO that has been throttled in process context.
> 
> e.g. a process is creating a new inode, which causes allocation to
> occur, which triggers a read of a free space btree block, which gets
> throttled.  Writeback comes along, tries to do delayed allocation,
> gets hung up trying to allocate out of the same AG that is locked by
> the process creating a new inode. A signle allocation can lock
> multiple AGs, and so if we get enough backed-up allocations this can
> cause all AGs in the filesystem to become locked. AT this point no
> new allocation can complete until the throttled IO is submitted,
> completed and the allocation is committed and the AG unlocked....
> 
> > If that's a concern,
> > can filesystem mark that bio (REQ_META?) and throttling logic can possibly
> > let these bio pass through.
> 
> We already tag most metadata IO in this way.
> 
> However, you can't just not throttle metadata IO. e.g. a process
> doing a directory traversal (e.g. a find) will issue hundreds of IOs
> per second so if you don't throttle them it will adversely affect
> the throughput of other groups that you are trying to guarantee a
> certain throughput or iops rate for. Indeed, not throttling metadata
> writes will seriously harm throughput for controlled cgroups when
> the log fills up and the filesystem pushes out thousands metadata
> IOs in a very short period of time.
> 
> Yet if we combine that with the problem that anywhere you delay
> metadata IO for arbitrarily long periods of time (read or write) via
> priority based mechanisms, you risk causing a train-smash of blocked
> processes all waiting for the throttled IO to complete. And that will
> seriously harm throughput for controlled cgroups because they can't
> make any modifications to the filesystem.
> 
> I'm not sure if there is any middle ground here - I can't see any at
> this point...

This is indeed a tricky situation. Especially the case of write getting
blocked behind reads. I think virtual machine is best use
case where one can avoid using host's file system and avoid all the
issues related to serialization in host file system.

Or we can probably advise not to set very low limits on any cgroup. That
way even if things get serialized, once in a while, it will be resolved
soon. It hurts scalability and performance though.

Or modify file systems where they can mark *selective* meta data IO as
REQ_NOTHROTTLE. If filesystem can determine that a write is dependent on
read meta data request, then mark that read as REQ_NOTHROTTLE. Like in
above example, we are performing a read of free space blktree to do
an allocation of inode. 

Or live with reduced isolation by not throttling meta data IO.
 
> 
> > Determining the cgroup/context from submitting process has the
> > issue of that any writeback IO is not throttled and we are looking
> > for a way to control buffered writes also. If we start determining
> > the cgroup from some information stored in page_cgroup, then we
> > are more likely to run into issues of priority inversion
> > (filesystem in ordered mode flushing data first before committing
> > meta data changes).  So should we throttle
> > buffered writes when page cache is being dirtied and not when
> > these writes are being written back to device.
> 
> I'm not sure what you mean by this paragraph - AFAICT, this
> is exactly the way we throttle buffered writes right now.

Actually I was referring to throttling in terms of IO rate (bytes_per_second
or io_per_second). Notion of dirty_ratio or dirty_bytes for throttling
itself is not sufficient.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 11:28         ` Ric Wheeler
  2011-03-30 14:07           ` Chris Mason
@ 2011-04-01 15:19           ` Ted Ts'o
  2011-04-01 16:30             ` Amir Goldstein
                               ` (3 more replies)
  1 sibling, 4 replies; 166+ messages in thread
From: Ted Ts'o @ 2011-04-01 15:19 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Dave Chinner, lsf, linux-scsi, James Bottomley,
	device-mapper development, linux-fsdevel, Ric Wheeler

On Wed, Mar 30, 2011 at 07:28:34AM -0400, Ric Wheeler wrote:
> 
> What possible semantics could you have?
> 
> If you ever write concurrently from multiple processes without
> locking, you clearly are at the mercy of the scheduler and the
> underlying storage which could fragment a single write into multiple
> IO's sent to the backend device.
> 
> I would agree with Dave, let's not make it overly complicated or try
> to give people "atomic" unbounded size writes just because they set
> the O_DIRECT flag :)

I just want to have it written down.  After getting burned with ext3's
semantics promising more than what the standard guaranteed, I've just
gotten paranoid about application programmers getting upset when
things change on them --- and in the case of direct I/O, this stuff
isn't even clearly documented anywhere official.

I just think it's best that we document it the fact that concurrent
DIO's to the same region may result in completely arbitrary behaviour,
make sure it's well publicized to likely users (and I'm more worried
about the open source code bases than Oracle DB), and then call it a day.

The closest place that we have to any official documentation about
O_DIRECT semantics is the open(2) man page in the Linux manpages, and
it doesn't say anything about this.  It does give a recommendation
against not mixing buffered and O_DIRECT accesses to the same file,
but it does promise that things will work in that case.  (Even if it
does, do we really want to make the promise that it will always work?)

In any case, adding some text in that paragraph, or just after that
paragraph, to the effect that two concurrent DIO accesses to the same
file block, even by two different processes will result in undefined
behavior would be a good start.

      	       	      	   	      	     - Ted

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 15:19           ` Ted Ts'o
@ 2011-04-01 16:30             ` Amir Goldstein
  2011-04-01 21:46               ` Joel Becker
  2011-04-01 21:43             ` Joel Becker
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 166+ messages in thread
From: Amir Goldstein @ 2011-04-01 16:30 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ric Wheeler, Dave Chinner, lsf, linux-scsi, James Bottomley,
	device-mapper development, linux-fsdevel, Ric Wheeler,
	Yongqiang Yang

On Fri, Apr 1, 2011 at 8:19 AM, Ted Ts'o <tytso@mit.edu> wrote:
> On Wed, Mar 30, 2011 at 07:28:34AM -0400, Ric Wheeler wrote:
>>
>> What possible semantics could you have?
>>
>> If you ever write concurrently from multiple processes without
>> locking, you clearly are at the mercy of the scheduler and the
>> underlying storage which could fragment a single write into multiple
>> IO's sent to the backend device.
>>
>> I would agree with Dave, let's not make it overly complicated or try
>> to give people "atomic" unbounded size writes just because they set
>> the O_DIRECT flag :)
>
> I just want to have it written down.  After getting burned with ext3's
> semantics promising more than what the standard guaranteed, I've just
> gotten paranoid about application programmers getting upset when
> things change on them --- and in the case of direct I/O, this stuff
> isn't even clearly documented anywhere official.
>
> I just think it's best that we document it the fact that concurrent
> DIO's to the same region may result in completely arbitrary behaviour,
> make sure it's well publicized to likely users (and I'm more worried
> about the open source code bases than Oracle DB), and then call it a day.
>
> The closest place that we have to any official documentation about
> O_DIRECT semantics is the open(2) man page in the Linux manpages, and
> it doesn't say anything about this.  It does give a recommendation
> against not mixing buffered and O_DIRECT accesses to the same file,
> but it does promise that things will work in that case.  (Even if it
> does, do we really want to make the promise that it will always work?)

when writing DIO to indirect mapped file holes, we fall back to buffered write
(so we won't expose stale data in the case of a crash) concurrent DIO reads
to that file (before data writeback) can expose stale data. right?
do you consider this case mixing buffered and DIO access?
do you consider that as a problem?

the case interests me because I am afraid we may have to use the fallback
trick for extent move on write from DIO (we did so in current
implementation anyway).

of course, if we end up implementing in-memory extent tree, we will probably be
able to cope with DIO MOW without fallback to buffered IO.

>
> In any case, adding some text in that paragraph, or just after that
> paragraph, to the effect that two concurrent DIO accesses to the same
> file block, even by two different processes will result in undefined
> behavior would be a good start.
>
>                                             - Ted
> _______________________________________________
> Lsf mailing list
> Lsf@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/lsf
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-03-31 22:27             ` Dave Chinner
@ 2011-04-01 17:18               ` Vivek Goyal
  2011-04-01 19:57                 ` [LSF]: fc_rport attributes to further populate HBAAPIv2 Giridhar Malavali
  2011-04-01 21:49                 ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner
  0 siblings, 2 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-01 17:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
> > On Thu, Mar 31, 2011 at 7:16 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
> > >> > and also try to select inodes intelligently (cgroup aware manner).
> > >>
> > >> Such selection algorithms would need to be able to handle hundreds
> > >> of thousands of newly dirtied inodes per second so sorting and
> > >> selecting them efficiently will be a major issue...
> > >
> > > There was proposal of memory cgroup maintaining a per memory cgroup per
> > > bdi structure which will keep a list of inodes which need writeback
> > > from that cgroup.
> > 
> > FYI, I have patches which implement this per memcg per bdi dirty inode
> > list.  I want to debug a few issues before posting an RFC series.  But
> > it is getting close.
> 
> That's all well and good, but we're still trying to work out how to
> scale this list in a sane fashion. We just broke it out into it's
> own global lock, so it's going to change soon so that the list+lock
> is not a contention point on large machines. Just breaking it into a
> list per cgroup doesn't solve this problem - it just adds another
> container to the list.
> 
> Also, you have the problem that some filesystems don't use the bdi
> dirty inode list for all the dirty inodes in the filesytem - XFS has
> recent changed to only track VFS dirtied inodes in that list, intead
> using it's own active item list to track all logged modifications.
> IIUC, btrfs and ext3/4 do something similar as well. My current plans
> are to modify the dirty inode code to allow filesystems to say tot
> the VFS "don't track this dirty inode - I'm doing it myself" so that
> we can reduce the VFS dirty inode list to only those inodes with
> dirty pages....
> 
> > > So any cgroup looking for a writeback will queue up this structure on
> > > bdi and flusher threads can walk though this list and figure out
> > > which memory cgroups and which inodes within memory cgroup need to
> > > be written back.
> > 
> > The way these memcg-writeback patches are currently implemented is
> > that when a memcg is over background dirty limits, it will queue the
> > memcg a on a global over_bg_limit list and wakeup bdi flusher.
> 
> No global lists and locks, please. That's one of the big problems
> with the current foreground IO based throttling - it _hammers_ the
> global inode writeback list locks such that one an 8p machine we can
> be wasted 2-3 entire CPUs just contending on it when all 8 CPUs are
> trying to throttle and write back at the same time.....
> 
> > There
> > is no context (memcg or otherwise) given to the bdi flusher.  After
> > the bdi flusher checks system-wide background limits, it uses the
> > over_bg_limit list to find (and rotate) an over limit memcg.  Using
> > the memcg, then the per memcg per bdi dirty inode list is walked to
> > find inode pages to writeback.  Once the memcg dirty memory usage
> > drops below the memcg-thresh, the memcg is removed from the global
> > over_bg_limit list.
> 
> If you want controlled hand-off of writeback, you need to pass the
> memcg that triggered the throttling directly to the bdi. You already
> know what both the bdi and memcg that need writeback are. Yes, this
> needs concurrency at the BDI flush level to handle, but see my
> previous email in this thread for that....
> 

Even with memcg being passed around I don't think that we get rid of
global list lock. The reason being that inodes are not exclusive to
the memory cgroups. Multiple memory cgroups might be writting to same
inode. So inode still remains in the global list and memory cgroups
kind of will have pointer to it. So to start writeback on an inode
you still shall have to take global lock, IIUC.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* [LSF]: fc_rport attributes to further populate HBAAPIv2
  2011-04-01 17:18               ` Vivek Goyal
@ 2011-04-01 19:57                 ` Giridhar Malavali
  2011-04-01 21:49                 ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner
  1 sibling, 0 replies; 166+ messages in thread
From: Giridhar Malavali @ 2011-04-01 19:57 UTC (permalink / raw)
  To: James Bottomley; +Cc: lsf, linux-scsi, Chad Dupuis, Andrew Vasquez

James,

I would also like to discuss the following topic.

The patches were submitted to linux-scsi earlier which basically includes
fc_rport attributes to FC transport.
Here are the links to previous submission

http://marc.info/?l=linux-scsi&m=128828294426031&w=2
http://marc.info/?l=linux-scsi&m=128828255825328&w=2
http://marc.info/?l=linux-scsi&m=128828263025468&w=2


>From review comments from community (Christof S and James Smart), it is
felt that a common implementation either in transport or block layer would
be better to avoid all LLD implementing this. In an effort the BSG
interface is extended to send in-kernel BSG commands to query name server
and get the required information to populate.

Here are the links for the review comments received

http://marc.info/?l=linux-scsi&m=128834912816011&w=2
http://marc.info/?l=linux-scsi&m=128838095128776&w=2
http://marc.info/?l=linux-scsi&m=128871208700376&w=2


I would like to discuss to tap more idea from community for any better
approach. If not, to continue with the present approach.

Thanks,
Giridhar Malavali





This message and any attached documents contain information from QLogic Corporation or its wholly-owned subsidiaries that may be confidential. If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message.


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-31  1:00         ` Joel Becker
@ 2011-04-01 21:34           ` Mingming Cao
  2011-04-01 21:49             ` Joel Becker
  0 siblings, 1 reply; 166+ messages in thread
From: Mingming Cao @ 2011-04-01 21:34 UTC (permalink / raw)
  To: Joel Becker
  Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi, device-mapper development

On Wed, 2011-03-30 at 18:00 -0700, Joel Becker wrote: 
> On Wed, Mar 30, 2011 at 02:49:58PM -0700, Mingming Cao wrote:
> > On Wed, 2011-03-30 at 13:17 +1100, Dave Chinner wrote:
> > > For direct IO, the IO lock is always taken in shared mode, so we can
> > > have concurrent read and write operations taking place at once
> > > regardless of the offset into the file.
> > > 
> > 
> > thanks for reminding me,in xfs concurrent direct IO write to the same
> > offset is allowed.
> 
> 	ocfs2 as well, with the same sort of strategem (including across
> the cluster).
> 
Thanks for providing view from OCFS2 side. This is good to know.

> > > Direct IO semantics have always been that the application is allowed
> > > to overlap IO to the same range if it wants to. The result is
> > > undefined (just like issuing overlapping reads and writes to a disk
> > > at the same time) so it's the application's responsibility to avoid
> > > overlapping IO if it is a problem.
> > > 
> > 
> > I was thinking along the line to provide finer granularity lock to allow
> > concurrent direct IO to different offset/range, but to same offset, they
> > have to be serialized. If it's undefined behavior, i.e. overlapping is
> > allowed, then concurrent dio implementation is much easier. But not sure
> > if any apps currently using DIO aware of the ordering has to be done at
> > the application level. 
> 
> 	Oh dear God no.  One of the major DIO use cases is to tell the
> kernel, "I know I won't do that, so don't spend any effort protecting
> me."
> 
> Joel
> 

Looks like so -

So I think we could have a mode to turn on/off concurrent dio if the non
heavy duty applications relies on filesystem to take care of the
serialization.

Mingming




^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 15:19           ` Ted Ts'o
                               ` (2 preceding siblings ...)
  2011-04-01 21:43             ` Joel Becker
@ 2011-04-01 21:43             ` Joel Becker
  3 siblings, 0 replies; 166+ messages in thread
From: Joel Becker @ 2011-04-01 21:43 UTC (permalink / raw)
  To: Ted Ts'o, Ric Wheeler, Dave Chinner, lsf, linux-scsi@vger.kernel.org

On Fri, Apr 01, 2011 at 11:19:07AM -0400, Ted Ts'o wrote:
> The closest place that we have to any official documentation about
> O_DIRECT semantics is the open(2) man page in the Linux manpages, and
> it doesn't say anything about this.  It does give a recommendation
> against not mixing buffered and O_DIRECT accesses to the same file,
> but it does promise that things will work in that case.  (Even if it
> does, do we really want to make the promise that it will always work?)

	No, we do not.  Some OSes will silently turn buffered I/O into
direct I/O if another file already has it opened O_DIRECT.  Some OSes
will fail the write, or the open, or both, if it doesn't match the mode
of an existing fd.  Some just leave O_DIRECT and buffered access
inconsistent.
	I think that Linux should strive to make the mixed
buffered/direct case work; it's the nicest thing we can do.  But we
should not promise it.

Joel

-- 

Life's Little Instruction Book #24

	"Drink champagne for no reason at all."

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 15:19           ` Ted Ts'o
  2011-04-01 16:30             ` Amir Goldstein
@ 2011-04-01 21:43             ` Joel Becker
  2011-04-01 21:43             ` Joel Becker
  2011-04-01 21:43             ` Joel Becker
  3 siblings, 0 replies; 166+ messages in thread
From: Joel Becker @ 2011-04-01 21:43 UTC (permalink / raw)
  To: Ted Ts'o, Ric Wheeler, Dave Chinner, lsf, linux-scsi@vger.kernel.org

On Fri, Apr 01, 2011 at 11:19:07AM -0400, Ted Ts'o wrote:
> The closest place that we have to any official documentation about
> O_DIRECT semantics is the open(2) man page in the Linux manpages, and
> it doesn't say anything about this.  It does give a recommendation
> against not mixing buffered and O_DIRECT accesses to the same file,
> but it does promise that things will work in that case.  (Even if it
> does, do we really want to make the promise that it will always work?)

	No, we do not.  Some OSes will silently turn buffered I/O into
direct I/O if another file already has it opened O_DIRECT.  Some OSes
will fail the write, or the open, or both, if it doesn't match the mode
of an existing fd.  Some just leave O_DIRECT and buffered access
inconsistent.
	I think that Linux should strive to make the mixed
buffered/direct case work; it's the nicest thing we can do.  But we
should not promise it.

Joel

-- 

Life's Little Instruction Book #24

	"Drink champagne for no reason at all."

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 15:19           ` Ted Ts'o
  2011-04-01 16:30             ` Amir Goldstein
  2011-04-01 21:43             ` Joel Becker
@ 2011-04-01 21:43             ` Joel Becker
  2011-04-01 21:43             ` Joel Becker
  3 siblings, 0 replies; 166+ messages in thread
From: Joel Becker @ 2011-04-01 21:43 UTC (permalink / raw)
  To: Ted Ts'o, Ric Wheeler, Dave Chinner, lsf, linux-scsi@vger.kernel.org

On Fri, Apr 01, 2011 at 11:19:07AM -0400, Ted Ts'o wrote:
> The closest place that we have to any official documentation about
> O_DIRECT semantics is the open(2) man page in the Linux manpages, and
> it doesn't say anything about this.  It does give a recommendation
> against not mixing buffered and O_DIRECT accesses to the same file,
> but it does promise that things will work in that case.  (Even if it
> does, do we really want to make the promise that it will always work?)

	No, we do not.  Some OSes will silently turn buffered I/O into
direct I/O if another file already has it opened O_DIRECT.  Some OSes
will fail the write, or the open, or both, if it doesn't match the mode
of an existing fd.  Some just leave O_DIRECT and buffered access
inconsistent.
	I think that Linux should strive to make the mixed
buffered/direct case work; it's the nicest thing we can do.  But we
should not promise it.

Joel

-- 

Life's Little Instruction Book #24

	"Drink champagne for no reason at all."

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 16:30             ` Amir Goldstein
@ 2011-04-01 21:46               ` Joel Becker
  2011-04-02  3:26                 ` Amir Goldstein
  0 siblings, 1 reply; 166+ messages in thread
From: Joel Becker @ 2011-04-01 21:46 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Theodore Tso, Ric Wheeler, Dave Chinner, lsf, linux-scsi,
	James Bottomley, device-mapper development, linux-fsdevel,
	Ric Wheeler, Yongqiang Yang

On Fri, Apr 01, 2011 at 09:30:04AM -0700, Amir Goldstein wrote:
> when writing DIO to indirect mapped file holes, we fall back to buffered write
> (so we won't expose stale data in the case of a crash) concurrent DIO reads
> to that file (before data writeback) can expose stale data. right?
> do you consider this case mixing buffered and DIO access?
> do you consider that as a problem?

	I do not consider this 'mixing', nor do I consider it a problem.
ocfs2 does exactly this for holes, unwritten extents, and CoW.  It does
not violate the user's expectation that the data will be on disk when
the write(2) returns.
	Falling back to buffered on read(2) is a different story; the
caller wants the current state of the disk block, not five minutes ago.
So we can't do that.  But we also don't need to.
	O_DIRECT users that are worried about any possible space usage in
the page cache have already pre-allocated their disk blocks and don't
get here.

Joel

-- 

"Under capitalism, man exploits man.  Under Communism, it's just 
   the opposite."
				 - John Kenneth Galbraith

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 21:34           ` Mingming Cao
@ 2011-04-01 21:49             ` Joel Becker
  0 siblings, 0 replies; 166+ messages in thread
From: Joel Becker @ 2011-04-01 21:49 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi, device-mapper development

On Fri, Apr 01, 2011 at 02:34:26PM -0700, Mingming Cao wrote:
> > > I was thinking along the line to provide finer granularity lock to allow
> > > concurrent direct IO to different offset/range, but to same offset, they
> > > have to be serialized. If it's undefined behavior, i.e. overlapping is
> > > allowed, then concurrent dio implementation is much easier. But not sure
> > > if any apps currently using DIO aware of the ordering has to be done at
> > > the application level. 
> > 
> > 	Oh dear God no.  One of the major DIO use cases is to tell the
> > kernel, "I know I won't do that, so don't spend any effort protecting
> > me."
> > 
> > Joel
> > 
> 
> Looks like so -
> 
> So I think we could have a mode to turn on/off concurrent dio if the non
> heavy duty applications relies on filesystem to take care of the
> serialization.

	I would prefer to leave this complexity out.  If you must have
it, unsafe, concurrent DIO must be the default.  Let the people who
really want it turn on serialized DIO.

Joel

-- 

"Get right to the heart of matters.
 It's the heart that matters more."

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-01 17:18               ` Vivek Goyal
  2011-04-01 19:57                 ` [LSF]: fc_rport attributes to further populate HBAAPIv2 Giridhar Malavali
@ 2011-04-01 21:49                 ` Dave Chinner
  2011-04-02  7:33                   ` Greg Thelen
  2011-04-05 13:13                   ` Vivek Goyal
  1 sibling, 2 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-01 21:49 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
> On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
> > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
> > > There
> > > is no context (memcg or otherwise) given to the bdi flusher.  After
> > > the bdi flusher checks system-wide background limits, it uses the
> > > over_bg_limit list to find (and rotate) an over limit memcg.  Using
> > > the memcg, then the per memcg per bdi dirty inode list is walked to
> > > find inode pages to writeback.  Once the memcg dirty memory usage
> > > drops below the memcg-thresh, the memcg is removed from the global
> > > over_bg_limit list.
> > 
> > If you want controlled hand-off of writeback, you need to pass the
> > memcg that triggered the throttling directly to the bdi. You already
> > know what both the bdi and memcg that need writeback are. Yes, this
> > needs concurrency at the BDI flush level to handle, but see my
> > previous email in this thread for that....
> > 
> 
> Even with memcg being passed around I don't think that we get rid of
> global list lock.

You need to - we're getting rid of global lists and locks from
writeback for scalability reasons so any new functionality needs to
avoid global locks for the same reason.

> The reason being that inodes are not exclusive to
> the memory cgroups. Multiple memory cgroups might be writting to same
> inode. So inode still remains in the global list and memory cgroups
> kind of will have pointer to it.

So two dirty inode lists that have to be kept in sync? That doesn't
sound particularly appealing. Nor does it scale to an inode being
dirty in multiple cgroups

Besides, if you've got multiple memory groups dirtying the same
inode, then you cannot expect isolation between groups. I'd consider
this a broken configuration in this case - how often does this
actually happen, and what is the use case for supporting
it?

Besides, the implications are that we'd have to break up contiguous
IOs in the writeback path simply because two sequential pages are
associated with different groups. That's really nasty, and exactly
the opposite of all the write combining we try to do throughout the
writeback path. Supporting this is also a mess, as we'd have to touch
quite a lot of filesystem code (i.e. .writepage(s) inplementations)
to do this.

> So to start writeback on an inode
> you still shall have to take global lock, IIUC.

Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
in cgroup, and go from there? I mean, really all that cgroup-aware
writeback needs is just adding a new container for managing
dirty inodes in the writeback path and a method for selecting that
container for writeback, right? 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 21:46               ` Joel Becker
@ 2011-04-02  3:26                 ` Amir Goldstein
  0 siblings, 0 replies; 166+ messages in thread
From: Amir Goldstein @ 2011-04-02  3:26 UTC (permalink / raw)
  To: Joel Becker
  Cc: Theodore Tso, Ric Wheeler, Dave Chinner, lsf, linux-scsi,
	James Bottomley, device-mapper development, linux-fsdevel,
	Ric Wheeler, Yongqiang Yang

On Fri, Apr 1, 2011 at 2:46 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Fri, Apr 01, 2011 at 09:30:04AM -0700, Amir Goldstein wrote:
>> when writing DIO to indirect mapped file holes, we fall back to buffered write
>> (so we won't expose stale data in the case of a crash) concurrent DIO reads
>> to that file (before data writeback) can expose stale data. right?
>> do you consider this case mixing buffered and DIO access?
>> do you consider that as a problem?
>
>        I do not consider this 'mixing', nor do I consider it a problem.
> ocfs2 does exactly this for holes, unwritten extents, and CoW.  It does
> not violate the user's expectation that the data will be on disk when
> the write(2) returns.
>        Falling back to buffered on read(2) is a different story; the
> caller wants the current state of the disk block, not five minutes ago.
> So we can't do that.  But we also don't need to.

the issue is with DIO read exposing uninitialized data on disk
is a security issue.
it's not about giving the read what is expects to see.

>        O_DIRECT users that are worried about any possible space usage in
> the page cache have already pre-allocated their disk blocks and don't
> get here.
>
> Joel
>
> --
>
> "Under capitalism, man exploits man.  Under Communism, it's just
>   the opposite."
>                                 - John Kenneth Galbraith
>
>                        http://www.jlbec.org/
>                        jlbec@evilplan.org
>
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-01 21:49                 ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner
@ 2011-04-02  7:33                   ` Greg Thelen
  2011-04-02  7:34                     ` Greg Thelen
  2011-04-05 13:13                   ` Vivek Goyal
  1 sibling, 1 reply; 166+ messages in thread
From: Greg Thelen @ 2011-04-02  7:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 1, 2011 at 2:49 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
>> On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
>> > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
>> > > There
>> > > is no context (memcg or otherwise) given to the bdi flusher.  After
>> > > the bdi flusher checks system-wide background limits, it uses the
>> > > over_bg_limit list to find (and rotate) an over limit memcg.  Using
>> > > the memcg, then the per memcg per bdi dirty inode list is walked to
>> > > find inode pages to writeback.  Once the memcg dirty memory usage
>> > > drops below the memcg-thresh, the memcg is removed from the global
>> > > over_bg_limit list.
>> >
>> > If you want controlled hand-off of writeback, you need to pass the
>> > memcg that triggered the throttling directly to the bdi. You already
>> > know what both the bdi and memcg that need writeback are. Yes, this
>> > needs concurrency at the BDI flush level to handle, but see my
>> > previous email in this thread for that....
>> >
>>
>> Even with memcg being passed around I don't think that we get rid of
>> global list lock.
>
> You need to - we're getting rid of global lists and locks from
> writeback for scalability reasons so any new functionality needs to
> avoid global locks for the same reason.
>
>> The reason being that inodes are not exclusive to
>> the memory cgroups. Multiple memory cgroups might be writting to same
>> inode. So inode still remains in the global list and memory cgroups
>> kind of will have pointer to it.
>
> So two dirty inode lists that have to be kept in sync? That doesn't
> sound particularly appealing. Nor does it scale to an inode being
> dirty in multiple cgroups
>
> Besides, if you've got multiple memory groups dirtying the same
> inode, then you cannot expect isolation between groups. I'd consider
> this a broken configuration in this case - how often does this
> actually happen, and what is the use case for supporting
> it?
>
> Besides, the implications are that we'd have to break up contiguous
> IOs in the writeback path simply because two sequential pages are
> associated with different groups. That's really nasty, and exactly
> the opposite of all the write combining we try to do throughout the
> writeback path. Supporting this is also a mess, as we'd have to touch
> quite a lot of filesystem code (i.e. .writepage(s) inplementations)
> to do this.
>
>> So to start writeback on an inode
>> you still shall have to take global lock, IIUC.
>
> Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
> in cgroup, and go from there? I mean, really all that cgroup-aware
> writeback needs is just adding a new container for managing
> dirty inodes in the writeback path and a method for selecting that
> container for writeback, right?

I feel compelled to optimize for multiple cgroup's concurrently
dirtying an inode.  I see sharing as legitimate if a file is handed
off between jobs (cgroups).  But I do not see concurrent writing as a
common use case.
If anyone else feels this is a requirement, please speak up.
However, I would like the system tolerate sharing, though it does not
have to do so in optimal fashion.
Here are two approaches that do not optimize for sharing.  Though,
each approach tries to tolerate sharing without falling over.

Approach 1 (inspired from Dave's comments):

bdi ->1:N -> bdi_memcg -> 1:N -> bdi_memcg_dirty_inode

* when setting I_DIRTY in a memcg, insert inode into
bdi_memcg_dirty_inodes rather than b_dirty.

* when clearing I_DIRTY, remove inode from bdi_memcg_dirty_inode

* balance_dirty_pages() -> mem_cgroup_balance_dirty_pages(memcg, bdi)
if over bg limit, then queue memcg writeback to bdi flusher.
if over fg limit, then queue memcg-waiting description to bdi flusher
(IO less throttle).

* bdi_flusher(bdi):
using bdi,memcg write “some” of the bdi_memcg_dirty_inodes list.
“Some” is for fairness.

if bdi flusher is unable to bring memcg dirty usage below bg limit
after bdi_memcg_dirty_inodes list is empty, then need to do
“something” to make forward progress.  This could be caused by either
(a) memcg dirtying multiple bdi, or (b) a freeloading memcg dirtying
inodes previously dirtied by another memcg therefore the first
dirtying memcg is the one that will write it back.

Case A) If memcg dirties multiple bdi and then hits memcg bg limit,
queue bg writeback for the bdi being written to.  This may not
writeback other useful bdi.  System-wide background limit has similar
issue.  Could link bdi_memcg together and wakeup peer bdi.  For now,
defer the problem.

Case B) Dirtying another cgroup’s dirty inode.  While is not a common
use case, it could happen.  Options to avoid lockup:

+ When an inode becomes dirty shared, then move the inode from the per
bdi per memcg bdi_memcg_dirty_inode list to an otherwise unused bdi
wide b_unknown_memcg_dirty (promiscuous inode) list.
b_unknown_memcg_dirty is written when memcg writeback is invoked to
the bdi.  When an inode is cleaned and later redirtied it is added to
the normal bdi_memcg_dirty_inode_list.

+ Considered: when file page goes dirty, then do not account the dirty
page to the memcg where the page was charged, instead recharge the
page to the memcg that the inode was billed to (by inode i_memcg
field).  Inode would require a memcg reference that would make memcg
cleanup tricky.

+ Scan memcg lru for dirty file pages -> associated inodes -> bdi ->
writeback(bdi, inode)

+ What if memcg dirty limits are simply ignored in case-B?
Ineffective memcg background writeback would be queued as usage grows.
 Once memcg foreground limit is hit, then it would throttle waiting
for the ineffective background writeback to never catch up.  This
could wait indefinitely.  Could argue that the hung cgroup deserves
this for writing to another cgroup’s inode.  However, the other cgroup
could be the trouble maker who sneaks in to dirty the file and assume
dirty ownership before the innocent (now hung) cgroup starts writing.
I am not worried about making this optimal, just making forward
progress.  Fallback to scanning memcg lru looking for inode’s of dirty
pages.  This may be expensive, but should only happen with dirty
inodes shared between memcg.


Approach 2 : do something even simpler:

http://www.gossamer-threads.com/lists/linux/kernel/1347359#1347359

* __set_page_dirty()

either set i_memcg=memcg or i_memcg=~0
no memcg reference needed, i_memcg is not dereferenced

* mem_cgroup_balance_dirty_pages(memcg, bdi)

if over bg limit, then queue memcg to bdi for background writeback
if over fg limit, then queue memcg-waiting description to bdi flusher
(IO less throttle)

* bdi_flusher(bdi)

if doing memcg writeback, scan b_dirty filtering using
is_memcg_inode(inode,memcg), which checks i_memcg field: return
i_memcg in [~0, memcg]

if unable to get memcg below its dirty memory limit:

+ If memcg dirties multiple bdi and then hits memcg bg limit, queue bg
writeback for the bdi being written to.  This may not writeback other
useful bdi.  System-wide background limit has similar issue.

- con: If degree of sharing exceeds compile time max supported sharing
degree (likely 1), then ANY writeback (per-memcg or system-wide) will
writeback the over-shared inode.  This is undesirable because it
punishes innocent cgroups that are not abusively sharing.

- con: have to scan entire b_dirty list which may involve skipping
many inodes not in over-limit cgroup.  A memcg constantly hitting its
limit would monopolize a bdi flusher.


Both approaches are complicated by the (rare) possibility when an
inode has been been claimed (from a dirtying memcg perspective) by
memcg M1 but later M2 writes more dirty pages.  When M2 exceeds its
dirty limit it would be nice to find the inode, even if this requires
some extra work.

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-02  7:33                   ` Greg Thelen
@ 2011-04-02  7:34                     ` Greg Thelen
  0 siblings, 0 replies; 166+ messages in thread
From: Greg Thelen @ 2011-04-02  7:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel

On Sat, Apr 2, 2011 at 12:33 AM, Greg Thelen <gthelen@google.com> wrote:
> On Fri, Apr 1, 2011 at 2:49 PM, Dave Chinner <david@fromorbit.com> wrote:
>> On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
>>> On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
>>> > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
>>> > > There
>>> > > is no context (memcg or otherwise) given to the bdi flusher.  After
>>> > > the bdi flusher checks system-wide background limits, it uses the
>>> > > over_bg_limit list to find (and rotate) an over limit memcg.  Using
>>> > > the memcg, then the per memcg per bdi dirty inode list is walked to
>>> > > find inode pages to writeback.  Once the memcg dirty memory usage
>>> > > drops below the memcg-thresh, the memcg is removed from the global
>>> > > over_bg_limit list.
>>> >
>>> > If you want controlled hand-off of writeback, you need to pass the
>>> > memcg that triggered the throttling directly to the bdi. You already
>>> > know what both the bdi and memcg that need writeback are. Yes, this
>>> > needs concurrency at the BDI flush level to handle, but see my
>>> > previous email in this thread for that....
>>> >
>>>
>>> Even with memcg being passed around I don't think that we get rid of
>>> global list lock.
>>
>> You need to - we're getting rid of global lists and locks from
>> writeback for scalability reasons so any new functionality needs to
>> avoid global locks for the same reason.
>>
>>> The reason being that inodes are not exclusive to
>>> the memory cgroups. Multiple memory cgroups might be writting to same
>>> inode. So inode still remains in the global list and memory cgroups
>>> kind of will have pointer to it.
>>
>> So two dirty inode lists that have to be kept in sync? That doesn't
>> sound particularly appealing. Nor does it scale to an inode being
>> dirty in multiple cgroups
>>
>> Besides, if you've got multiple memory groups dirtying the same
>> inode, then you cannot expect isolation between groups. I'd consider
>> this a broken configuration in this case - how often does this
>> actually happen, and what is the use case for supporting
>> it?
>>
>> Besides, the implications are that we'd have to break up contiguous
>> IOs in the writeback path simply because two sequential pages are
>> associated with different groups. That's really nasty, and exactly
>> the opposite of all the write combining we try to do throughout the
>> writeback path. Supporting this is also a mess, as we'd have to touch
>> quite a lot of filesystem code (i.e. .writepage(s) inplementations)
>> to do this.
>>
>>> So to start writeback on an inode
>>> you still shall have to take global lock, IIUC.
>>
>> Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
>> in cgroup, and go from there? I mean, really all that cgroup-aware
>> writeback needs is just adding a new container for managing
>> dirty inodes in the writeback path and a method for selecting that
>> container for writeback, right?
>
> I feel compelled to optimize for multiple cgroup's concurrently

Correction: I do NOT feel compelled to optimizing for sharing...

> dirtying an inode.  I see sharing as legitimate if a file is handed
> off between jobs (cgroups).  But I do not see concurrent writing as a
> common use case.
> If anyone else feels this is a requirement, please speak up.
> However, I would like the system tolerate sharing, though it does not
> have to do so in optimal fashion.
> Here are two approaches that do not optimize for sharing.  Though,
> each approach tries to tolerate sharing without falling over.
>
> Approach 1 (inspired from Dave's comments):
>
> bdi ->1:N -> bdi_memcg -> 1:N -> bdi_memcg_dirty_inode
>
> * when setting I_DIRTY in a memcg, insert inode into
> bdi_memcg_dirty_inodes rather than b_dirty.
>
> * when clearing I_DIRTY, remove inode from bdi_memcg_dirty_inode
>
> * balance_dirty_pages() -> mem_cgroup_balance_dirty_pages(memcg, bdi)
> if over bg limit, then queue memcg writeback to bdi flusher.
> if over fg limit, then queue memcg-waiting description to bdi flusher
> (IO less throttle).
>
> * bdi_flusher(bdi):
> using bdi,memcg write “some” of the bdi_memcg_dirty_inodes list.
> “Some” is for fairness.
>
> if bdi flusher is unable to bring memcg dirty usage below bg limit
> after bdi_memcg_dirty_inodes list is empty, then need to do
> “something” to make forward progress.  This could be caused by either
> (a) memcg dirtying multiple bdi, or (b) a freeloading memcg dirtying
> inodes previously dirtied by another memcg therefore the first
> dirtying memcg is the one that will write it back.
>
> Case A) If memcg dirties multiple bdi and then hits memcg bg limit,
> queue bg writeback for the bdi being written to.  This may not
> writeback other useful bdi.  System-wide background limit has similar
> issue.  Could link bdi_memcg together and wakeup peer bdi.  For now,
> defer the problem.
>
> Case B) Dirtying another cgroup’s dirty inode.  While is not a common
> use case, it could happen.  Options to avoid lockup:
>
> + When an inode becomes dirty shared, then move the inode from the per
> bdi per memcg bdi_memcg_dirty_inode list to an otherwise unused bdi
> wide b_unknown_memcg_dirty (promiscuous inode) list.
> b_unknown_memcg_dirty is written when memcg writeback is invoked to
> the bdi.  When an inode is cleaned and later redirtied it is added to
> the normal bdi_memcg_dirty_inode_list.
>
> + Considered: when file page goes dirty, then do not account the dirty
> page to the memcg where the page was charged, instead recharge the
> page to the memcg that the inode was billed to (by inode i_memcg
> field).  Inode would require a memcg reference that would make memcg
> cleanup tricky.
>
> + Scan memcg lru for dirty file pages -> associated inodes -> bdi ->
> writeback(bdi, inode)
>
> + What if memcg dirty limits are simply ignored in case-B?
> Ineffective memcg background writeback would be queued as usage grows.
>  Once memcg foreground limit is hit, then it would throttle waiting
> for the ineffective background writeback to never catch up.  This
> could wait indefinitely.  Could argue that the hung cgroup deserves
> this for writing to another cgroup’s inode.  However, the other cgroup
> could be the trouble maker who sneaks in to dirty the file and assume
> dirty ownership before the innocent (now hung) cgroup starts writing.
> I am not worried about making this optimal, just making forward
> progress.  Fallback to scanning memcg lru looking for inode’s of dirty
> pages.  This may be expensive, but should only happen with dirty
> inodes shared between memcg.
>
>
> Approach 2 : do something even simpler:
>
> http://www.gossamer-threads.com/lists/linux/kernel/1347359#1347359
>
> * __set_page_dirty()
>
> either set i_memcg=memcg or i_memcg=~0
> no memcg reference needed, i_memcg is not dereferenced
>
> * mem_cgroup_balance_dirty_pages(memcg, bdi)
>
> if over bg limit, then queue memcg to bdi for background writeback
> if over fg limit, then queue memcg-waiting description to bdi flusher
> (IO less throttle)
>
> * bdi_flusher(bdi)
>
> if doing memcg writeback, scan b_dirty filtering using
> is_memcg_inode(inode,memcg), which checks i_memcg field: return
> i_memcg in [~0, memcg]
>
> if unable to get memcg below its dirty memory limit:
>
> + If memcg dirties multiple bdi and then hits memcg bg limit, queue bg
> writeback for the bdi being written to.  This may not writeback other
> useful bdi.  System-wide background limit has similar issue.
>
> - con: If degree of sharing exceeds compile time max supported sharing
> degree (likely 1), then ANY writeback (per-memcg or system-wide) will
> writeback the over-shared inode.  This is undesirable because it
> punishes innocent cgroups that are not abusively sharing.
>
> - con: have to scan entire b_dirty list which may involve skipping
> many inodes not in over-limit cgroup.  A memcg constantly hitting its
> limit would monopolize a bdi flusher.
>
>
> Both approaches are complicated by the (rare) possibility when an
> inode has been been claimed (from a dirtying memcg perspective) by
> memcg M1 but later M2 writes more dirty pages.  When M2 exceeds its
> dirty limit it would be nice to find the inode, even if this requires
> some extra work.
>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-01 21:49                 ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner
  2011-04-02  7:33                   ` Greg Thelen
@ 2011-04-05 13:13                   ` Vivek Goyal
  2011-04-05 22:56                     ` Dave Chinner
  1 sibling, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-05 13:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote:
> On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
> > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
> > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
> > > > There
> > > > is no context (memcg or otherwise) given to the bdi flusher.  After
> > > > the bdi flusher checks system-wide background limits, it uses the
> > > > over_bg_limit list to find (and rotate) an over limit memcg.  Using
> > > > the memcg, then the per memcg per bdi dirty inode list is walked to
> > > > find inode pages to writeback.  Once the memcg dirty memory usage
> > > > drops below the memcg-thresh, the memcg is removed from the global
> > > > over_bg_limit list.
> > > 
> > > If you want controlled hand-off of writeback, you need to pass the
> > > memcg that triggered the throttling directly to the bdi. You already
> > > know what both the bdi and memcg that need writeback are. Yes, this
> > > needs concurrency at the BDI flush level to handle, but see my
> > > previous email in this thread for that....
> > > 
> > 
> > Even with memcg being passed around I don't think that we get rid of
> > global list lock.
> 
> You need to - we're getting rid of global lists and locks from
> writeback for scalability reasons so any new functionality needs to
> avoid global locks for the same reason.

Ok.

> 
> > The reason being that inodes are not exclusive to
> > the memory cgroups. Multiple memory cgroups might be writting to same
> > inode. So inode still remains in the global list and memory cgroups
> > kind of will have pointer to it.
> 
> So two dirty inode lists that have to be kept in sync? That doesn't
> sound particularly appealing. Nor does it scale to an inode being
> dirty in multiple cgroups
> 
> Besides, if you've got multiple memory groups dirtying the same
> inode, then you cannot expect isolation between groups. I'd consider
> this a broken configuration in this case - how often does this
> actually happen, and what is the use case for supporting
> it?
> 
> Besides, the implications are that we'd have to break up contiguous
> IOs in the writeback path simply because two sequential pages are
> associated with different groups. That's really nasty, and exactly
> the opposite of all the write combining we try to do throughout the
> writeback path. Supporting this is also a mess, as we'd have to touch
> quite a lot of filesystem code (i.e. .writepage(s) inplementations)
> to do this.

We did not plan on breaking up contigous IO even if these belonged to
different cgroup for performance reason. So probably can live with some
inaccuracy and just trigger the writeback for one inode even if that
meant that it could writeback the pages of some other cgroups doing IO
on that inode.

> 
> > So to start writeback on an inode
> > you still shall have to take global lock, IIUC.
> 
> Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
> in cgroup, and go from there? I mean, really all that cgroup-aware
> writeback needs is just adding a new container for managing
> dirty inodes in the writeback path and a method for selecting that
> container for writeback, right? 

This was the initial design where one inode is associated with one cgroup
even if process from multiple cgroups are doing IO to same inode. Then
somebody raised the concern that it probably is too coarse.

IMHO, as a first step, associating inode to one cgroup exclusively
simplifies the things considerably and we can target that first.

So yes, I agree that bdi->list_of_dirty_cgroups->list_of_drity_inodes
makes sense and is relatively simple way of doing things at the expense
of not being accurate for shared inode case.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-05 13:13                   ` Vivek Goyal
@ 2011-04-05 22:56                     ` Dave Chinner
  2011-04-06 14:49                       ` Curt Wohlgemuth
  2011-04-06 15:37                       ` Vivek Goyal
  0 siblings, 2 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-05 22:56 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
> On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote:
> > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
> > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
> > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
> > > > > There
> > > > > is no context (memcg or otherwise) given to the bdi flusher.  After
> > > > > the bdi flusher checks system-wide background limits, it uses the
> > > > > over_bg_limit list to find (and rotate) an over limit memcg.  Using
> > > > > the memcg, then the per memcg per bdi dirty inode list is walked to
> > > > > find inode pages to writeback.  Once the memcg dirty memory usage
> > > > > drops below the memcg-thresh, the memcg is removed from the global
> > > > > over_bg_limit list.
> > > > 
> > > > If you want controlled hand-off of writeback, you need to pass the
> > > > memcg that triggered the throttling directly to the bdi. You already
> > > > know what both the bdi and memcg that need writeback are. Yes, this
> > > > needs concurrency at the BDI flush level to handle, but see my
> > > > previous email in this thread for that....
> > > > 
> > > 
> > > Even with memcg being passed around I don't think that we get rid of
> > > global list lock.
.....
> > > The reason being that inodes are not exclusive to
> > > the memory cgroups. Multiple memory cgroups might be writting to same
> > > inode. So inode still remains in the global list and memory cgroups
> > > kind of will have pointer to it.
> > 
> > So two dirty inode lists that have to be kept in sync? That doesn't
> > sound particularly appealing. Nor does it scale to an inode being
> > dirty in multiple cgroups
> > 
> > Besides, if you've got multiple memory groups dirtying the same
> > inode, then you cannot expect isolation between groups. I'd consider
> > this a broken configuration in this case - how often does this
> > actually happen, and what is the use case for supporting
> > it?
> > 
> > Besides, the implications are that we'd have to break up contiguous
> > IOs in the writeback path simply because two sequential pages are
> > associated with different groups. That's really nasty, and exactly
> > the opposite of all the write combining we try to do throughout the
> > writeback path. Supporting this is also a mess, as we'd have to touch
> > quite a lot of filesystem code (i.e. .writepage(s) inplementations)
> > to do this.
> 
> We did not plan on breaking up contigous IO even if these belonged to
> different cgroup for performance reason. So probably can live with some
> inaccuracy and just trigger the writeback for one inode even if that
> meant that it could writeback the pages of some other cgroups doing IO
> on that inode.

Which, to me, violates the principle of isolation as it's been
described that this functionality is supposed to provide.

It also means you will have handle the case of a cgroup over a
throttle limit and no inodes on it's dirty list. It's not a case of
"probably can live with" the resultant mess, the mess will occur and
so handling it needs to be designed in from the start.

> > > So to start writeback on an inode
> > > you still shall have to take global lock, IIUC.
> > 
> > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
> > in cgroup, and go from there? I mean, really all that cgroup-aware
> > writeback needs is just adding a new container for managing
> > dirty inodes in the writeback path and a method for selecting that
> > container for writeback, right? 
> 
> This was the initial design where one inode is associated with one cgroup
> even if process from multiple cgroups are doing IO to same inode. Then
> somebody raised the concern that it probably is too coarse.

Got a pointer?

> IMHO, as a first step, associating inode to one cgroup exclusively
> simplifies the things considerably and we can target that first.
> 
> So yes, I agree that bdi->list_of_dirty_cgroups->list_of_drity_inodes
> makes sense and is relatively simple way of doing things at the expense
> of not being accurate for shared inode case.

Can someone describe a valid shared inode use case? If not, we
should not even consider it as a requirement and explicitly document
it as a "not supported" use case.

As it is, I'm hearing different ideas and requirements from the
people working on the memcg side of this vs the IO controller side.
Perhaps the first step is documenting a common set of functional
requirements that demonstrates how everything will play well
together?

e.g. Defining what isolation means, when and if it can be violated,
how violations are handled, when inodes in multiple memcgs are
acceptable and how they need to be accounted and handled by the
writepage path, how memcg's over the dirty threshold with no dirty
inodes are to be handled, how metadata IO is going to be handled by
IO controllers, what kswapd is going to do writeback when the pages
it's trying to writeback during a critical low memory event belong
to a cgroup that is throttled at the IO level, etc.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-05 22:56                     ` Dave Chinner
@ 2011-04-06 14:49                       ` Curt Wohlgemuth
  2011-04-06 15:39                         ` Vivek Goyal
  2011-04-06 23:08                         ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner
  2011-04-06 15:37                       ` Vivek Goyal
  1 sibling, 2 replies; 166+ messages in thread
From: Curt Wohlgemuth @ 2011-04-06 14:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vivek Goyal, Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Tue, Apr 5, 2011 at 3:56 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
>> On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote:
>> > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
>> > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
>> > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
>> > > > > There
>> > > > > is no context (memcg or otherwise) given to the bdi flusher.  After
>> > > > > the bdi flusher checks system-wide background limits, it uses the
>> > > > > over_bg_limit list to find (and rotate) an over limit memcg.  Using
>> > > > > the memcg, then the per memcg per bdi dirty inode list is walked to
>> > > > > find inode pages to writeback.  Once the memcg dirty memory usage
>> > > > > drops below the memcg-thresh, the memcg is removed from the global
>> > > > > over_bg_limit list.
>> > > >
>> > > > If you want controlled hand-off of writeback, you need to pass the
>> > > > memcg that triggered the throttling directly to the bdi. You already
>> > > > know what both the bdi and memcg that need writeback are. Yes, this
>> > > > needs concurrency at the BDI flush level to handle, but see my
>> > > > previous email in this thread for that....
>> > > >
>> > >
>> > > Even with memcg being passed around I don't think that we get rid of
>> > > global list lock.
> .....
>> > > The reason being that inodes are not exclusive to
>> > > the memory cgroups. Multiple memory cgroups might be writting to same
>> > > inode. So inode still remains in the global list and memory cgroups
>> > > kind of will have pointer to it.
>> >
>> > So two dirty inode lists that have to be kept in sync? That doesn't
>> > sound particularly appealing. Nor does it scale to an inode being
>> > dirty in multiple cgroups
>> >
>> > Besides, if you've got multiple memory groups dirtying the same
>> > inode, then you cannot expect isolation between groups. I'd consider
>> > this a broken configuration in this case - how often does this
>> > actually happen, and what is the use case for supporting
>> > it?
>> >
>> > Besides, the implications are that we'd have to break up contiguous
>> > IOs in the writeback path simply because two sequential pages are
>> > associated with different groups. That's really nasty, and exactly
>> > the opposite of all the write combining we try to do throughout the
>> > writeback path. Supporting this is also a mess, as we'd have to touch
>> > quite a lot of filesystem code (i.e. .writepage(s) inplementations)
>> > to do this.
>>
>> We did not plan on breaking up contigous IO even if these belonged to
>> different cgroup for performance reason. So probably can live with some
>> inaccuracy and just trigger the writeback for one inode even if that
>> meant that it could writeback the pages of some other cgroups doing IO
>> on that inode.
>
> Which, to me, violates the principle of isolation as it's been
> described that this functionality is supposed to provide.
>
> It also means you will have handle the case of a cgroup over a
> throttle limit and no inodes on it's dirty list. It's not a case of
> "probably can live with" the resultant mess, the mess will occur and
> so handling it needs to be designed in from the start.
>
>> > > So to start writeback on an inode
>> > > you still shall have to take global lock, IIUC.
>> >
>> > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
>> > in cgroup, and go from there? I mean, really all that cgroup-aware
>> > writeback needs is just adding a new container for managing
>> > dirty inodes in the writeback path and a method for selecting that
>> > container for writeback, right?
>>
>> This was the initial design where one inode is associated with one cgroup
>> even if process from multiple cgroups are doing IO to same inode. Then
>> somebody raised the concern that it probably is too coarse.
>
> Got a pointer?
>
>> IMHO, as a first step, associating inode to one cgroup exclusively
>> simplifies the things considerably and we can target that first.
>>
>> So yes, I agree that bdi->list_of_dirty_cgroups->list_of_drity_inodes
>> makes sense and is relatively simple way of doing things at the expense
>> of not being accurate for shared inode case.
>
> Can someone describe a valid shared inode use case? If not, we
> should not even consider it as a requirement and explicitly document
> it as a "not supported" use case.

At the very least, when a task is moved from one cgroup to another,
we've got a shared inode case.  This probably won't happen more than
once for most tasks, but it will likely be common.

Curt

>
> As it is, I'm hearing different ideas and requirements from the
> people working on the memcg side of this vs the IO controller side.
> Perhaps the first step is documenting a common set of functional
> requirements that demonstrates how everything will play well
> together?
>
> e.g. Defining what isolation means, when and if it can be violated,
> how violations are handled, when inodes in multiple memcgs are
> acceptable and how they need to be accounted and handled by the
> writepage path, how memcg's over the dirty threshold with no dirty
> inodes are to be handled, how metadata IO is going to be handled by
> IO controllers, what kswapd is going to do writeback when the pages
> it's trying to writeback during a critical low memory event belong
> to a cgroup that is throttled at the IO level, etc.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-05 22:56                     ` Dave Chinner
  2011-04-06 14:49                       ` Curt Wohlgemuth
@ 2011-04-06 15:37                       ` Vivek Goyal
  2011-04-06 16:08                         ` Vivek Goyal
  2011-04-06 23:50                         ` Dave Chinner
  1 sibling, 2 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-06 15:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote:
> On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
> > On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote:
> > > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
> > > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
> > > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
> > > > > > There
> > > > > > is no context (memcg or otherwise) given to the bdi flusher.  After
> > > > > > the bdi flusher checks system-wide background limits, it uses the
> > > > > > over_bg_limit list to find (and rotate) an over limit memcg.  Using
> > > > > > the memcg, then the per memcg per bdi dirty inode list is walked to
> > > > > > find inode pages to writeback.  Once the memcg dirty memory usage
> > > > > > drops below the memcg-thresh, the memcg is removed from the global
> > > > > > over_bg_limit list.
> > > > > 
> > > > > If you want controlled hand-off of writeback, you need to pass the
> > > > > memcg that triggered the throttling directly to the bdi. You already
> > > > > know what both the bdi and memcg that need writeback are. Yes, this
> > > > > needs concurrency at the BDI flush level to handle, but see my
> > > > > previous email in this thread for that....
> > > > > 
> > > > 
> > > > Even with memcg being passed around I don't think that we get rid of
> > > > global list lock.
> .....
> > > > The reason being that inodes are not exclusive to
> > > > the memory cgroups. Multiple memory cgroups might be writting to same
> > > > inode. So inode still remains in the global list and memory cgroups
> > > > kind of will have pointer to it.
> > > 
> > > So two dirty inode lists that have to be kept in sync? That doesn't
> > > sound particularly appealing. Nor does it scale to an inode being
> > > dirty in multiple cgroups
> > > 
> > > Besides, if you've got multiple memory groups dirtying the same
> > > inode, then you cannot expect isolation between groups. I'd consider
> > > this a broken configuration in this case - how often does this
> > > actually happen, and what is the use case for supporting
> > > it?
> > > 
> > > Besides, the implications are that we'd have to break up contiguous
> > > IOs in the writeback path simply because two sequential pages are
> > > associated with different groups. That's really nasty, and exactly
> > > the opposite of all the write combining we try to do throughout the
> > > writeback path. Supporting this is also a mess, as we'd have to touch
> > > quite a lot of filesystem code (i.e. .writepage(s) inplementations)
> > > to do this.
> > 
> > We did not plan on breaking up contigous IO even if these belonged to
> > different cgroup for performance reason. So probably can live with some
> > inaccuracy and just trigger the writeback for one inode even if that
> > meant that it could writeback the pages of some other cgroups doing IO
> > on that inode.
> 
> Which, to me, violates the principle of isolation as it's been
> described that this functionality is supposed to provide.
> 
> It also means you will have handle the case of a cgroup over a
> throttle limit and no inodes on it's dirty list. It's not a case of
> "probably can live with" the resultant mess, the mess will occur and
> so handling it needs to be designed in from the start.

This behavior can happen due to shared page accounting. One possible
way to mitigate this problme is to traverse through LRU list of pages
of memcg and find an inode to do the writebak. 

> 
> > > > So to start writeback on an inode
> > > > you still shall have to take global lock, IIUC.
> > > 
> > > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
> > > in cgroup, and go from there? I mean, really all that cgroup-aware
> > > writeback needs is just adding a new container for managing
> > > dirty inodes in the writeback path and a method for selecting that
> > > container for writeback, right? 
> > 
> > This was the initial design where one inode is associated with one cgroup
> > even if process from multiple cgroups are doing IO to same inode. Then
> > somebody raised the concern that it probably is too coarse.
> 
> Got a pointer?

This was briefly discussed at last LSF and some people seemed to like the
idea of associated inode with one cgroup. I guess database would be a 
case where a large file can be shared by multiple processes? Now one
can argue that why to put all these processes in separate cgroups.

Anyway, I am not arguing for solving the case of shared inodes. I personally
prefer first simple step of inode being associated with one memcg and if
we run into issues due to shared inodes, then look into how to solve this
problem. 

> 
> > IMHO, as a first step, associating inode to one cgroup exclusively
> > simplifies the things considerably and we can target that first.
> > 
> > So yes, I agree that bdi->list_of_dirty_cgroups->list_of_drity_inodes
> > makes sense and is relatively simple way of doing things at the expense
> > of not being accurate for shared inode case.
> 
> Can someone describe a valid shared inode use case? If not, we
> should not even consider it as a requirement and explicitly document
> it as a "not supported" use case.

I asked same question yesterday at LSF seesion and we don't have any
good workload example yet.

> 
> As it is, I'm hearing different ideas and requirements from the
> people working on the memcg side of this vs the IO controller side.
> Perhaps the first step is documenting a common set of functional
> requirements that demonstrates how everything will play well
> together?
> 
> e.g. Defining what isolation means, when and if it can be violated,
> how violations are handled,

> when inodes in multiple memcgs are
> acceptable and how they need to be accounted and handled by the
> writepage path,

After yesterday's discussion it looked like people agreed that to 
begin with keep it simple and maintain the notion of one inode on
one memcg list. So instead of inode being on global bdi dirty list
it will be on per memecg per bdi dirty list.

Greg would you like to elaborate more on the design.

>how memcg's over the dirty threshold with no dirty
> inodes are to be handled,

As I said above, one of the proposals was that traverse through LRU
list of memcg if memcg is above dirty ratio and there are no inodes
on that memcg.

May be there are other better ways to handle this.

> how metadata IO is going to be handled by
> IO controllers,

So IO controller provides two mechanisms.

- IO throttling(bytes_per_second, io_per_second interface)
- Proportional weight disk sharing

In case of proportional weight disk sharing, we don't run into issues of
priority inversion and metadata handing should not be a concern.

For throttling case, apart from metadata, I found that with simple
throttling of data I ran into issues with journalling with ext4 mounuted
in ordered mode. So it was suggested that WRITE IO throttling should
not be done at device level instead try to do it in higher layers,
possibly balance_dirty_pages() and throttle process early.

So yes, I agree that little more documentation and more clarity on this
would be good. All this cgroup aware writeback primarily is being done
for CFQ's proportional disk sharing at the moment.

> what kswapd is going to do writeback when the pages
> it's trying to writeback during a critical low memory event belong
> to a cgroup that is throttled at the IO level, etc.

Throttling will move up so kswapd will not be throttled. Even today,
kswapd is part of root group and we do not suggest throttling root group.

For the case of proportional disk sharing, we will probably account
IO to respective cgroups (pages submitted by kswapd) and that should
not flush to disk fairly fast and should not block for long time as it is
work consering mechanism.

Do you see an issue with kswapd IO being accounted to respective cgroups
for proportional IO. For throttling case, all IO would go to root group
which is unthrottled and real issue of dirtying too many pages by
processes will be handled by throttling processes when they are dirtying
page cache.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-06 14:49                       ` Curt Wohlgemuth
@ 2011-04-06 15:39                         ` Vivek Goyal
  2011-04-06 19:49                           ` Greg Thelen
  2011-04-06 23:07                           ` [Lsf] IO less throttling and cgroup aware writeback Greg Thelen
  2011-04-06 23:08                         ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner
  1 sibling, 2 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-06 15:39 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Dave Chinner, Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote:

[..]
> > Can someone describe a valid shared inode use case? If not, we
> > should not even consider it as a requirement and explicitly document
> > it as a "not supported" use case.
> 
> At the very least, when a task is moved from one cgroup to another,
> we've got a shared inode case.  This probably won't happen more than
> once for most tasks, but it will likely be common.

I am hoping that for such cases sooner or later inode movement will
automatically take place. At some point of time, inode will be clean
and no more on memcg_bdi list. And when it is dirtied again, I am 
hoping it will be queued on new groups's list and not on old group's
list? Greg?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-06 15:37                       ` Vivek Goyal
@ 2011-04-06 16:08                         ` Vivek Goyal
  2011-04-06 17:10                           ` Jan Kara
  2011-04-06 23:50                         ` Dave Chinner
  1 sibling, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-06 16:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote:

[..]
> > what kswapd is going to do writeback when the pages
> > it's trying to writeback during a critical low memory event belong
> > to a cgroup that is throttled at the IO level, etc.
> 
> Throttling will move up so kswapd will not be throttled. Even today,
> kswapd is part of root group and we do not suggest throttling root group.
> 
> For the case of proportional disk sharing, we will probably account
> IO to respective cgroups (pages submitted by kswapd) and that should
> not flush to disk fairly fast and should not block for long time as it is
> work consering mechanism.
> 
> Do you see an issue with kswapd IO being accounted to respective cgroups
> for proportional IO. For throttling case, all IO would go to root group
> which is unthrottled and real issue of dirtying too many pages by
> processes will be handled by throttling processes when they are dirtying
> page cache.

Or may be it is not a good idea to try to account pages to associated
cgroups when memory is low and kswapd is doing IO. We can probably mark 
kswapd with some flag and account all IO to root group even for
proportional weight mechanism. In this case isolation will be broken but
I guess one can not do much. To avoid this situation, one should not
have allowed too many writes and I think that's where low dirty ratio
can come into the picture.

I thought one of the use case of this was that a high prio buffered
writer should be able to do more writes than a low prio writer. That
I think should be possible by accounting flusher writes.

Dave you have any suggestions on how to handle this?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-06 16:08                         ` Vivek Goyal
@ 2011-04-06 17:10                           ` Jan Kara
  2011-04-06 17:14                             ` Curt Wohlgemuth
  2011-04-08  1:58                             ` Dave Chinner
  0 siblings, 2 replies; 166+ messages in thread
From: Jan Kara @ 2011-04-06 17:10 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Dave Chinner, James Bottomley, lsf, linux-fsdevel

On Wed 06-04-11 12:08:05, Vivek Goyal wrote:
> On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote:
> 
> [..]
> > > what kswapd is going to do writeback when the pages
> > > it's trying to writeback during a critical low memory event belong
> > > to a cgroup that is throttled at the IO level, etc.
> > 
> > Throttling will move up so kswapd will not be throttled. Even today,
> > kswapd is part of root group and we do not suggest throttling root group.
> > 
> > For the case of proportional disk sharing, we will probably account
> > IO to respective cgroups (pages submitted by kswapd) and that should
> > not flush to disk fairly fast and should not block for long time as it is
> > work consering mechanism.
> > 
> > Do you see an issue with kswapd IO being accounted to respective cgroups
> > for proportional IO. For throttling case, all IO would go to root group
> > which is unthrottled and real issue of dirtying too many pages by
> > processes will be handled by throttling processes when they are dirtying
> > page cache.
> 
> Or may be it is not a good idea to try to account pages to associated
> cgroups when memory is low and kswapd is doing IO. We can probably mark 
> kswapd with some flag and account all IO to root group even for
> proportional weight mechanism. In this case isolation will be broken but
> I guess one can not do much. To avoid this situation, one should not
> have allowed too many writes and I think that's where low dirty ratio
> can come into the picture.
  Well, I wouldn't bother too much with kswapd handling. MM people plan to
get rid of writeback from direct reclaim and just remove the dirty page
from LRU and recycle it once flusher thread writes it...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-06 17:10                           ` Jan Kara
@ 2011-04-06 17:14                             ` Curt Wohlgemuth
  2011-04-08  1:58                             ` Dave Chinner
  1 sibling, 0 replies; 166+ messages in thread
From: Curt Wohlgemuth @ 2011-04-06 17:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: Vivek Goyal, Dave Chinner, James Bottomley, lsf, linux-fsdevel

On Wed, Apr 6, 2011 at 10:10 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 06-04-11 12:08:05, Vivek Goyal wrote:
>> On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote:
>>
>> [..]
>> > > what kswapd is going to do writeback when the pages
>> > > it's trying to writeback during a critical low memory event belong
>> > > to a cgroup that is throttled at the IO level, etc.
>> >
>> > Throttling will move up so kswapd will not be throttled. Even today,
>> > kswapd is part of root group and we do not suggest throttling root group.
>> >
>> > For the case of proportional disk sharing, we will probably account
>> > IO to respective cgroups (pages submitted by kswapd) and that should
>> > not flush to disk fairly fast and should not block for long time as it is
>> > work consering mechanism.
>> >
>> > Do you see an issue with kswapd IO being accounted to respective cgroups
>> > for proportional IO. For throttling case, all IO would go to root group
>> > which is unthrottled and real issue of dirtying too many pages by
>> > processes will be handled by throttling processes when they are dirtying
>> > page cache.
>>
>> Or may be it is not a good idea to try to account pages to associated
>> cgroups when memory is low and kswapd is doing IO. We can probably mark
>> kswapd with some flag and account all IO to root group even for
>> proportional weight mechanism. In this case isolation will be broken but
>> I guess one can not do much. To avoid this situation, one should not
>> have allowed too many writes and I think that's where low dirty ratio
>> can come into the picture.
>  Well, I wouldn't bother too much with kswapd handling. MM people plan to
> get rid of writeback from direct reclaim and just remove the dirty page
> from LRU and recycle it once flusher thread writes it...

But still, it matters which memcg is "responsible" for the background
writeout from direct reclaim.  One could argue that direct reclaim
should just specify the root cgroup...

Curt

>
>                                                                Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-06 15:39                         ` Vivek Goyal
@ 2011-04-06 19:49                           ` Greg Thelen
  2011-04-06 23:07                           ` [Lsf] IO less throttling and cgroup aware writeback Greg Thelen
  1 sibling, 0 replies; 166+ messages in thread
From: Greg Thelen @ 2011-04-06 19:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Curt Wohlgemuth, Dave Chinner, James Bottomley, lsf, linux-fsdevel

On Wed, Apr 6, 2011 at 8:39 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote:
>
> [..]
>> > Can someone describe a valid shared inode use case? If not, we
>> > should not even consider it as a requirement and explicitly document
>> > it as a "not supported" use case.
>>
>> At the very least, when a task is moved from one cgroup to another,
>> we've got a shared inode case.  This probably won't happen more than
>> once for most tasks, but it will likely be common.
>
> I am hoping that for such cases sooner or later inode movement will
> automatically take place. At some point of time, inode will be clean
> and no more on memcg_bdi list. And when it is dirtied again, I am
> hoping it will be queued on new groups's list and not on old group's
> list? Greg?
>
> Thanks
> Vivek

When an inode is marked dirty, current->memcg is used to determine
which per memcg b_dirty list within the bdi is used to queue the
inode.  When the inode is marked clean, then the inode is removed from
the per memcg b_dirty list.  So, as Vivek said, when a process is
migrated between memcg, then the previously dirtied inodes will not be
moved.  Once such inodes are marked clean, and the re-dirtied, then
they will be requeued to the correct per memcg dirty inode list.

Here's an overview of the approach, which is assumes inode sharing is
rare but possible.  Thus, such sharing is tolerated (no live locks,
etc) but not optimized.

bdi -> 1:N -> bdi_memcg -> 1:N -> inode

mark_inode_dirty(inode)
    If I_DIRTY is clear, set I_DIRTY and inserted inode into bdi_memcg->b_dirty
    using current->memcg as a key to select the correct list.
        This will require memory allocation of bdi_memcg, if this is the first
        inode within the bdi,memcg.  If the allocation fails (rare,
but possible),
        then fallback to adding the memcg to the root cgroup dirty inode list.
    If I_DIRTY is already set, then do nothing.

When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
if the list is now empty.

balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
    if over bg limit, then
        set bdi_memcg->b_over_limit
            If there is no bdi_memcg (because all inodes of current’s memcg
            dirty pages where first dirtied by other memcg) then memcg lru
            to find inode and call writeback_single_inode().  This is to handle
            uncommon sharing.
        reference memcg for bdi flusher
        awake bdi flusher
    if over fg limit
        IO-full: write bdi_memcg directly (if empty use memcg lru to
find inode to write)
        IO-less: queue memcg-waiting description to bdi flusher.

bdi_flusher(bdi):
    process work queue, which will not include any memcg flusher work - just
    like current code.

    once work queue is empty:
        wb_check_old_data_flush():
            write old inodes from each of the per-memcg dirty lists.

        wb_check_background_flush():
            if any of bdi_memcg->b_over_limit is set, then write
            bdi_memcg->b_dirty inodes until under limit.

                After writing some data, recheck to see if memcg is still over
                bg_thresh.  If under limit, then clear b_over_limit and release
                memcg reference.

                If unable to bring memcg dirty usage below bg limit after
                bdi_memcg->b_dirty is empty, release memcg reference and return.
                Next time memcg calls balance_dirty_pages it will either select
                another bdi or use lru to find an inode.

            use over_bground_thresh() to check global background limit.

When a memcg is deleted it may leave behing memcg_bdi structure.  These memcg
pointers are not referenced.  As such inodes are cleaned, the bdi_memcg b_dirty
list will become empty and bdi_memcg will be deleted.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback
  2011-04-06 15:39                         ` Vivek Goyal
  2011-04-06 19:49                           ` Greg Thelen
@ 2011-04-06 23:07                           ` Greg Thelen
  2011-04-06 23:36                             ` Dave Chinner
  1 sibling, 1 reply; 166+ messages in thread
From: Greg Thelen @ 2011-04-06 23:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Curt Wohlgemuth, Dave Chinner, James Bottomley, lsf, linux-fsdevel

Vivek Goyal <vgoyal@redhat.com> writes:

> On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote:
>
> [..]
>> > Can someone describe a valid shared inode use case? If not, we
>> > should not even consider it as a requirement and explicitly document
>> > it as a "not supported" use case.
>> 
>> At the very least, when a task is moved from one cgroup to another,
>> we've got a shared inode case.  This probably won't happen more than
>> once for most tasks, but it will likely be common.
>
> I am hoping that for such cases sooner or later inode movement will
> automatically take place. At some point of time, inode will be clean
> and no more on memcg_bdi list. And when it is dirtied again, I am 
> hoping it will be queued on new groups's list and not on old group's
> list? Greg?
>
> Thanks
> Vivek

After more thought, a few tweaks to the previous design have emerged.  I
noted such differences with 'Clarification' below.

When an inode is marked dirty, current->memcg is used to determine
which per memcg b_dirty list within the bdi is used to queue the
inode.  When the inode is marked clean, then the inode is removed from
the per memcg b_dirty list.  So, as Vivek said, when a process is
migrated between memcg, then the previously dirtied inodes will not be
moved.  Once such inodes are marked clean, and the re-dirtied, then
they will be requeued to the correct per memcg dirty inode list.

Here's an overview of the approach, which is assumes inode sharing is
rare but possible.  Thus, such sharing is tolerated (no live locks,
etc) but not optimized.

bdi -> 1:N -> bdi_memcg -> 1:N -> inode

mark_inode_dirty(inode)
   If I_DIRTY is clear, set I_DIRTY and inserted inode into bdi_memcg->b_dirty
   using current->memcg as a key to select the correct list.
       This will require memory allocation of bdi_memcg, if this is the
       first inode within the bdi,memcg.  If the allocation fails (rare,
       but possible), then fallback to adding the memcg to the root
       cgroup dirty inode list.
   If I_DIRTY is already set, then do nothing.

When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
if the list is now empty.

balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
   if over bg limit, then
       set bdi_memcg->b_over_limit
           If there is no bdi_memcg (because all inodes of current’s
           memcg dirty pages where first dirtied by other memcg) then
           memcg lru to find inode and call writeback_single_inode().
           This is to handle uncommon sharing.
       reference memcg for bdi flusher
       awake bdi flusher
   if over fg limit
       IO-full: write bdi_memcg directly (if empty use memcg lru to find
       inode to write)

       Clarification: In IO-less: queue memcg-waiting description to bdi
       flusher waiters (balance_list).

Clarification:
wakeup_flusher_threads():
  would take an optional memcg parameter, which would be included in the
  created work item.

  try_to_free_pages() would pass in a memcg.  Other callers would pass
  in NULL.


bdi_flusher(bdi):
    Clarification: When processing the bdi work queue, some work items
    may include a memcg (see wakeup_flusher_threads above).  If present,
    use the specified memcg to determine which bdi_memcg (and thus
    b_dirty list) should be used.  If NULL, then all bdi_memcg would be
    considered to process all inodes within the bdi.

   once work queue is empty:
       wb_check_old_data_flush():
           write old inodes from each of the per-memcg dirty lists.

       wb_check_background_flush():
           if any of bdi_memcg->b_over_limit is set, then write
           bdi_memcg->b_dirty inodes until under limit.

               After writing some data, recheck to see if memcg is still over
               bg_thresh.  If under limit, then clear b_over_limit and release
               memcg reference.

               If unable to bring memcg dirty usage below bg limit after
               bdi_memcg->b_dirty is empty, release memcg reference and return.
               Next time memcg calls balance_dirty_pages it will either select
               another bdi or use lru to find an inode.

           use over_bground_thresh() to check global background limit.

When a memcg is deleted it may leave behing memcg_bdi structure.  These memcg
pointers are not referenced.  As such inodes are cleaned, the bdi_memcg b_dirty
list will become empty and bdi_memcg will be deleted.


Too much code churn in writeback is not good.  So these memcg writeback
enhancements should probably wait for IO-less dirty throttling to get
worked out.  These memcg messages are design level discussions to get me
heading the right direction.  I plan on implementing memcg aware
writeback in the background while IO-less balance_dirty_pages is worked
out so I can follow it up.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-06 14:49                       ` Curt Wohlgemuth
  2011-04-06 15:39                         ` Vivek Goyal
@ 2011-04-06 23:08                         ` Dave Chinner
  2011-04-07 20:04                           ` Vivek Goyal
  1 sibling, 1 reply; 166+ messages in thread
From: Dave Chinner @ 2011-04-06 23:08 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Vivek Goyal, Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote:
> On Tue, Apr 5, 2011 at 3:56 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
> >> On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote:
> >> > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
> >> > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
> >> > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
> >> > > > > There
> >> > > > > is no context (memcg or otherwise) given to the bdi flusher.  After
> >> > > > > the bdi flusher checks system-wide background limits, it uses the
> >> > > > > over_bg_limit list to find (and rotate) an over limit memcg.  Using
> >> > > > > the memcg, then the per memcg per bdi dirty inode list is walked to
> >> > > > > find inode pages to writeback.  Once the memcg dirty memory usage
> >> > > > > drops below the memcg-thresh, the memcg is removed from the global
> >> > > > > over_bg_limit list.
> >> > > >
> >> > > > If you want controlled hand-off of writeback, you need to pass the
> >> > > > memcg that triggered the throttling directly to the bdi. You already
> >> > > > know what both the bdi and memcg that need writeback are. Yes, this
> >> > > > needs concurrency at the BDI flush level to handle, but see my
> >> > > > previous email in this thread for that....
> >> > > >
> >> > >
> >> > > Even with memcg being passed around I don't think that we get rid of
> >> > > global list lock.
> > .....
> >> > > The reason being that inodes are not exclusive to
> >> > > the memory cgroups. Multiple memory cgroups might be writting to same
> >> > > inode. So inode still remains in the global list and memory cgroups
> >> > > kind of will have pointer to it.
> >> >
> >> > So two dirty inode lists that have to be kept in sync? That doesn't
> >> > sound particularly appealing. Nor does it scale to an inode being
> >> > dirty in multiple cgroups
> >> >
> >> > Besides, if you've got multiple memory groups dirtying the same
> >> > inode, then you cannot expect isolation between groups. I'd consider
> >> > this a broken configuration in this case - how often does this
> >> > actually happen, and what is the use case for supporting
> >> > it?
> >> >
> >> > Besides, the implications are that we'd have to break up contiguous
> >> > IOs in the writeback path simply because two sequential pages are
> >> > associated with different groups. That's really nasty, and exactly
> >> > the opposite of all the write combining we try to do throughout the
> >> > writeback path. Supporting this is also a mess, as we'd have to touch
> >> > quite a lot of filesystem code (i.e. .writepage(s) inplementations)
> >> > to do this.
> >>
> >> We did not plan on breaking up contigous IO even if these belonged to
> >> different cgroup for performance reason. So probably can live with some
> >> inaccuracy and just trigger the writeback for one inode even if that
> >> meant that it could writeback the pages of some other cgroups doing IO
> >> on that inode.
> >
> > Which, to me, violates the principle of isolation as it's been
> > described that this functionality is supposed to provide.
> >
> > It also means you will have handle the case of a cgroup over a
> > throttle limit and no inodes on it's dirty list. It's not a case of
> > "probably can live with" the resultant mess, the mess will occur and
> > so handling it needs to be designed in from the start.
> >
> >> > > So to start writeback on an inode
> >> > > you still shall have to take global lock, IIUC.
> >> >
> >> > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
> >> > in cgroup, and go from there? I mean, really all that cgroup-aware
> >> > writeback needs is just adding a new container for managing
> >> > dirty inodes in the writeback path and a method for selecting that
> >> > container for writeback, right?
> >>
> >> This was the initial design where one inode is associated with one cgroup
> >> even if process from multiple cgroups are doing IO to same inode. Then
> >> somebody raised the concern that it probably is too coarse.
> >
> > Got a pointer?
> >
> >> IMHO, as a first step, associating inode to one cgroup exclusively
> >> simplifies the things considerably and we can target that first.
> >>
> >> So yes, I agree that bdi->list_of_dirty_cgroups->list_of_drity_inodes
> >> makes sense and is relatively simple way of doing things at the expense
> >> of not being accurate for shared inode case.
> >
> > Can someone describe a valid shared inode use case? If not, we
> > should not even consider it as a requirement and explicitly document
> > it as a "not supported" use case.
> 
> At the very least, when a task is moved from one cgroup to another,
> we've got a shared inode case.  This probably won't happen more than
> once for most tasks, but it will likely be common.

That's not a shared case, that's a transfer of ownership. If the
task changes groups, you have to charge all it's pages to the new
group, right? Otherwise you've got a problem where a task that is
not part of a specific cgroup is still somewhat controlled by it's
previous cgroup. It would also still influence that previous group
even though it's no longer a member. Not good for isolation purposes.

And if you are transfering the state, moving the inode from the
dirty list of one cgroup to another is trivial and avoids any need
for the dirty state to be shared....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback
  2011-04-06 23:07                           ` [Lsf] IO less throttling and cgroup aware writeback Greg Thelen
@ 2011-04-06 23:36                             ` Dave Chinner
  2011-04-07 19:24                               ` Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Dave Chinner @ 2011-04-06 23:36 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Vivek Goyal, Curt Wohlgemuth, James Bottomley, lsf, linux-fsdevel

On Wed, Apr 06, 2011 at 04:07:14PM -0700, Greg Thelen wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote:
> >
> > [..]
> >> > Can someone describe a valid shared inode use case? If not, we
> >> > should not even consider it as a requirement and explicitly document
> >> > it as a "not supported" use case.
> >> 
> >> At the very least, when a task is moved from one cgroup to another,
> >> we've got a shared inode case.  This probably won't happen more than
> >> once for most tasks, but it will likely be common.
> >
> > I am hoping that for such cases sooner or later inode movement will
> > automatically take place. At some point of time, inode will be clean
> > and no more on memcg_bdi list. And when it is dirtied again, I am 
> > hoping it will be queued on new groups's list and not on old group's
> > list? Greg?
> >
> > Thanks
> > Vivek
> 
> After more thought, a few tweaks to the previous design have emerged.  I
> noted such differences with 'Clarification' below.
> 
> When an inode is marked dirty, current->memcg is used to determine
> which per memcg b_dirty list within the bdi is used to queue the
> inode.  When the inode is marked clean, then the inode is removed from
> the per memcg b_dirty list.  So, as Vivek said, when a process is
> migrated between memcg, then the previously dirtied inodes will not be
> moved.  Once such inodes are marked clean, and the re-dirtied, then
> they will be requeued to the correct per memcg dirty inode list.
> 
> Here's an overview of the approach, which is assumes inode sharing is
> rare but possible.  Thus, such sharing is tolerated (no live locks,
> etc) but not optimized.
> 
> bdi -> 1:N -> bdi_memcg -> 1:N -> inode
> 
> mark_inode_dirty(inode)
>    If I_DIRTY is clear, set I_DIRTY and inserted inode into bdi_memcg->b_dirty
>    using current->memcg as a key to select the correct list.
>        This will require memory allocation of bdi_memcg, if this is the
>        first inode within the bdi,memcg.  If the allocation fails (rare,
>        but possible), then fallback to adding the memcg to the root
>        cgroup dirty inode list.
>    If I_DIRTY is already set, then do nothing.

This is where it gets tricky. Page cache dirtiness is tracked via
I_DIRTY_PAGES, a subset of I_DIRTY. I_DIRTY_DATASYNC and
I_DIRTY_SYNC are for inode metadata changes, and a lot of
filesystems track those themselves. Indeed, XFS doesn't mark inodes
dirty at the VFS for I_DIRTY_*SYNC for pure metadata operations any
more, and there's no way that tracking can be made cgroup aware.

Hence it can be the case that only I_DIRTY_PAGES is tracked in
the VFS dirty lists, and that is the flag you need to care about
here.

Further, we are actually looking at formalising this - changing the
.dirty_inode() operation to take the dirty flags and return a result
that indicates whether the inode should be tracked in the VFS dirty
list at all. This would stop double tracking of dirty inodes and go
a long way to solving some of the behavioural issues we have now
(e.g. the VFS tracking and trying to writeback inodes that the
filesystem has already cleaned).

Hence I think you need to be explicit that this tracking is
specifically for I_DIRTY_PAGES state, though will handle other dirty
inode states if desired by the filesytem.




> When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
> if the list is now empty.
> 
> balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
>    if over bg limit, then
>        set bdi_memcg->b_over_limit
>            If there is no bdi_memcg (because all inodes of current’s
>            memcg dirty pages where first dirtied by other memcg) then
>            memcg lru to find inode and call writeback_single_inode().
>            This is to handle uncommon sharing.

We don't want to introduce any new IO sources into
balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
flusher writeback, not try to write back inodes itself.

Alternatively, this problem won't exist if you transfer page щache
state from one memcg to another when you move the inode from one
memcg to another.

>        reference memcg for bdi flusher
>        awake bdi flusher
>    if over fg limit
>        IO-full: write bdi_memcg directly (if empty use memcg lru to find
>        inode to write)
> 
>        Clarification: In IO-less: queue memcg-waiting description to bdi
>        flusher waiters (balance_list).

I'd be looking at designing for IO-less throttling up front....

> Clarification:
> wakeup_flusher_threads():
>   would take an optional memcg parameter, which would be included in the
>   created work item.
> 
>   try_to_free_pages() would pass in a memcg.  Other callers would pass
>   in NULL.
> 
> 
> bdi_flusher(bdi):
>     Clarification: When processing the bdi work queue, some work items
>     may include a memcg (see wakeup_flusher_threads above).  If present,
>     use the specified memcg to determine which bdi_memcg (and thus
>     b_dirty list) should be used.  If NULL, then all bdi_memcg would be
>     considered to process all inodes within the bdi.
> 
>    once work queue is empty:
>        wb_check_old_data_flush():
>            write old inodes from each of the per-memcg dirty lists.
> 
>        wb_check_background_flush():
>            if any of bdi_memcg->b_over_limit is set, then write
>            bdi_memcg->b_dirty inodes until under limit.
> 
>                After writing some data, recheck to see if memcg is still over
>                bg_thresh.  If under limit, then clear b_over_limit and release
>                memcg reference.
> 
>                If unable to bring memcg dirty usage below bg limit after
>                bdi_memcg->b_dirty is empty, release memcg reference and return.
>                Next time memcg calls balance_dirty_pages it will either select
>                another bdi or use lru to find an inode.

I think all the background flush cares about is bringing memcg's
under the dirty limit. What balance_dirty_pages() does is irrelevant
to the background flush.

>            use over_bground_thresh() to check global background limit.

the background flush needs to continue while over the global limit
even if all the memcg's are under their limits. In which case, we
need to consider if we need to be fair when writing back memcg's on
a bdi i.e. do we cycle an inode at a time until b_io is empty, then
cycle to the next memcg, and not come back to the first memcg with
inodes queued on b_more_io until they all have empty b_io queues?

> When a memcg is deleted it may leave behing memcg_bdi structure.  These memcg
> pointers are not referenced.  As such inodes are cleaned, the bdi_memcg b_dirty
> list will become empty and bdi_memcg will be deleted.

So you need to reference count the bdi_memcg structures?

> Too much code churn in writeback is not good.  So these memcg writeback
> enhancements should probably wait for IO-less dirty throttling to get
> worked out.

Agreed. We're probably looking at .41 or .42 for any memcg writeback
enhancements.

> These memcg messages are design level discussions to get me
> heading the right direction.  I plan on implementing memcg aware
> writeback in the background while IO-less balance_dirty_pages is worked
> out so I can follow it up.

Great!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-06 15:37                       ` Vivek Goyal
  2011-04-06 16:08                         ` Vivek Goyal
@ 2011-04-06 23:50                         ` Dave Chinner
  2011-04-07 17:55                           ` Vivek Goyal
  1 sibling, 1 reply; 166+ messages in thread
From: Dave Chinner @ 2011-04-06 23:50 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote:
> On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote:
> > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
> > It also means you will have handle the case of a cgroup over a
> > throttle limit and no inodes on it's dirty list. It's not a case of
> > "probably can live with" the resultant mess, the mess will occur and
> > so handling it needs to be designed in from the start.
> 
> This behavior can happen due to shared page accounting. One possible
> way to mitigate this problme is to traverse through LRU list of pages
> of memcg and find an inode to do the writebak. 

Page LRU ordered writeback is something we need to avoid. It causes
havok with IO and allocation patterns. Also, how expensive is such a
walk? If it's a common operation, then it's a non-starter for the
generic writeback code.

BTW, how is "shared page accounting" different to the shared dirty
inode case we've been discussing?

> After yesterday's discussion it looked like people agreed that to 
> begin with keep it simple and maintain the notion of one inode on
> one memcg list. So instead of inode being on global bdi dirty list
> it will be on per memecg per bdi dirty list.

Good to hear.

> > how metadata IO is going to be handled by
> > IO controllers,
> 
> So IO controller provides two mechanisms.
> 
> - IO throttling(bytes_per_second, io_per_second interface)
> - Proportional weight disk sharing
> 
> In case of proportional weight disk sharing, we don't run into issues of
> priority inversion and metadata handing should not be a concern.

Though metadata IO will affect how much bandwidth/iops is available
for applications to use.

> For throttling case, apart from metadata, I found that with simple
> throttling of data I ran into issues with journalling with ext4 mounuted
> in ordered mode. So it was suggested that WRITE IO throttling should
> not be done at device level instead try to do it in higher layers,
> possibly balance_dirty_pages() and throttle process early.

The problem with doing it at the page cache entry level is that
cache hits then get throttled. It's not really a an IO controller at
that point, and the impact on application performance could be huge
(i.e. MB/s instead of GB/s).

> So yes, I agree that little more documentation and more clarity on this
> would be good. All this cgroup aware writeback primarily is being done
> for CFQ's proportional disk sharing at the moment.
> 
> > what kswapd is going to do writeback when the pages
> > it's trying to writeback during a critical low memory event belong
> > to a cgroup that is throttled at the IO level, etc.
> 
> Throttling will move up so kswapd will not be throttled. Even today,
> kswapd is part of root group and we do not suggest throttling root group.

So once again you have the problem of writeback from kswapd (which
is ugly to begin with) affecting all the groups. Given kswapd likes
to issue what is effectively random IO, this coul dhave devastating
effect on everything else....

> For the case of proportional disk sharing, we will probably account
> IO to respective cgroups (pages submitted by kswapd) and that should
> not flush to disk fairly fast and should not block for long time as it is
> work consering mechanism.

Well, it depends. I can still see how, with proportional IO, kswapd
would get slowed cleaning dirty pages on one memcg when there are
clean pages in another memcg that it could reclaim without doing any
IO. i.e. it has potential to slow down memory reclaim significantly.
(Note, I'm assuming proportional IO doesn't mean "no throttling" it
just means there is a much lower delay on IO issue.)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-06 23:50                         ` Dave Chinner
@ 2011-04-07 17:55                           ` Vivek Goyal
  2011-04-11  1:36                             ` Dave Chinner
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-07 17:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Thu, Apr 07, 2011 at 09:50:39AM +1000, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote:
> > On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
> > > It also means you will have handle the case of a cgroup over a
> > > throttle limit and no inodes on it's dirty list. It's not a case of
> > > "probably can live with" the resultant mess, the mess will occur and
> > > so handling it needs to be designed in from the start.
> > 
> > This behavior can happen due to shared page accounting. One possible
> > way to mitigate this problme is to traverse through LRU list of pages
> > of memcg and find an inode to do the writebak. 
> 
> Page LRU ordered writeback is something we need to avoid. It causes
> havok with IO and allocation patterns. Also, how expensive is such a
> walk? If it's a common operation, then it's a non-starter for the
> generic writeback code.
> 

Agreed that LRU ordered writeback needs to be avoided as it is going to be
expensive. That's why the notion of inode on per memcg list and do per inode
writeback. This seems to be only backup plan in case when there is no
inode to do IO due to shared inode accounting issues. 

Do you have ideas on better way to handle it? The other proposal of
maintaining a list memcg_mapping, which tracks which inode this cgroup
has dirtied has been deemed complex and been kind of rejected at least
for the first step.

> BTW, how is "shared page accounting" different to the shared dirty
> inode case we've been discussing?

IIUC, there are two problems.

- Issues because of shared page accounting
- Issues because of shared inode accouting.

So in shared page accounting, if two process do IO to same page, IO gets
charged to cgroup who first touched the page. So if a cgroup is writting
on lots of shared pages, it will be charged to the other cgroup who 
brought the page in memory to begin with and will drive its dirty ratio
up. So this seems to be case of weaker isolation in case of shared pages,
and we got to live with it.

Similarly if inode is shared, inode gets put on the list of memcg who dirtied
it first. So now if two cgroups are dirtying pages on inode, then pages should
be charged to respective cgroup but inode will be only on one memcg and once
writeback is performed it might happen that cgroup is over its background
limit but there are no inodes to do writeback.

> 
> > After yesterday's discussion it looked like people agreed that to 
> > begin with keep it simple and maintain the notion of one inode on
> > one memcg list. So instead of inode being on global bdi dirty list
> > it will be on per memecg per bdi dirty list.
> 
> Good to hear.
> 
> > > how metadata IO is going to be handled by
> > > IO controllers,
> > 
> > So IO controller provides two mechanisms.
> > 
> > - IO throttling(bytes_per_second, io_per_second interface)
> > - Proportional weight disk sharing
> > 
> > In case of proportional weight disk sharing, we don't run into issues of
> > priority inversion and metadata handing should not be a concern.
> 
> Though metadata IO will affect how much bandwidth/iops is available
> for applications to use.

I think meta data IO will be accounted to the process submitting the meta
data IO. (IO tracking stuff will be used only for page cache pages during
page dirtying time). So yes, the process doing meta data IO will be
charged for it. 

I think I am missing something here and not understanding your concern
exactly here.

> 
> > For throttling case, apart from metadata, I found that with simple
> > throttling of data I ran into issues with journalling with ext4 mounuted
> > in ordered mode. So it was suggested that WRITE IO throttling should
> > not be done at device level instead try to do it in higher layers,
> > possibly balance_dirty_pages() and throttle process early.
> 
> The problem with doing it at the page cache entry level is that
> cache hits then get throttled. It's not really a an IO controller at
> that point, and the impact on application performance could be huge
> (i.e. MB/s instead of GB/s).

Agreed that throttling cache hits is not a good idea. Can we determine
if page being asked for is in cache or not and charge for IO accordingly.

> 
> > So yes, I agree that little more documentation and more clarity on this
> > would be good. All this cgroup aware writeback primarily is being done
> > for CFQ's proportional disk sharing at the moment.
> > 
> > > what kswapd is going to do writeback when the pages
> > > it's trying to writeback during a critical low memory event belong
> > > to a cgroup that is throttled at the IO level, etc.
> > 
> > Throttling will move up so kswapd will not be throttled. Even today,
> > kswapd is part of root group and we do not suggest throttling root group.
> 
> So once again you have the problem of writeback from kswapd (which
> is ugly to begin with) affecting all the groups. Given kswapd likes
> to issue what is effectively random IO, this coul dhave devastating
> effect on everything else....

Implementing throttling at higher layer has the problem of IO spikes
at the end level devices when flusher or kswapd decide to do bunch of
IO. I really don't have a good answer for that. Doing throttling at
device level runs into issues with journalling. So I guess issues of
IO spikes is lesser concern as compared to issue of choking filesystem.

Following two things might help though a bit with IO spikes.

- Keep per cgroup background dirty ratio low so that flusher tries to
  flush out pages sooner than later.

- All the IO coming from flusher/kswapd will be going in root group
  from throttling perspective. We can try to throttle it again to
  some reasonable value to reduce the impact of IO spikes. 

Ideas to handle this better?

> 
> > For the case of proportional disk sharing, we will probably account
> > IO to respective cgroups (pages submitted by kswapd) and that should
> > not flush to disk fairly fast and should not block for long time as it is
> > work consering mechanism.
> 
> Well, it depends. I can still see how, with proportional IO, kswapd
> would get slowed cleaning dirty pages on one memcg when there are
> clean pages in another memcg that it could reclaim without doing any
> IO. i.e. it has potential to slow down memory reclaim significantly.
> (Note, I'm assuming proportional IO doesn't mean "no throttling" it
> just means there is a much lower delay on IO issue.)

Proportional IO can delay submitting an IO only if there is IO happening
in other groups. So IO can still be throttled and limits are decided
by fair share of a group. But if other groups are not doing IO and not
using their fair share, then the group doing IO gets bigger share.

So yes, if heavy IO is happening at disk while kswapd is also trying
to reclaim memory, then IO submitted by kswapd can be delayed and
this can slow down reclaim. (Does kswapd has to block after submitting
IO from a memcg. Can't it just move onto next memcg and either free
pages if not dirty, or also submit IO from next memcg?)

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback
  2011-04-06 23:36                             ` Dave Chinner
@ 2011-04-07 19:24                               ` Vivek Goyal
  2011-04-07 20:33                                 ` Christoph Hellwig
  2011-04-07 23:42                                 ` Dave Chinner
  0 siblings, 2 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-07 19:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Greg Thelen, Curt Wohlgemuth, James Bottomley, lsf, linux-fsdevel

On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote:

[..]
> > mark_inode_dirty(inode)
> >    If I_DIRTY is clear, set I_DIRTY and inserted inode into bdi_memcg->b_dirty
> >    using current->memcg as a key to select the correct list.
> >        This will require memory allocation of bdi_memcg, if this is the
> >        first inode within the bdi,memcg.  If the allocation fails (rare,
> >        but possible), then fallback to adding the memcg to the root
> >        cgroup dirty inode list.
> >    If I_DIRTY is already set, then do nothing.
> 
> This is where it gets tricky. Page cache dirtiness is tracked via
> I_DIRTY_PAGES, a subset of I_DIRTY. I_DIRTY_DATASYNC and
> I_DIRTY_SYNC are for inode metadata changes, and a lot of
> filesystems track those themselves. Indeed, XFS doesn't mark inodes
> dirty at the VFS for I_DIRTY_*SYNC for pure metadata operations any
> more, and there's no way that tracking can be made cgroup aware.
> 
> Hence it can be the case that only I_DIRTY_PAGES is tracked in
> the VFS dirty lists, and that is the flag you need to care about
> here.
> 
> Further, we are actually looking at formalising this - changing the
> .dirty_inode() operation to take the dirty flags and return a result
> that indicates whether the inode should be tracked in the VFS dirty
> list at all. This would stop double tracking of dirty inodes and go
> a long way to solving some of the behavioural issues we have now
> (e.g. the VFS tracking and trying to writeback inodes that the
> filesystem has already cleaned).
> 
> Hence I think you need to be explicit that this tracking is
> specifically for I_DIRTY_PAGES state, though will handle other dirty
> inode states if desired by the filesytem.

Ok, that makes sense. We are interested primarily in I_DIRTY_PAGES state
only.

IIUC, so first we need to fix existing code where we seem to be moving
any inode on bdi writeback list based on I_DIRTY flag.

BTW, what's the difference between I_DIRTY_DATASYNC and I_DIRTY_PAGES? To
me both seem to mean that data needs to be written back and not the
inode itself.

> 
> > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
> > if the list is now empty.
> > 
> > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
> >    if over bg limit, then
> >        set bdi_memcg->b_over_limit
> >            If there is no bdi_memcg (because all inodes of current’s
> >            memcg dirty pages where first dirtied by other memcg) then
> >            memcg lru to find inode and call writeback_single_inode().
> >            This is to handle uncommon sharing.
> 
> We don't want to introduce any new IO sources into
> balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
> flusher writeback, not try to write back inodes itself.

Will we not enjoy more sequtial IO traffic once we find an inode by
traversing memcg->lru list? So isn't that better than pure LRU based
flushing?

> 
> Alternatively, this problem won't exist if you transfer page щache
> state from one memcg to another when you move the inode from one
> memcg to another.

But in case of shared inode problem still remains. inode is being written
from two cgroups and it can't be in both the groups as per the exisiting
design.

> 
> >        reference memcg for bdi flusher
> >        awake bdi flusher
> >    if over fg limit
> >        IO-full: write bdi_memcg directly (if empty use memcg lru to find
> >        inode to write)
> > 
> >        Clarification: In IO-less: queue memcg-waiting description to bdi
> >        flusher waiters (balance_list).
> 
> I'd be looking at designing for IO-less throttling up front....

Agreed. Lets design it on top of IO less throttling patches. We also 
shall have to modify IO less throttling a bit so that page completions
are not uniformly distributed across all the threads but we need to
account for groups first and then distribute completions with-in group
uniformly.

[..]
> >            use over_bground_thresh() to check global background limit.
> 
> the background flush needs to continue while over the global limit
> even if all the memcg's are under their limits. In which case, we
> need to consider if we need to be fair when writing back memcg's on
> a bdi i.e. do we cycle an inode at a time until b_io is empty, then
> cycle to the next memcg, and not come back to the first memcg with
> inodes queued on b_more_io until they all have empty b_io queues?
> 

I think continue to cycle through memcg's even in this case will make 
sense.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-06 23:08                         ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner
@ 2011-04-07 20:04                           ` Vivek Goyal
  2011-04-07 23:47                             ` Dave Chinner
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-07 20:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Curt Wohlgemuth, Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Thu, Apr 07, 2011 at 09:08:04AM +1000, Dave Chinner wrote:

[..]
> > At the very least, when a task is moved from one cgroup to another,
> > we've got a shared inode case.  This probably won't happen more than
> > once for most tasks, but it will likely be common.
> 
> That's not a shared case, that's a transfer of ownership. If the
> task changes groups, you have to charge all it's pages to the new
> group, right? Otherwise you've got a problem where a task that is
> not part of a specific cgroup is still somewhat controlled by it's
> previous cgroup. It would also still influence that previous group
> even though it's no longer a member. Not good for isolation purposes.
> 
> And if you are transfering the state, moving the inode from the
> dirty list of one cgroup to another is trivial and avoids any need
> for the dirty state to be shared....

I am wondering how do you map a task to an inode. Multiple tasks in the
group might have written to same inode. Now which task owns it? 

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback
  2011-04-07 19:24                               ` Vivek Goyal
@ 2011-04-07 20:33                                 ` Christoph Hellwig
  2011-04-07 21:34                                   ` Vivek Goyal
  2011-04-07 23:42                                 ` Dave Chinner
  1 sibling, 1 reply; 166+ messages in thread
From: Christoph Hellwig @ 2011-04-07 20:33 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Dave Chinner, James Bottomley, lsf, linux-fsdevel

On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote:
> IIUC, so first we need to fix existing code where we seem to be moving
> any inode on bdi writeback list based on I_DIRTY flag.

I_DIRTY is a set of flags.  Inodes are on the dirty list if any of
the flags is set.

> BTW, what's the difference between I_DIRTY_DATASYNC and I_DIRTY_PAGES? To
> me both seem to mean that data needs to be written back and not the
> inode itself.

I_DIRTY_PAGES means dirty data (pages)
I_DIRTY_DATASYNC means dirty metadata which needs to be written for fdatasync
I_DIRTY_SYNC means dirty metadata which only needs to be written for fsync


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback
  2011-04-07 20:33                                 ` Christoph Hellwig
@ 2011-04-07 21:34                                   ` Vivek Goyal
  0 siblings, 0 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-07 21:34 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, James Bottomley, lsf, linux-fsdevel

On Thu, Apr 07, 2011 at 04:33:03PM -0400, Christoph Hellwig wrote:
> On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote:
> > IIUC, so first we need to fix existing code where we seem to be moving
> > any inode on bdi writeback list based on I_DIRTY flag.
> 
> I_DIRTY is a set of flags.  Inodes are on the dirty list if any of
> the flags is set.
> 
> > BTW, what's the difference between I_DIRTY_DATASYNC and I_DIRTY_PAGES? To
> > me both seem to mean that data needs to be written back and not the
> > inode itself.
> 
> I_DIRTY_PAGES means dirty data (pages)
> I_DIRTY_DATASYNC means dirty metadata which needs to be written for fdatasync
> I_DIRTY_SYNC means dirty metadata which only needs to be written for fsync

Ok, that helps. Thanks. So an fdatasync() can write back some metadata
too if I_DIRTY_DATASYNC is set.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback
  2011-04-07 19:24                               ` Vivek Goyal
  2011-04-07 20:33                                 ` Christoph Hellwig
@ 2011-04-07 23:42                                 ` Dave Chinner
  2011-04-08  0:59                                   ` Greg Thelen
  2011-04-08 13:43                                   ` Vivek Goyal
  1 sibling, 2 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-07 23:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Greg Thelen, Curt Wohlgemuth, James Bottomley, lsf, linux-fsdevel

On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote:
> On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote:
[...]
> > > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
> > > if the list is now empty.
> > > 
> > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
> > >    if over bg limit, then
> > >        set bdi_memcg->b_over_limit
> > >            If there is no bdi_memcg (because all inodes of current’s
> > >            memcg dirty pages where first dirtied by other memcg) then
> > >            memcg lru to find inode and call writeback_single_inode().
> > >            This is to handle uncommon sharing.
> > 
> > We don't want to introduce any new IO sources into
> > balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
> > flusher writeback, not try to write back inodes itself.
> 
> Will we not enjoy more sequtial IO traffic once we find an inode by
> traversing memcg->lru list? So isn't that better than pure LRU based
> flushing?

Sorry, I wasn't particularly clear there, What I meant was that we
ask the bdi-flusher thread to select the inode to write back from
the LRU, not do it directly from balance_dirty_pages(). i.e.
bdp stays IO-less.

> > Alternatively, this problem won't exist if you transfer page щache
> > state from one memcg to another when you move the inode from one
> > memcg to another.
> 
> But in case of shared inode problem still remains. inode is being written
> from two cgroups and it can't be in both the groups as per the exisiting
> design.

But we've already determined that there is no use case for this
shared inode behaviour, so we aren't going to explictly support it,
right?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-07 20:04                           ` Vivek Goyal
@ 2011-04-07 23:47                             ` Dave Chinner
  2011-04-08 13:50                               ` Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Dave Chinner @ 2011-04-07 23:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Curt Wohlgemuth, Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Thu, Apr 07, 2011 at 04:04:37PM -0400, Vivek Goyal wrote:
> On Thu, Apr 07, 2011 at 09:08:04AM +1000, Dave Chinner wrote:
> 
> [..]
> > > At the very least, when a task is moved from one cgroup to another,
> > > we've got a shared inode case.  This probably won't happen more than
> > > once for most tasks, but it will likely be common.
> > 
> > That's not a shared case, that's a transfer of ownership. If the
> > task changes groups, you have to charge all it's pages to the new
> > group, right? Otherwise you've got a problem where a task that is
> > not part of a specific cgroup is still somewhat controlled by it's
> > previous cgroup. It would also still influence that previous group
> > even though it's no longer a member. Not good for isolation purposes.
> > 
> > And if you are transfering the state, moving the inode from the
> > dirty list of one cgroup to another is trivial and avoids any need
> > for the dirty state to be shared....
> 
> I am wondering how do you map a task to an inode. Multiple tasks in the
> group might have written to same inode. Now which task owns it? 

That sounds like a completely broken configuration to me. If you are
using cgroups for isolation, you simple do not share *anything*
between them.

Right now the only use case that has been presented for shared
inodes is transfering a task from one cgroup to another. Why on
earth would you do that if it is sharing resources with other tasks
in the original cgroup? What use case does this represent, how often
is it likely to happen, and who cares about it anyway?

Let's not overly complicate things by making up requirements that
nobody cares about....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback
  2011-04-07 23:42                                 ` Dave Chinner
@ 2011-04-08  0:59                                   ` Greg Thelen
  2011-04-08  1:25                                       ` Dave Chinner
  2011-04-08 13:43                                   ` Vivek Goyal
  1 sibling, 1 reply; 166+ messages in thread
From: Greg Thelen @ 2011-04-08  0:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vivek Goyal, Curt Wohlgemuth, James Bottomley, lsf,
	linux-fsdevel, linux-mm

cc: linux-mm

Dave Chinner <david@fromorbit.com> writes:

> On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote:
>> On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote:
> [...]
>> > > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
>> > > if the list is now empty.
>> > > 
>> > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
>> > >    if over bg limit, then
>> > >        set bdi_memcg->b_over_limit
>> > >            If there is no bdi_memcg (because all inodes of current’s
>> > >            memcg dirty pages where first dirtied by other memcg) then
>> > >            memcg lru to find inode and call writeback_single_inode().
>> > >            This is to handle uncommon sharing.
>> > 
>> > We don't want to introduce any new IO sources into
>> > balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
>> > flusher writeback, not try to write back inodes itself.
>> 
>> Will we not enjoy more sequtial IO traffic once we find an inode by
>> traversing memcg->lru list? So isn't that better than pure LRU based
>> flushing?
>
> Sorry, I wasn't particularly clear there, What I meant was that we
> ask the bdi-flusher thread to select the inode to write back from
> the LRU, not do it directly from balance_dirty_pages(). i.e.
> bdp stays IO-less.
>
>> > Alternatively, this problem won't exist if you transfer page щache
>> > state from one memcg to another when you move the inode from one
>> > memcg to another.
>> 
>> But in case of shared inode problem still remains. inode is being written
>> from two cgroups and it can't be in both the groups as per the exisiting
>> design.
>
> But we've already determined that there is no use case for this
> shared inode behaviour, so we aren't going to explictly support it,
> right?
>
> Cheers,
>
> Dave.

I am thinking that we should avoid ever scanning the memcg lru for dirty
pages or corresponding dirty inodes previously associated with other
memcg.  I think the only reason we considered scanning the lru was to
handle the unexpected shared inode case.  When such inode sharing occurs
the sharing memcg will not be confined to the memcg's dirty limit.
There's always the memcg hard limit to cap memcg usage.

I'd like to add a counter (or at least tracepoint) to record when such
unsupported usage is detected.

Here's an example time line of such sharing:

1. memcg_1/process_a, writes to /var/log/messages and closes the file.
   This marks the inode in the bdi_memcg for memcg_1.

2. memcg_2/process_b, continually writes to /var/log/messages.  This
   drives up memcg_2 dirty memory usage to the memcg_2 background
   threshold.  mem_cgroup_balance_dirty_pages() would normally mark the
   corresponding bdi_memcg as over-bg-limit and kick the bdi_flusher and
   then return to the dirtying process.  However, there is no bdi_memcg
   because there are no dirty inodes for memcg_2.  So the bdi flusher
   sees no bdi_memcg as marked over-limit, so bdi flusher writes nothing
   (assuming we're still below system background threshold).

3. memcg_2/process_b, continues writing to /var/log/messages hitting the
   memcg_2 dirty memory foreground threshold.  Using IO-less
   balance_dirty_pages(), normally mem_cgroup_balance_dirty_pages()
   would block waiting for the previously kicked bdi flusher to clean
   some memcg_2 pages.  In this case mem_cgroup_balance_dirty_pages()
   sees no bdi_memcg and concludes that bdi flusher will not be lowering
   memcg dirty memory usage.  This is the unsupported sharing case, so
   mem_cgroup_balance_dirty_pages() fires a tracepoint and just returns
   allowing memcg_2 dirty memory to exceed its foreground limit growing
   upwards to the memcg_2 memory limit_in_bytes.  Once limit_in_bytes is
   hit it will use per memcg direct reclaim to recycle memcg_2 pages,
   including the previously written memcg_2 /var/log/messages dirty
   pages.

By cutting out lru scanning the code should be simpler and still handle
the common case well.

If we later find that this supposed uncommon shared inode case is
important then we can either implement the previously described lru
scanning in mem_cgroup_balance_dirty_pages() or consider extending the
bdi/memcg/inode data structures (perhaps with a memcg_mapping) to
describe such sharing.

> Cheers,
>
> Dave.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback
  2011-04-08  0:59                                   ` Greg Thelen
@ 2011-04-08  1:25                                       ` Dave Chinner
  0 siblings, 0 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-08  1:25 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Vivek Goyal, Curt Wohlgemuth, James Bottomley, lsf,
	linux-fsdevel, linux-mm

On Thu, Apr 07, 2011 at 05:59:35PM -0700, Greg Thelen wrote:
> cc: linux-mm
> 
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote:
> >> On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote:
> > [...]
> >> > > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
> >> > > if the list is now empty.
> >> > > 
> >> > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
> >> > >    if over bg limit, then
> >> > >        set bdi_memcg->b_over_limit
> >> > >            If there is no bdi_memcg (because all inodes of current’s
> >> > >            memcg dirty pages where first dirtied by other memcg) then
> >> > >            memcg lru to find inode and call writeback_single_inode().
> >> > >            This is to handle uncommon sharing.
> >> > 
> >> > We don't want to introduce any new IO sources into
> >> > balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
> >> > flusher writeback, not try to write back inodes itself.
> >> 
> >> Will we not enjoy more sequtial IO traffic once we find an inode by
> >> traversing memcg->lru list? So isn't that better than pure LRU based
> >> flushing?
> >
> > Sorry, I wasn't particularly clear there, What I meant was that we
> > ask the bdi-flusher thread to select the inode to write back from
> > the LRU, not do it directly from balance_dirty_pages(). i.e.
> > bdp stays IO-less.
> >
> >> > Alternatively, this problem won't exist if you transfer page щache
> >> > state from one memcg to another when you move the inode from one
> >> > memcg to another.
> >> 
> >> But in case of shared inode problem still remains. inode is being written
> >> from two cgroups and it can't be in both the groups as per the exisiting
> >> design.
> >
> > But we've already determined that there is no use case for this
> > shared inode behaviour, so we aren't going to explictly support it,
> > right?
> 
> I am thinking that we should avoid ever scanning the memcg lru for dirty
> pages or corresponding dirty inodes previously associated with other
> memcg.  I think the only reason we considered scanning the lru was to
> handle the unexpected shared inode case.  When such inode sharing occurs
> the sharing memcg will not be confined to the memcg's dirty limit.
> There's always the memcg hard limit to cap memcg usage.

Yup, fair enough.


> I'd like to add a counter (or at least tracepoint) to record when such
> unsupported usage is detected.

Definitely. Very good idea.

> 1. memcg_1/process_a, writes to /var/log/messages and closes the file.
>    This marks the inode in the bdi_memcg for memcg_1.
> 
> 2. memcg_2/process_b, continually writes to /var/log/messages.  This
>    drives up memcg_2 dirty memory usage to the memcg_2 background
>    threshold.  mem_cgroup_balance_dirty_pages() would normally mark the
>    corresponding bdi_memcg as over-bg-limit and kick the bdi_flusher and
>    then return to the dirtying process.  However, there is no bdi_memcg
>    because there are no dirty inodes for memcg_2.  So the bdi flusher
>    sees no bdi_memcg as marked over-limit, so bdi flusher writes nothing
>    (assuming we're still below system background threshold).
> 
> 3. memcg_2/process_b, continues writing to /var/log/messages hitting the
>    memcg_2 dirty memory foreground threshold.  Using IO-less
>    balance_dirty_pages(), normally mem_cgroup_balance_dirty_pages()
>    would block waiting for the previously kicked bdi flusher to clean
>    some memcg_2 pages.  In this case mem_cgroup_balance_dirty_pages()
>    sees no bdi_memcg and concludes that bdi flusher will not be lowering
>    memcg dirty memory usage.  This is the unsupported sharing case, so
>    mem_cgroup_balance_dirty_pages() fires a tracepoint and just returns
>    allowing memcg_2 dirty memory to exceed its foreground limit growing
>    upwards to the memcg_2 memory limit_in_bytes.  Once limit_in_bytes is
>    hit it will use per memcg direct reclaim to recycle memcg_2 pages,
>    including the previously written memcg_2 /var/log/messages dirty
>    pages.

Thanks for the good, simple  example.

> By cutting out lru scanning the code should be simpler and still
> handle the common case well.

Agreed.

> If we later find that this supposed uncommon shared inode case is
> important then we can either implement the previously described lru
> scanning in mem_cgroup_balance_dirty_pages() or consider extending the
> bdi/memcg/inode data structures (perhaps with a memcg_mapping) to
> describe such sharing.

Hmm, another idea I just had. What we're trying to avoid is needing
to a) track inodes in multiple lists, and b) scanning to find
something appropriate to write back.

Rather than tracking at page or inode granularity, how about
tracking "associated" memcgs at the memcg level? i.e. when we detect
an inode is already dirty in another memcg, link the current memcg
to the one that contains the inode. Hence if we get a situation
where a memcg is throttling with no dirty inodes, it can quickly
find and start writeback in an "associated" memcg that it _knows_
contain shared dirty inodes. Once we've triggered writeback on an
associated memcg, it is removed from the list....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback
@ 2011-04-08  1:25                                       ` Dave Chinner
  0 siblings, 0 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-08  1:25 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Vivek Goyal, Curt Wohlgemuth, James Bottomley, lsf,
	linux-fsdevel, linux-mm

On Thu, Apr 07, 2011 at 05:59:35PM -0700, Greg Thelen wrote:
> cc: linux-mm
> 
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote:
> >> On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote:
> > [...]
> >> > > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
> >> > > if the list is now empty.
> >> > > 
> >> > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
> >> > >    if over bg limit, then
> >> > >        set bdi_memcg->b_over_limit
> >> > >            If there is no bdi_memcg (because all inodes of currenta??s
> >> > >            memcg dirty pages where first dirtied by other memcg) then
> >> > >            memcg lru to find inode and call writeback_single_inode().
> >> > >            This is to handle uncommon sharing.
> >> > 
> >> > We don't want to introduce any new IO sources into
> >> > balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
> >> > flusher writeback, not try to write back inodes itself.
> >> 
> >> Will we not enjoy more sequtial IO traffic once we find an inode by
> >> traversing memcg->lru list? So isn't that better than pure LRU based
> >> flushing?
> >
> > Sorry, I wasn't particularly clear there, What I meant was that we
> > ask the bdi-flusher thread to select the inode to write back from
> > the LRU, not do it directly from balance_dirty_pages(). i.e.
> > bdp stays IO-less.
> >
> >> > Alternatively, this problem won't exist if you transfer page N?ache
> >> > state from one memcg to another when you move the inode from one
> >> > memcg to another.
> >> 
> >> But in case of shared inode problem still remains. inode is being written
> >> from two cgroups and it can't be in both the groups as per the exisiting
> >> design.
> >
> > But we've already determined that there is no use case for this
> > shared inode behaviour, so we aren't going to explictly support it,
> > right?
> 
> I am thinking that we should avoid ever scanning the memcg lru for dirty
> pages or corresponding dirty inodes previously associated with other
> memcg.  I think the only reason we considered scanning the lru was to
> handle the unexpected shared inode case.  When such inode sharing occurs
> the sharing memcg will not be confined to the memcg's dirty limit.
> There's always the memcg hard limit to cap memcg usage.

Yup, fair enough.


> I'd like to add a counter (or at least tracepoint) to record when such
> unsupported usage is detected.

Definitely. Very good idea.

> 1. memcg_1/process_a, writes to /var/log/messages and closes the file.
>    This marks the inode in the bdi_memcg for memcg_1.
> 
> 2. memcg_2/process_b, continually writes to /var/log/messages.  This
>    drives up memcg_2 dirty memory usage to the memcg_2 background
>    threshold.  mem_cgroup_balance_dirty_pages() would normally mark the
>    corresponding bdi_memcg as over-bg-limit and kick the bdi_flusher and
>    then return to the dirtying process.  However, there is no bdi_memcg
>    because there are no dirty inodes for memcg_2.  So the bdi flusher
>    sees no bdi_memcg as marked over-limit, so bdi flusher writes nothing
>    (assuming we're still below system background threshold).
> 
> 3. memcg_2/process_b, continues writing to /var/log/messages hitting the
>    memcg_2 dirty memory foreground threshold.  Using IO-less
>    balance_dirty_pages(), normally mem_cgroup_balance_dirty_pages()
>    would block waiting for the previously kicked bdi flusher to clean
>    some memcg_2 pages.  In this case mem_cgroup_balance_dirty_pages()
>    sees no bdi_memcg and concludes that bdi flusher will not be lowering
>    memcg dirty memory usage.  This is the unsupported sharing case, so
>    mem_cgroup_balance_dirty_pages() fires a tracepoint and just returns
>    allowing memcg_2 dirty memory to exceed its foreground limit growing
>    upwards to the memcg_2 memory limit_in_bytes.  Once limit_in_bytes is
>    hit it will use per memcg direct reclaim to recycle memcg_2 pages,
>    including the previously written memcg_2 /var/log/messages dirty
>    pages.

Thanks for the good, simple  example.

> By cutting out lru scanning the code should be simpler and still
> handle the common case well.

Agreed.

> If we later find that this supposed uncommon shared inode case is
> important then we can either implement the previously described lru
> scanning in mem_cgroup_balance_dirty_pages() or consider extending the
> bdi/memcg/inode data structures (perhaps with a memcg_mapping) to
> describe such sharing.

Hmm, another idea I just had. What we're trying to avoid is needing
to a) track inodes in multiple lists, and b) scanning to find
something appropriate to write back.

Rather than tracking at page or inode granularity, how about
tracking "associated" memcgs at the memcg level? i.e. when we detect
an inode is already dirty in another memcg, link the current memcg
to the one that contains the inode. Hence if we get a situation
where a memcg is throttling with no dirty inodes, it can quickly
find and start writeback in an "associated" memcg that it _knows_
contain shared dirty inodes. Once we've triggered writeback on an
associated memcg, it is removed from the list....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-06 17:10                           ` Jan Kara
  2011-04-06 17:14                             ` Curt Wohlgemuth
@ 2011-04-08  1:58                             ` Dave Chinner
  2011-04-19 14:26                               ` Wu Fengguang
  1 sibling, 1 reply; 166+ messages in thread
From: Dave Chinner @ 2011-04-08  1:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel

On Wed, Apr 06, 2011 at 07:10:17PM +0200, Jan Kara wrote:
> On Wed 06-04-11 12:08:05, Vivek Goyal wrote:
> > On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote:
>   Well, I wouldn't bother too much with kswapd handling. MM people plan to
> get rid of writeback from direct reclaim and just remove the dirty page
> from LRU and recycle it once flusher thread writes it...

kswapd is not in the direct reclaim path - it's the background
memory reclaim path.  Writeback from direct reclaim is a problem
because of stack usage, and that problem doesn't exist for kswapd.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback
  2011-04-07 23:42                                 ` Dave Chinner
  2011-04-08  0:59                                   ` Greg Thelen
@ 2011-04-08 13:43                                   ` Vivek Goyal
  1 sibling, 0 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-08 13:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Greg Thelen, Curt Wohlgemuth, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 08, 2011 at 09:42:49AM +1000, Dave Chinner wrote:
> On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote:
> > On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote:
> [...]
> > > > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
> > > > if the list is now empty.
> > > > 
> > > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
> > > >    if over bg limit, then
> > > >        set bdi_memcg->b_over_limit
> > > >            If there is no bdi_memcg (because all inodes of current’s
> > > >            memcg dirty pages where first dirtied by other memcg) then
> > > >            memcg lru to find inode and call writeback_single_inode().
> > > >            This is to handle uncommon sharing.
> > > 
> > > We don't want to introduce any new IO sources into
> > > balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
> > > flusher writeback, not try to write back inodes itself.
> > 
> > Will we not enjoy more sequtial IO traffic once we find an inode by
> > traversing memcg->lru list? So isn't that better than pure LRU based
> > flushing?
> 
> Sorry, I wasn't particularly clear there, What I meant was that we
> ask the bdi-flusher thread to select the inode to write back from
> the LRU, not do it directly from balance_dirty_pages(). i.e.
> bdp stays IO-less.

Agreed. Even with cgroup aware writeback, we use bdi-flusher threads to
do writeback and no direct writeback in bdp.

> 
> > > Alternatively, this problem won't exist if you transfer page щache
> > > state from one memcg to another when you move the inode from one
> > > memcg to another.
> > 
> > But in case of shared inode problem still remains. inode is being written
> > from two cgroups and it can't be in both the groups as per the exisiting
> > design.
> 
> But we've already determined that there is no use case for this
> shared inode behaviour, so we aren't going to explictly support it,
> right?

Well, we are not designing for shared inode to begin with but one can
easily create that situation. So atleast we need to have some defined
behavior that what happens if inodes are shared across multiple processes
in same cgroup and across cgroups. Database might have multiple
threads/processes doing IO to single file. What if somebody moves some
threads out to a separate cgroup etc.

So I am not saying that is common configuration but we need to define 
system behavior properly if sharing does happen.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-07 23:47                             ` Dave Chinner
@ 2011-04-08 13:50                               ` Vivek Goyal
  2011-04-11  1:05                                 ` Dave Chinner
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-08 13:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Curt Wohlgemuth, Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 08, 2011 at 09:47:17AM +1000, Dave Chinner wrote:
> On Thu, Apr 07, 2011 at 04:04:37PM -0400, Vivek Goyal wrote:
> > On Thu, Apr 07, 2011 at 09:08:04AM +1000, Dave Chinner wrote:
> > 
> > [..]
> > > > At the very least, when a task is moved from one cgroup to another,
> > > > we've got a shared inode case.  This probably won't happen more than
> > > > once for most tasks, but it will likely be common.
> > > 
> > > That's not a shared case, that's a transfer of ownership. If the
> > > task changes groups, you have to charge all it's pages to the new
> > > group, right? Otherwise you've got a problem where a task that is
> > > not part of a specific cgroup is still somewhat controlled by it's
> > > previous cgroup. It would also still influence that previous group
> > > even though it's no longer a member. Not good for isolation purposes.
> > > 
> > > And if you are transfering the state, moving the inode from the
> > > dirty list of one cgroup to another is trivial and avoids any need
> > > for the dirty state to be shared....
> > 
> > I am wondering how do you map a task to an inode. Multiple tasks in the
> > group might have written to same inode. Now which task owns it? 
> 
> That sounds like a completely broken configuration to me. If you are
> using cgroups for isolation, you simple do not share *anything*
> between them.
> 
> Right now the only use case that has been presented for shared
> inodes is transfering a task from one cgroup to another.

Moving applications dynamically across cgroups happens quite often 
just to put task in right cgroup after it has been launched or if
a task has been running for sometime and system admin decides that
it is causing heavy IO impacting other cgroup's IO. Then system
admin might move it into a separate cgroup on the fly.
 
> Why on
> earth would you do that if it is sharing resources with other tasks
> in the original cgroup? What use case does this represent, how often
> is it likely to happen, and who cares about it anyway?

> 
> Let's not overly complicate things by making up requirements that
> nobody cares about....

Ok, so you are suggesting that always assume that only one task has
written pages to inode and if that's not the case it is broken
cofiguration. 

So if a task moves across cgroups, determine the pages and associated
inodes and move everything to the new cgroup. If inode happend to be
shared, then inode moves irrespective of the fact somebody else also
was doing IO to it. I guess reasonable first step.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-08 13:50                               ` Vivek Goyal
@ 2011-04-11  1:05                                 ` Dave Chinner
  0 siblings, 0 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-11  1:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Curt Wohlgemuth, Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 08, 2011 at 09:50:58AM -0400, Vivek Goyal wrote:
> On Fri, Apr 08, 2011 at 09:47:17AM +1000, Dave Chinner wrote:
> > On Thu, Apr 07, 2011 at 04:04:37PM -0400, Vivek Goyal wrote:
> > > On Thu, Apr 07, 2011 at 09:08:04AM +1000, Dave Chinner wrote:
> > > 
> > > [..]
> > > > > At the very least, when a task is moved from one cgroup to another,
> > > > > we've got a shared inode case.  This probably won't happen more than
> > > > > once for most tasks, but it will likely be common.
> > > > 
> > > > That's not a shared case, that's a transfer of ownership. If the
> > > > task changes groups, you have to charge all it's pages to the new
> > > > group, right? Otherwise you've got a problem where a task that is
> > > > not part of a specific cgroup is still somewhat controlled by it's
> > > > previous cgroup. It would also still influence that previous group
> > > > even though it's no longer a member. Not good for isolation purposes.
> > > > 
> > > > And if you are transfering the state, moving the inode from the
> > > > dirty list of one cgroup to another is trivial and avoids any need
> > > > for the dirty state to be shared....
> > > 
> > > I am wondering how do you map a task to an inode. Multiple tasks in the
> > > group might have written to same inode. Now which task owns it? 
> > 
> > That sounds like a completely broken configuration to me. If you are
> > using cgroups for isolation, you simple do not share *anything*
> > between them.
> > 
> > Right now the only use case that has been presented for shared
> > inodes is transfering a task from one cgroup to another.
> 
> Moving applications dynamically across cgroups happens quite often 
> just to put task in right cgroup after it has been launched

If it's just been launched, it won't have dirtied very many files so
I think shared dirty inodes for this use case is not an issue.

> or if
> a task has been running for sometime and system admin decides that
> it is causing heavy IO impacting other cgroup's IO. Then system
> admin might move it into a separate cgroup on the fly.

And I'd expect manual load balancing to be the exception rather than
the rule. Even so, if that process is doing lots of IO to the same
file as other tasks that it is interfering with, then there's an
application level problem there....

> > Why on
> > earth would you do that if it is sharing resources with other tasks
> > in the original cgroup? What use case does this represent, how often
> > is it likely to happen, and who cares about it anyway?
> 
> > 
> > Let's not overly complicate things by making up requirements that
> > nobody cares about....
> 
> Ok, so you are suggesting that always assume that only one task has
> written pages to inode and if that's not the case it is broken
> cofiguration. 

Not broken, but initially unsupported.

> So if a task moves across cgroups, determine the pages and associated
> inodes and move everything to the new cgroup. If inode happend to be
> shared, then inode moves irrespective of the fact somebody else also
> was doing IO to it. I guess reasonable first step.

It seems like the simplest way to start - once we have code that
works doing the simple things right we can start to complicate it ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-07 17:55                           ` Vivek Goyal
@ 2011-04-11  1:36                             ` Dave Chinner
  2011-04-15 21:07                               ` Vivek Goyal
  2011-04-19 14:17                               ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Wu Fengguang
  0 siblings, 2 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-11  1:36 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Thu, Apr 07, 2011 at 01:55:37PM -0400, Vivek Goyal wrote:
> On Thu, Apr 07, 2011 at 09:50:39AM +1000, Dave Chinner wrote:
> > On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote:
> > > On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote:
> > > > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
> > > > It also means you will have handle the case of a cgroup over a
> > > > throttle limit and no inodes on it's dirty list. It's not a case of
> > > > "probably can live with" the resultant mess, the mess will occur and
> > > > so handling it needs to be designed in from the start.
> > > 
> > > This behavior can happen due to shared page accounting. One possible
> > > way to mitigate this problme is to traverse through LRU list of pages
> > > of memcg and find an inode to do the writebak. 
> > 
> > Page LRU ordered writeback is something we need to avoid. It causes
> > havok with IO and allocation patterns. Also, how expensive is such a
> > walk? If it's a common operation, then it's a non-starter for the
> > generic writeback code.
> > 
> 
> Agreed that LRU ordered writeback needs to be avoided as it is going to be
> expensive. That's why the notion of inode on per memcg list and do per inode
> writeback. This seems to be only backup plan in case when there is no
> inode to do IO due to shared inode accounting issues. 

This shouldn't be hidden inside memcg relaim - memcg reclaim should
do exactly what the MM subsystem normal does without memcg being in
the picture.  That is, you need to convince the MM guys to change
the way reclaim does writeback from the LRU. We've been asking them
to do this for years....

> Do you have ideas on better way to handle it? The other proposal of
> maintaining a list memcg_mapping, which tracks which inode this cgroup
> has dirtied has been deemed complex and been kind of rejected at least
> for the first step.

Fix the mm subsystem to DTRT first?

> 
> > BTW, how is "shared page accounting" different to the shared dirty
> > inode case we've been discussing?
> 
> IIUC, there are two problems.
> 
> - Issues because of shared page accounting
> - Issues because of shared inode accouting.
> 
> So in shared page accounting, if two process do IO to same page, IO gets
> charged to cgroup who first touched the page. So if a cgroup is writting
> on lots of shared pages, it will be charged to the other cgroup who 
> brought the page in memory to begin with and will drive its dirty ratio
> up. So this seems to be case of weaker isolation in case of shared pages,
> and we got to live with it.
> 
> Similarly if inode is shared, inode gets put on the list of memcg who dirtied
> it first. So now if two cgroups are dirtying pages on inode, then pages should
> be charged to respective cgroup but inode will be only on one memcg and once
> writeback is performed it might happen that cgroup is over its background
> limit but there are no inodes to do writeback.
> 
> > 
> > > After yesterday's discussion it looked like people agreed that to 
> > > begin with keep it simple and maintain the notion of one inode on
> > > one memcg list. So instead of inode being on global bdi dirty list
> > > it will be on per memecg per bdi dirty list.
> > 
> > Good to hear.
> > 
> > > > how metadata IO is going to be handled by
> > > > IO controllers,
> > > 
> > > So IO controller provides two mechanisms.
> > > 
> > > - IO throttling(bytes_per_second, io_per_second interface)
> > > - Proportional weight disk sharing
> > > 
> > > In case of proportional weight disk sharing, we don't run into issues of
> > > priority inversion and metadata handing should not be a concern.
> > 
> > Though metadata IO will affect how much bandwidth/iops is available
> > for applications to use.
> 
> I think meta data IO will be accounted to the process submitting the meta
> data IO. (IO tracking stuff will be used only for page cache pages during
> page dirtying time). So yes, the process doing meta data IO will be
> charged for it. 
> 
> I think I am missing something here and not understanding your concern
> exactly here.

XFS can issue thousands of delayed metadata write IO per second from
it's writeback threads when it needs to (e.g. tail pushing the
journal).  Completely unthrottled due to the context they are issued
from(*) and can basically consume all the disk iops and bandwidth
capacity for seconds at a time. 

Also, XFS doesn't use the page cache for metadata buffers anymore
so page cache accounting, throttling and reclaim mechanisms
are never going to work for controlling XFS metadata IO


(*) It'll be IO issued by workqueues rather than threads RSN:

http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39

And this will become _much_ more common in the not-to-distant
future. So context passing between threads and to workqueues is
something you need to think about sooner rather than later if you
want metadata IO to be throttled in any way....

> > > For throttling case, apart from metadata, I found that with simple
> > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > not be done at device level instead try to do it in higher layers,
> > > possibly balance_dirty_pages() and throttle process early.
> > 
> > The problem with doing it at the page cache entry level is that
> > cache hits then get throttled. It's not really a an IO controller at
> > that point, and the impact on application performance could be huge
> > (i.e. MB/s instead of GB/s).
> 
> Agreed that throttling cache hits is not a good idea. Can we determine
> if page being asked for is in cache or not and charge for IO accordingly.

You'd need hooks in find_or_create_page(), though you have no
context of whether a read or a write is in progress at that point.

> > > So yes, I agree that little more documentation and more clarity on this
> > > would be good. All this cgroup aware writeback primarily is being done
> > > for CFQ's proportional disk sharing at the moment.
> > > 
> > > > what kswapd is going to do writeback when the pages
> > > > it's trying to writeback during a critical low memory event belong
> > > > to a cgroup that is throttled at the IO level, etc.
> > > 
> > > Throttling will move up so kswapd will not be throttled. Even today,
> > > kswapd is part of root group and we do not suggest throttling root group.
> > 
> > So once again you have the problem of writeback from kswapd (which
> > is ugly to begin with) affecting all the groups. Given kswapd likes
> > to issue what is effectively random IO, this coul dhave devastating
> > effect on everything else....
> 
> Implementing throttling at higher layer has the problem of IO spikes
> at the end level devices when flusher or kswapd decide to do bunch of
> IO. I really don't have a good answer for that. Doing throttling at
> device level runs into issues with journalling. So I guess issues of
> IO spikes is lesser concern as compared to issue of choking filesystem.
> 
> Following two things might help though a bit with IO spikes.
> 
> - Keep per cgroup background dirty ratio low so that flusher tries to
>   flush out pages sooner than later.

Which has major performance impacts.
> 
> - All the IO coming from flusher/kswapd will be going in root group
>   from throttling perspective. We can try to throttle it again to
>   some reasonable value to reduce the impact of IO spikes. 

Don't do writeback from kswapd at all? Push it all to the flusher
thread which has a context to work from?

> > > For the case of proportional disk sharing, we will probably account
> > > IO to respective cgroups (pages submitted by kswapd) and that should
> > > not flush to disk fairly fast and should not block for long time as it is
> > > work consering mechanism.
> > 
> > Well, it depends. I can still see how, with proportional IO, kswapd
> > would get slowed cleaning dirty pages on one memcg when there are
> > clean pages in another memcg that it could reclaim without doing any
> > IO. i.e. it has potential to slow down memory reclaim significantly.
> > (Note, I'm assuming proportional IO doesn't mean "no throttling" it
> > just means there is a much lower delay on IO issue.)
> 
> Proportional IO can delay submitting an IO only if there is IO happening
> in other groups. So IO can still be throttled and limits are decided
> by fair share of a group. But if other groups are not doing IO and not
> using their fair share, then the group doing IO gets bigger share.
> 
> So yes, if heavy IO is happening at disk while kswapd is also trying
> to reclaim memory, then IO submitted by kswapd can be delayed and
> this can slow down reclaim. (Does kswapd has to block after submitting
> IO from a memcg. Can't it just move onto next memcg and either free
> pages if not dirty, or also submit IO from next memcg?)

No idea - you'll need to engage the mm guys to get help there.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback
  2011-04-08  1:25                                       ` Dave Chinner
  (?)
@ 2011-04-12  3:17                                       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 166+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-12  3:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Greg Thelen, Vivek Goyal, Curt Wohlgemuth, James Bottomley, lsf,
	linux-fsdevel, linux-mm

On Fri, 8 Apr 2011 11:25:56 +1000
Dave Chinner <david@fromorbit.com> wrote:

> On Thu, Apr 07, 2011 at 05:59:35PM -0700, Greg Thelen wrote:
> > cc: linux-mm
> > 
> > Dave Chinner <david@fromorbit.com> writes:

> > If we later find that this supposed uncommon shared inode case is
> > important then we can either implement the previously described lru
> > scanning in mem_cgroup_balance_dirty_pages() or consider extending the
> > bdi/memcg/inode data structures (perhaps with a memcg_mapping) to
> > describe such sharing.
> 
> Hmm, another idea I just had. What we're trying to avoid is needing
> to a) track inodes in multiple lists, and b) scanning to find
> something appropriate to write back.
> 
> Rather than tracking at page or inode granularity, how about
> tracking "associated" memcgs at the memcg level? i.e. when we detect
> an inode is already dirty in another memcg, link the current memcg
> to the one that contains the inode. Hence if we get a situation
> where a memcg is throttling with no dirty inodes, it can quickly
> find and start writeback in an "associated" memcg that it _knows_
> contain shared dirty inodes. Once we've triggered writeback on an
> associated memcg, it is removed from the list....
> 

Thank you for an idea. I think we can start from following.

 0. add some feature to set 'preferred inode' for memcg.
    I think
      fadvise(fd, MAKE_THIF_FILE_UNDER_MY_MEMCG)
    or
      echo fd > /memory.move_file_here
    can be added. 

 1. account dirty pages for a memcg. as Greg does.
 2. at the same time, account dirty pages made dirty by threads in a memcg.
    (to check which internal/external thread made page dirty.)
 3. calculate internal/external dirty pages gap.
 
 With gap, we can have several choices.

 4-a. If it exceeds some thresh, do some notify.
      userland daemon can decide to move pages to some memcg or not.
      (Of coruse, if the _shared_ dirty can be caught before making page dirty,
       user daemon can move inode before making it dirty by inotify().)

      I like helps of userland because it can be more flexible than kernel,
      it can eat config files.

 4-b. set some flag to memcg as 'this memcg is dirty busy because of some extarnal
      threads'. When a page is newly dirtied, check the thread's memcg.
      If the memcg of a thread and a page is different from each other,
      write a memo as 'please check this memcgid , too' in task_struct and
      do double-memcg-check in balance_dirty_pages().
      (How to clear per-task flag is difficult ;)

      I don't want to handle 3-100 threads does shared write case..;)
      we'll need 4-a.
 

Thanks,
-Kame








--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-11  1:36                             ` Dave Chinner
@ 2011-04-15 21:07                               ` Vivek Goyal
  2011-04-16  3:06                                 ` Vivek Goyal
  2011-04-19 14:17                               ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Wu Fengguang
  1 sibling, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-15 21:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Mon, Apr 11, 2011 at 11:36:30AM +1000, Dave Chinner wrote:

[..]
> > > > > how metadata IO is going to be handled by
> > > > > IO controllers,
> > > > 
> > > > So IO controller provides two mechanisms.
> > > > 
> > > > - IO throttling(bytes_per_second, io_per_second interface)
> > > > - Proportional weight disk sharing
> > > > 
> > > > In case of proportional weight disk sharing, we don't run into issues of
> > > > priority inversion and metadata handing should not be a concern.
> > > 
> > > Though metadata IO will affect how much bandwidth/iops is available
> > > for applications to use.
> > 
> > I think meta data IO will be accounted to the process submitting the meta
> > data IO. (IO tracking stuff will be used only for page cache pages during
> > page dirtying time). So yes, the process doing meta data IO will be
> > charged for it. 
> > 
> > I think I am missing something here and not understanding your concern
> > exactly here.
> 
> XFS can issue thousands of delayed metadata write IO per second from
> it's writeback threads when it needs to (e.g. tail pushing the
> journal).  Completely unthrottled due to the context they are issued
> from(*) and can basically consume all the disk iops and bandwidth
> capacity for seconds at a time. 
> 
> Also, XFS doesn't use the page cache for metadata buffers anymore
> so page cache accounting, throttling and reclaim mechanisms
> are never going to work for controlling XFS metadata IO
> 
> 
> (*) It'll be IO issued by workqueues rather than threads RSN:
> 
> http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39
> 
> And this will become _much_ more common in the not-to-distant
> future. So context passing between threads and to workqueues is
> something you need to think about sooner rather than later if you
> want metadata IO to be throttled in any way....

Ok,

So this seems to the similar case as WRITE traffic from flusher threads
which can disrupt IO on end device even if we have done throttling in
balance_dirty_pages().

How about doing throttling at two layers. All the data throttling is
done in higher layers and then also retain the mechanism of throttling
at end device. That way an admin can put a overall limit on such 
common write traffic. (XFS meta data coming from workqueues, flusher
thread, kswapd etc).

Anyway, we can't attribute this IO to per process context/group otherwise
most likely something will get serialized in higher layers.
 
Right now I am speaking purely from IO throttling point of view and not
even thinking about CFQ and IO tracking stuff.

This increases the complexity in IO cgroup interface as now we see to have
four combinations.

  Global Throttling
  	Throttling at lower layers
  	Throttling at higher layers.

  Per device throttling
 	 Throttling at lower layers
  	Throttling at higher layers.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-15 21:07                               ` Vivek Goyal
@ 2011-04-16  3:06                                 ` Vivek Goyal
  2011-04-18 21:58                                   ` Jan Kara
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-16  3:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote:
> On Mon, Apr 11, 2011 at 11:36:30AM +1000, Dave Chinner wrote:
> 
> [..]
> > > > > > how metadata IO is going to be handled by
> > > > > > IO controllers,
> > > > > 
> > > > > So IO controller provides two mechanisms.
> > > > > 
> > > > > - IO throttling(bytes_per_second, io_per_second interface)
> > > > > - Proportional weight disk sharing
> > > > > 
> > > > > In case of proportional weight disk sharing, we don't run into issues of
> > > > > priority inversion and metadata handing should not be a concern.
> > > > 
> > > > Though metadata IO will affect how much bandwidth/iops is available
> > > > for applications to use.
> > > 
> > > I think meta data IO will be accounted to the process submitting the meta
> > > data IO. (IO tracking stuff will be used only for page cache pages during
> > > page dirtying time). So yes, the process doing meta data IO will be
> > > charged for it. 
> > > 
> > > I think I am missing something here and not understanding your concern
> > > exactly here.
> > 
> > XFS can issue thousands of delayed metadata write IO per second from
> > it's writeback threads when it needs to (e.g. tail pushing the
> > journal).  Completely unthrottled due to the context they are issued
> > from(*) and can basically consume all the disk iops and bandwidth
> > capacity for seconds at a time. 
> > 
> > Also, XFS doesn't use the page cache for metadata buffers anymore
> > so page cache accounting, throttling and reclaim mechanisms
> > are never going to work for controlling XFS metadata IO
> > 
> > 
> > (*) It'll be IO issued by workqueues rather than threads RSN:
> > 
> > http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39
> > 
> > And this will become _much_ more common in the not-to-distant
> > future. So context passing between threads and to workqueues is
> > something you need to think about sooner rather than later if you
> > want metadata IO to be throttled in any way....
> 
> Ok,
> 
> So this seems to the similar case as WRITE traffic from flusher threads
> which can disrupt IO on end device even if we have done throttling in
> balance_dirty_pages().
> 
> How about doing throttling at two layers. All the data throttling is
> done in higher layers and then also retain the mechanism of throttling
> at end device. That way an admin can put a overall limit on such 
> common write traffic. (XFS meta data coming from workqueues, flusher
> thread, kswapd etc).
> 
> Anyway, we can't attribute this IO to per process context/group otherwise
> most likely something will get serialized in higher layers.
>  
> Right now I am speaking purely from IO throttling point of view and not
> even thinking about CFQ and IO tracking stuff.
> 
> This increases the complexity in IO cgroup interface as now we see to have
> four combinations.
> 
>   Global Throttling
>   	Throttling at lower layers
>   	Throttling at higher layers.
> 
>   Per device throttling
>  	 Throttling at lower layers
>   	Throttling at higher layers.

Dave, 

I wrote above but I myself am not fond of coming up with 4 combinations.
Want to limit it two. Per device throttling or global throttling. Here
are some more thoughts in general about both throttling policy and
proportional policy of IO controller. For throttling policy, I am 
primarily concerned with how to avoid file system serialization issues.

Proportional IO (CFQ)
---------------------
- Make writeback cgroup aware and kernel threads (flusher) which are
  cgroup aware can be marked with a task flag (GROUP_AWARE). If a 
  cgroup aware kernel threads throws IO at CFQ, then IO is accounted
  to cgroup of task who originally dirtied the page. Otherwise we use
  task context to account the IO to.

  So any IO submitted by flusher threads will go to respective cgroups
  and higher weight cgroup should be able to do more WRITES.

  IO submitted by other kernel threads like kjournald, XFS async metadata
  submission, kswapd etc all goes to thread context and that is root
  group.

- If kswapd is a concern then either make kswapd cgroup aware or let
  kswapd use cgroup aware flusher to do IO (Dave Chinner's idea).

Open Issues
-----------
- We do not get isolation for meta data IO. In virtualized setup, to
  achieve stronger isolation do not use host filesystem. Export block
  devices into guests.

IO throttling
------------

READS
-----
- Do not throttle meta data IO. Filesystem needs to mark READ metadata
  IO so that we can avoid throttling it. This way ordered filesystems
  will not get serialized behind a throttled read in slow group.

  May be one can account meta data read to a group and try to use that
  to throttle data IO in same cgroup as a compensation.
 
WRITES
------
- Throttle tasks. Do not throttle bios. That means that when a task
  submits direct write, let it go to disk. Do the accounting and if task
  is exceeding the IO rate make it sleep. Something similar to
  balance_dirty_pages().

  That way, any direct WRITES should not run into any serialization issues
  in ordered mode. We can continue to use blkio_throtle_bio() hook in
  generic_make request().

- For buffered WRITES, design a throttling hook similar to
  balance_drity_pages() and throttle tasks according to rules while they
  are dirtying page cache.

- Do not throttle buffered writes again at the end device as these have
  been throttled already while writting to page cache. Also throttling
  WRITES at end device will lead to serialization issues with file systems
  in ordered mode.

- Cgroup of a IO is always attributed to submitting thread. That way all
  meta data writes will go in root cgroup and remain unthrottled. If one
  is too concerned with lots of meta data IO, then probably one can
  put a throttling rule in root cgroup.


Open Issues
-----------
- IO spikes at end devices

  Because buffered writes are controlled at page dirtying time, we can 
  have a spike of IO later at end device when flusher thread decides to
  do writeback. 

  I am not sure how to solve this issue. Part of the problem can be
  handled by using per cgroup dirty ratio and keeping each cgroup's
  ratio low so that we don't build up huge dirty caches. This can lead
  to performance drop of applications. So this is performance vs isolation
  trade off and user chooses one.

  This issue exists in virtualized environment only if host file system
  is used. The best way to achieve maximum isolation would be to export
  block devices into guest and then perform throttling per block device.

- Poor isolation for meta data.

  We can't account and throttle meta data in each cgroup otherwise we
  should again run into file system serialization issues in ordered
  mode. So this is a trade off of using file systems. You primarily get
  throttling for data IO and not meta data IO. 

  Again, export block devices in virtual machines and create file systems
  on that and do not use host filesystem and one can achieve a very good
  isolation.

Thoughts?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-16  3:06                                 ` Vivek Goyal
@ 2011-04-18 21:58                                   ` Jan Kara
  2011-04-18 22:51                                     ` cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Jan Kara @ 2011-04-18 21:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dave Chinner, Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Fri 15-04-11 23:06:02, Vivek Goyal wrote:
> On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote:
> > How about doing throttling at two layers. All the data throttling is
> > done in higher layers and then also retain the mechanism of throttling
> > at end device. That way an admin can put a overall limit on such 
> > common write traffic. (XFS meta data coming from workqueues, flusher
> > thread, kswapd etc).
> > 
> > Anyway, we can't attribute this IO to per process context/group otherwise
> > most likely something will get serialized in higher layers.
> >  
> > Right now I am speaking purely from IO throttling point of view and not
> > even thinking about CFQ and IO tracking stuff.
> > 
> > This increases the complexity in IO cgroup interface as now we see to have
> > four combinations.
> > 
> >   Global Throttling
> >   	Throttling at lower layers
> >   	Throttling at higher layers.
> > 
> >   Per device throttling
> >  	 Throttling at lower layers
> >   	Throttling at higher layers.
> 
> Dave, 
> 
> I wrote above but I myself am not fond of coming up with 4 combinations.
> Want to limit it two. Per device throttling or global throttling. Here
> are some more thoughts in general about both throttling policy and
> proportional policy of IO controller. For throttling policy, I am 
> primarily concerned with how to avoid file system serialization issues.
> 
> Proportional IO (CFQ)
> ---------------------
> - Make writeback cgroup aware and kernel threads (flusher) which are
>   cgroup aware can be marked with a task flag (GROUP_AWARE). If a 
>   cgroup aware kernel threads throws IO at CFQ, then IO is accounted
>   to cgroup of task who originally dirtied the page. Otherwise we use
>   task context to account the IO to.
> 
>   So any IO submitted by flusher threads will go to respective cgroups
>   and higher weight cgroup should be able to do more WRITES.
> 
>   IO submitted by other kernel threads like kjournald, XFS async metadata
>   submission, kswapd etc all goes to thread context and that is root
>   group.
> 
> - If kswapd is a concern then either make kswapd cgroup aware or let
>   kswapd use cgroup aware flusher to do IO (Dave Chinner's idea).
> 
> Open Issues
> -----------
> - We do not get isolation for meta data IO. In virtualized setup, to
>   achieve stronger isolation do not use host filesystem. Export block
>   devices into guests.
> 
> IO throttling
> ------------
> 
> READS
> -----
> - Do not throttle meta data IO. Filesystem needs to mark READ metadata
>   IO so that we can avoid throttling it. This way ordered filesystems
>   will not get serialized behind a throttled read in slow group.
> 
>   May be one can account meta data read to a group and try to use that
>   to throttle data IO in same cgroup as a compensation.
>  
> WRITES
> ------
> - Throttle tasks. Do not throttle bios. That means that when a task
>   submits direct write, let it go to disk. Do the accounting and if task
>   is exceeding the IO rate make it sleep. Something similar to
>   balance_dirty_pages().
> 
>   That way, any direct WRITES should not run into any serialization issues
>   in ordered mode. We can continue to use blkio_throtle_bio() hook in
>   generic_make request().
> 
> - For buffered WRITES, design a throttling hook similar to
>   balance_drity_pages() and throttle tasks according to rules while they
>   are dirtying page cache.
> 
> - Do not throttle buffered writes again at the end device as these have
>   been throttled already while writting to page cache. Also throttling
>   WRITES at end device will lead to serialization issues with file systems
>   in ordered mode.
> 
> - Cgroup of a IO is always attributed to submitting thread. That way all
>   meta data writes will go in root cgroup and remain unthrottled. If one
>   is too concerned with lots of meta data IO, then probably one can
>   put a throttling rule in root cgroup.
  But I think the above scheme basically allows agressive buffered writer
to occupy as much of disk throughput as throttling at page dirty time
allows. So either you'd have to seriously limit the speed of page dirtying
for each cgroup (effectively giving each write properties like direct write)
or you'd have to live with cgroup taking your whole disk throughput. Neither
of which seems very appealing. Grumble, not that I have a good solution to
this problem...

> Open Issues
> -----------
> - IO spikes at end devices
> 
>   Because buffered writes are controlled at page dirtying time, we can 
>   have a spike of IO later at end device when flusher thread decides to
>   do writeback. 
> 
>   I am not sure how to solve this issue. Part of the problem can be
>   handled by using per cgroup dirty ratio and keeping each cgroup's
>   ratio low so that we don't build up huge dirty caches. This can lead
>   to performance drop of applications. So this is performance vs isolation
>   trade off and user chooses one.
> 
>   This issue exists in virtualized environment only if host file system
>   is used. The best way to achieve maximum isolation would be to export
>   block devices into guest and then perform throttling per block device.
> 
> - Poor isolation for meta data.
> 
>   We can't account and throttle meta data in each cgroup otherwise we
>   should again run into file system serialization issues in ordered
>   mode. So this is a trade off of using file systems. You primarily get
>   throttling for data IO and not meta data IO. 
> 
>   Again, export block devices in virtual machines and create file systems
>   on that and do not use host filesystem and one can achieve a very good
>   isolation.
> 
> Thoughts?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 166+ messages in thread

* cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF))
  2011-04-18 21:58                                   ` Jan Kara
@ 2011-04-18 22:51                                     ` Vivek Goyal
  2011-04-19  0:33                                       ` Dave Chinner
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-18 22:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Greg Thelen, James Bottomley, lsf, linux-fsdevel,
	linux kernel mailing list

On Mon, Apr 18, 2011 at 11:58:44PM +0200, Jan Kara wrote:
> On Fri 15-04-11 23:06:02, Vivek Goyal wrote:
> > On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote:
> > > How about doing throttling at two layers. All the data throttling is
> > > done in higher layers and then also retain the mechanism of throttling
> > > at end device. That way an admin can put a overall limit on such 
> > > common write traffic. (XFS meta data coming from workqueues, flusher
> > > thread, kswapd etc).
> > > 
> > > Anyway, we can't attribute this IO to per process context/group otherwise
> > > most likely something will get serialized in higher layers.
> > >  
> > > Right now I am speaking purely from IO throttling point of view and not
> > > even thinking about CFQ and IO tracking stuff.
> > > 
> > > This increases the complexity in IO cgroup interface as now we see to have
> > > four combinations.
> > > 
> > >   Global Throttling
> > >   	Throttling at lower layers
> > >   	Throttling at higher layers.
> > > 
> > >   Per device throttling
> > >  	 Throttling at lower layers
> > >   	Throttling at higher layers.
> > 
> > Dave, 
> > 
> > I wrote above but I myself am not fond of coming up with 4 combinations.
> > Want to limit it two. Per device throttling or global throttling. Here
> > are some more thoughts in general about both throttling policy and
> > proportional policy of IO controller. For throttling policy, I am 
> > primarily concerned with how to avoid file system serialization issues.
> > 
> > Proportional IO (CFQ)
> > ---------------------
> > - Make writeback cgroup aware and kernel threads (flusher) which are
> >   cgroup aware can be marked with a task flag (GROUP_AWARE). If a 
> >   cgroup aware kernel threads throws IO at CFQ, then IO is accounted
> >   to cgroup of task who originally dirtied the page. Otherwise we use
> >   task context to account the IO to.
> > 
> >   So any IO submitted by flusher threads will go to respective cgroups
> >   and higher weight cgroup should be able to do more WRITES.
> > 
> >   IO submitted by other kernel threads like kjournald, XFS async metadata
> >   submission, kswapd etc all goes to thread context and that is root
> >   group.
> > 
> > - If kswapd is a concern then either make kswapd cgroup aware or let
> >   kswapd use cgroup aware flusher to do IO (Dave Chinner's idea).
> > 
> > Open Issues
> > -----------
> > - We do not get isolation for meta data IO. In virtualized setup, to
> >   achieve stronger isolation do not use host filesystem. Export block
> >   devices into guests.
> > 
> > IO throttling
> > ------------
> > 
> > READS
> > -----
> > - Do not throttle meta data IO. Filesystem needs to mark READ metadata
> >   IO so that we can avoid throttling it. This way ordered filesystems
> >   will not get serialized behind a throttled read in slow group.
> > 
> >   May be one can account meta data read to a group and try to use that
> >   to throttle data IO in same cgroup as a compensation.
> >  
> > WRITES
> > ------
> > - Throttle tasks. Do not throttle bios. That means that when a task
> >   submits direct write, let it go to disk. Do the accounting and if task
> >   is exceeding the IO rate make it sleep. Something similar to
> >   balance_dirty_pages().
> > 
> >   That way, any direct WRITES should not run into any serialization issues
> >   in ordered mode. We can continue to use blkio_throtle_bio() hook in
> >   generic_make request().
> > 
> > - For buffered WRITES, design a throttling hook similar to
> >   balance_drity_pages() and throttle tasks according to rules while they
> >   are dirtying page cache.
> > 
> > - Do not throttle buffered writes again at the end device as these have
> >   been throttled already while writting to page cache. Also throttling
> >   WRITES at end device will lead to serialization issues with file systems
> >   in ordered mode.
> > 
> > - Cgroup of a IO is always attributed to submitting thread. That way all
> >   meta data writes will go in root cgroup and remain unthrottled. If one
> >   is too concerned with lots of meta data IO, then probably one can
> >   put a throttling rule in root cgroup.
>   But I think the above scheme basically allows agressive buffered writer
> to occupy as much of disk throughput as throttling at page dirty time
> allows. So either you'd have to seriously limit the speed of page dirtying
> for each cgroup (effectively giving each write properties like direct write)
> or you'd have to live with cgroup taking your whole disk throughput. Neither
> of which seems very appealing. Grumble, not that I have a good solution to
> this problem...

[CCing lkml]

Hi Jan,

I agree that if we do throttling in balance_dirty_pages() to solve the
issue of file system ordered mode, then we allow flusher threads to
write data at high rate which is bad. Keeping write throttling at device
level runs into issues of file system ordered mode write.

I think problem is that file systems are not cgroup aware (/me runs for
cover) and we are just trying to work around that hence none of the proposed
problem solution is not satisfying.

To get cgroup thing right, we shall have to make whole stack cgroup aware.
In this case because file system journaling is not cgroup aware and is
essentially a serialized operation and life becomes hard. Throttling is
in higher layer is not a good solution and throttling in lower layer
is not a good solution either.

Ideally, throttling in generic_make_request() is good as long as all the
layers sitting above it (file systems, flusher writeback, page cache share)
can be made cgroup aware. So that if a cgroup is throttled, others cgroup
are more or less not impacted by throttled cgroup. We have talked about
making flusher cgroup aware and per cgroup dirty ratio thing, but making
file system journalling cgroup aware seems to be out of question (I don't
even know if it is possible to do and how much work does it involve).

I will try to summarize the options I have thought about so far.

- Keep throttling at device level. Do not use it with host filesystems
  especially with ordered mode. So this is primarily useful in case of
  virtualization.

  Or recommend user to not configure too low limits on each cgroup. So
  once in a while file systems in ordered mode will get serialized and
  it will impact scalability but will not livelock the system.

- Move all write throttling in balance_dirty_pages(). This avoids ordering
  issues but introduce the issue of flusher writting at high speed also
  people have been looking for limiting traffic from a host coming to
  shared storage. It does not work very well there as we limit the IO
  rate coming into page cache and not going out of device. So there
  will be lot of bursts.

- Keep throttling at device level and do something magical in file systems
  journalling code so that it is more parallel and cgroup aware.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF))
  2011-04-18 22:51                                     ` cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) Vivek Goyal
@ 2011-04-19  0:33                                       ` Dave Chinner
  2011-04-19 14:30                                         ` Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Dave Chinner @ 2011-04-19  0:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel,
	linux kernel mailing list

On Mon, Apr 18, 2011 at 06:51:18PM -0400, Vivek Goyal wrote:
> On Mon, Apr 18, 2011 at 11:58:44PM +0200, Jan Kara wrote:
> > On Fri 15-04-11 23:06:02, Vivek Goyal wrote:
> > > On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote:
> > > > How about doing throttling at two layers. All the data throttling is
> > > > done in higher layers and then also retain the mechanism of throttling
> > > > at end device. That way an admin can put a overall limit on such 
> > > > common write traffic. (XFS meta data coming from workqueues, flusher
> > > > thread, kswapd etc).
> > > > 
> > > > Anyway, we can't attribute this IO to per process context/group otherwise
> > > > most likely something will get serialized in higher layers.
> > > >  
> > > > Right now I am speaking purely from IO throttling point of view and not
> > > > even thinking about CFQ and IO tracking stuff.
> > > > 
> > > > This increases the complexity in IO cgroup interface as now we see to have
> > > > four combinations.
> > > > 
> > > >   Global Throttling
> > > >   	Throttling at lower layers
> > > >   	Throttling at higher layers.
> > > > 
> > > >   Per device throttling
> > > >  	 Throttling at lower layers
> > > >   	Throttling at higher layers.
> > > 
> > > Dave, 
> > > 
> > > I wrote above but I myself am not fond of coming up with 4 combinations.
> > > Want to limit it two. Per device throttling or global throttling. Here
> > > are some more thoughts in general about both throttling policy and
> > > proportional policy of IO controller. For throttling policy, I am 
> > > primarily concerned with how to avoid file system serialization issues.
> > > 
> > > Proportional IO (CFQ)
> > > ---------------------
> > > - Make writeback cgroup aware and kernel threads (flusher) which are
> > >   cgroup aware can be marked with a task flag (GROUP_AWARE). If a 
> > >   cgroup aware kernel threads throws IO at CFQ, then IO is accounted
> > >   to cgroup of task who originally dirtied the page. Otherwise we use
> > >   task context to account the IO to.
> > > 
> > >   So any IO submitted by flusher threads will go to respective cgroups
> > >   and higher weight cgroup should be able to do more WRITES.
> > > 
> > >   IO submitted by other kernel threads like kjournald, XFS async metadata
> > >   submission, kswapd etc all goes to thread context and that is root
> > >   group.
> > > 
> > > - If kswapd is a concern then either make kswapd cgroup aware or let
> > >   kswapd use cgroup aware flusher to do IO (Dave Chinner's idea).
> > > 
> > > Open Issues
> > > -----------
> > > - We do not get isolation for meta data IO. In virtualized setup, to
> > >   achieve stronger isolation do not use host filesystem. Export block
> > >   devices into guests.
> > > 
> > > IO throttling
> > > ------------
> > > 
> > > READS
> > > -----
> > > - Do not throttle meta data IO. Filesystem needs to mark READ metadata
> > >   IO so that we can avoid throttling it. This way ordered filesystems
> > >   will not get serialized behind a throttled read in slow group.
> > > 
> > >   May be one can account meta data read to a group and try to use that
> > >   to throttle data IO in same cgroup as a compensation.
> > >  
> > > WRITES
> > > ------
> > > - Throttle tasks. Do not throttle bios. That means that when a task
> > >   submits direct write, let it go to disk. Do the accounting and if task
> > >   is exceeding the IO rate make it sleep. Something similar to
> > >   balance_dirty_pages().
> > > 
> > >   That way, any direct WRITES should not run into any serialization issues
> > >   in ordered mode. We can continue to use blkio_throtle_bio() hook in
> > >   generic_make request().
> > > 
> > > - For buffered WRITES, design a throttling hook similar to
> > >   balance_drity_pages() and throttle tasks according to rules while they
> > >   are dirtying page cache.
> > > 
> > > - Do not throttle buffered writes again at the end device as these have
> > >   been throttled already while writting to page cache. Also throttling
> > >   WRITES at end device will lead to serialization issues with file systems
> > >   in ordered mode.
> > > 
> > > - Cgroup of a IO is always attributed to submitting thread. That way all
> > >   meta data writes will go in root cgroup and remain unthrottled. If one
> > >   is too concerned with lots of meta data IO, then probably one can
> > >   put a throttling rule in root cgroup.
> >   But I think the above scheme basically allows agressive buffered writer
> > to occupy as much of disk throughput as throttling at page dirty time
> > allows. So either you'd have to seriously limit the speed of page dirtying
> > for each cgroup (effectively giving each write properties like direct write)
> > or you'd have to live with cgroup taking your whole disk throughput. Neither
> > of which seems very appealing. Grumble, not that I have a good solution to
> > this problem...
> 
> [CCing lkml]
> 
> Hi Jan,
> 
> I agree that if we do throttling in balance_dirty_pages() to solve the
> issue of file system ordered mode, then we allow flusher threads to
> write data at high rate which is bad. Keeping write throttling at device
> level runs into issues of file system ordered mode write.
> 
> I think problem is that file systems are not cgroup aware (/me runs for
> cover) and we are just trying to work around that hence none of the proposed
> problem solution is not satisfying.
> 
> To get cgroup thing right, we shall have to make whole stack cgroup aware.
> In this case because file system journaling is not cgroup aware and is
> essentially a serialized operation and life becomes hard. Throttling is
> in higher layer is not a good solution and throttling in lower layer
> is not a good solution either.
> 
> Ideally, throttling in generic_make_request() is good as long as all the
> layers sitting above it (file systems, flusher writeback, page cache share)
> can be made cgroup aware. So that if a cgroup is throttled, others cgroup
> are more or less not impacted by throttled cgroup. We have talked about
> making flusher cgroup aware and per cgroup dirty ratio thing, but making
> file system journalling cgroup aware seems to be out of question (I don't
> even know if it is possible to do and how much work does it involve).

If you want to throttle journal operations, then we probably need to
throttle metadata operations that commit to the journal, not the
journal IO itself.  The journal is a shared global resource that all
cgroups use, so throttling journal IO inappropriately will affect
the performance of all cgroups, not just the one that is "hogging"
it.

In XFS, you could probably do this at the transaction reservation
stage where log space is reserved. We know everything about the
transaction at this point in time, and we throttle here already when
the journal is full. Adding cgroup transaction limits to this point
would be the place to do it, but the control parameter for it would
be very XFS specific (i.e. number of transactions/s). Concurrency is
not an issue - the XFS transaction subsystem is only limited in
concurrency by the space available in the journal for reservations
(hundred to thousands of concurrent transactions).

FWIW, this would even allow per-bdi-flusher thread transaction
throttling parameters to be set, so writeback triggered metadata IO
could possibly be limited as well.

I'm not sure whether this is possible with other filesystems, and
ext3/4 would still have the issue of ordered writeback causing much
more writeback than expected at times (e.g. fsync), but I suspect
there is nothing that can really be done about this.

> I will try to summarize the options I have thought about so far.
> 
> - Keep throttling at device level. Do not use it with host filesystems
>   especially with ordered mode. So this is primarily useful in case of
>   virtualization.
> 
>   Or recommend user to not configure too low limits on each cgroup. So
>   once in a while file systems in ordered mode will get serialized and
>   it will impact scalability but will not livelock the system.
> 
> - Move all write throttling in balance_dirty_pages(). This avoids ordering
>   issues but introduce the issue of flusher writting at high speed also
>   people have been looking for limiting traffic from a host coming to
>   shared storage. It does not work very well there as we limit the IO
>   rate coming into page cache and not going out of device. So there
>   will be lot of bursts.
> 
> - Keep throttling at device level and do something magical in file systems
>   journalling code so that it is more parallel and cgroup aware.

I think the third approach is the best long term approach.

FWIW, if you really want cgroups integrated properly into XFS, then
they need to be integrated into the allocator as well so we can push
isolateed cgroups into different, non-contending regions of the
filesystem (similar to filestreams containers). I started on an
general allocation policy framework for XFS a few years ago, but
never had more than a POC prototype. I always intended this
framework to implement (at the time) a cpuset aware policy, so I'm
pretty sure such an approach would work for cgroups, too. Maybe it's
time to dust off that patch set....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-11  1:36                             ` Dave Chinner
  2011-04-15 21:07                               ` Vivek Goyal
@ 2011-04-19 14:17                               ` Wu Fengguang
  2011-04-19 14:34                                 ` Vivek Goyal
  1 sibling, 1 reply; 166+ messages in thread
From: Wu Fengguang @ 2011-04-19 14:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vivek Goyal, Greg Thelen, James Bottomley, lsf, linux-fsdevel

[snip]
> > > > For throttling case, apart from metadata, I found that with simple
> > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > not be done at device level instead try to do it in higher layers,
> > > > possibly balance_dirty_pages() and throttle process early.
> > > 
> > > The problem with doing it at the page cache entry level is that
> > > cache hits then get throttled. It's not really a an IO controller at
> > > that point, and the impact on application performance could be huge
> > > (i.e. MB/s instead of GB/s).
> > 
> > Agreed that throttling cache hits is not a good idea. Can we determine
> > if page being asked for is in cache or not and charge for IO accordingly.
> 
> You'd need hooks in find_or_create_page(), though you have no
> context of whether a read or a write is in progress at that point.

I'm confused.  Where is the throttling at cache hits?

The balance_dirty_pages() throttling kicks in at write() syscall and
page fault time. For example, generic_perform_write(), do_wp_page()
and __do_fault() will explicitly call
balance_dirty_pages_ratelimited() to do the write throttling.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-08  1:58                             ` Dave Chinner
@ 2011-04-19 14:26                               ` Wu Fengguang
  0 siblings, 0 replies; 166+ messages in thread
From: Wu Fengguang @ 2011-04-19 14:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, Vivek Goyal, James Bottomley, lsf, linux-fsdevel

On Fri, Apr 08, 2011 at 11:58:41AM +1000, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 07:10:17PM +0200, Jan Kara wrote:
> > On Wed 06-04-11 12:08:05, Vivek Goyal wrote:
> > > On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote:
> >   Well, I wouldn't bother too much with kswapd handling. MM people plan to
> > get rid of writeback from direct reclaim and just remove the dirty page
> > from LRU and recycle it once flusher thread writes it...
> 
> kswapd is not in the direct reclaim path - it's the background
> memory reclaim path.  Writeback from direct reclaim is a problem
> because of stack usage, and that problem doesn't exist for kswapd.

FYI the IO initiated from pageout() in kswapd/direct reclaim can
mostly be transfered to the flushers.

Here is the early RFC patch, and I'll submit an update soon.

http://www.spinics.net/lists/linux-mm/msg09199.html

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF))
  2011-04-19  0:33                                       ` Dave Chinner
@ 2011-04-19 14:30                                         ` Vivek Goyal
  2011-04-19 14:45                                           ` Jan Kara
                                                             ` (2 more replies)
  0 siblings, 3 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-19 14:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel,
	linux kernel mailing list

On Tue, Apr 19, 2011 at 10:33:39AM +1000, Dave Chinner wrote:
> On Mon, Apr 18, 2011 at 06:51:18PM -0400, Vivek Goyal wrote:
> > On Mon, Apr 18, 2011 at 11:58:44PM +0200, Jan Kara wrote:
> > > On Fri 15-04-11 23:06:02, Vivek Goyal wrote:
> > > > On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote:
> > > > > How about doing throttling at two layers. All the data throttling is
> > > > > done in higher layers and then also retain the mechanism of throttling
> > > > > at end device. That way an admin can put a overall limit on such 
> > > > > common write traffic. (XFS meta data coming from workqueues, flusher
> > > > > thread, kswapd etc).
> > > > > 
> > > > > Anyway, we can't attribute this IO to per process context/group otherwise
> > > > > most likely something will get serialized in higher layers.
> > > > >  
> > > > > Right now I am speaking purely from IO throttling point of view and not
> > > > > even thinking about CFQ and IO tracking stuff.
> > > > > 
> > > > > This increases the complexity in IO cgroup interface as now we see to have
> > > > > four combinations.
> > > > > 
> > > > >   Global Throttling
> > > > >   	Throttling at lower layers
> > > > >   	Throttling at higher layers.
> > > > > 
> > > > >   Per device throttling
> > > > >  	 Throttling at lower layers
> > > > >   	Throttling at higher layers.
> > > > 
> > > > Dave, 
> > > > 
> > > > I wrote above but I myself am not fond of coming up with 4 combinations.
> > > > Want to limit it two. Per device throttling or global throttling. Here
> > > > are some more thoughts in general about both throttling policy and
> > > > proportional policy of IO controller. For throttling policy, I am 
> > > > primarily concerned with how to avoid file system serialization issues.
> > > > 
> > > > Proportional IO (CFQ)
> > > > ---------------------
> > > > - Make writeback cgroup aware and kernel threads (flusher) which are
> > > >   cgroup aware can be marked with a task flag (GROUP_AWARE). If a 
> > > >   cgroup aware kernel threads throws IO at CFQ, then IO is accounted
> > > >   to cgroup of task who originally dirtied the page. Otherwise we use
> > > >   task context to account the IO to.
> > > > 
> > > >   So any IO submitted by flusher threads will go to respective cgroups
> > > >   and higher weight cgroup should be able to do more WRITES.
> > > > 
> > > >   IO submitted by other kernel threads like kjournald, XFS async metadata
> > > >   submission, kswapd etc all goes to thread context and that is root
> > > >   group.
> > > > 
> > > > - If kswapd is a concern then either make kswapd cgroup aware or let
> > > >   kswapd use cgroup aware flusher to do IO (Dave Chinner's idea).
> > > > 
> > > > Open Issues
> > > > -----------
> > > > - We do not get isolation for meta data IO. In virtualized setup, to
> > > >   achieve stronger isolation do not use host filesystem. Export block
> > > >   devices into guests.
> > > > 
> > > > IO throttling
> > > > ------------
> > > > 
> > > > READS
> > > > -----
> > > > - Do not throttle meta data IO. Filesystem needs to mark READ metadata
> > > >   IO so that we can avoid throttling it. This way ordered filesystems
> > > >   will not get serialized behind a throttled read in slow group.
> > > > 
> > > >   May be one can account meta data read to a group and try to use that
> > > >   to throttle data IO in same cgroup as a compensation.
> > > >  
> > > > WRITES
> > > > ------
> > > > - Throttle tasks. Do not throttle bios. That means that when a task
> > > >   submits direct write, let it go to disk. Do the accounting and if task
> > > >   is exceeding the IO rate make it sleep. Something similar to
> > > >   balance_dirty_pages().
> > > > 
> > > >   That way, any direct WRITES should not run into any serialization issues
> > > >   in ordered mode. We can continue to use blkio_throtle_bio() hook in
> > > >   generic_make request().
> > > > 
> > > > - For buffered WRITES, design a throttling hook similar to
> > > >   balance_drity_pages() and throttle tasks according to rules while they
> > > >   are dirtying page cache.
> > > > 
> > > > - Do not throttle buffered writes again at the end device as these have
> > > >   been throttled already while writting to page cache. Also throttling
> > > >   WRITES at end device will lead to serialization issues with file systems
> > > >   in ordered mode.
> > > > 
> > > > - Cgroup of a IO is always attributed to submitting thread. That way all
> > > >   meta data writes will go in root cgroup and remain unthrottled. If one
> > > >   is too concerned with lots of meta data IO, then probably one can
> > > >   put a throttling rule in root cgroup.
> > >   But I think the above scheme basically allows agressive buffered writer
> > > to occupy as much of disk throughput as throttling at page dirty time
> > > allows. So either you'd have to seriously limit the speed of page dirtying
> > > for each cgroup (effectively giving each write properties like direct write)
> > > or you'd have to live with cgroup taking your whole disk throughput. Neither
> > > of which seems very appealing. Grumble, not that I have a good solution to
> > > this problem...
> > 
> > [CCing lkml]
> > 
> > Hi Jan,
> > 
> > I agree that if we do throttling in balance_dirty_pages() to solve the
> > issue of file system ordered mode, then we allow flusher threads to
> > write data at high rate which is bad. Keeping write throttling at device
> > level runs into issues of file system ordered mode write.
> > 
> > I think problem is that file systems are not cgroup aware (/me runs for
> > cover) and we are just trying to work around that hence none of the proposed
> > problem solution is not satisfying.
> > 
> > To get cgroup thing right, we shall have to make whole stack cgroup aware.
> > In this case because file system journaling is not cgroup aware and is
> > essentially a serialized operation and life becomes hard. Throttling is
> > in higher layer is not a good solution and throttling in lower layer
> > is not a good solution either.
> > 
> > Ideally, throttling in generic_make_request() is good as long as all the
> > layers sitting above it (file systems, flusher writeback, page cache share)
> > can be made cgroup aware. So that if a cgroup is throttled, others cgroup
> > are more or less not impacted by throttled cgroup. We have talked about
> > making flusher cgroup aware and per cgroup dirty ratio thing, but making
> > file system journalling cgroup aware seems to be out of question (I don't
> > even know if it is possible to do and how much work does it involve).
> 
> If you want to throttle journal operations, then we probably need to
> throttle metadata operations that commit to the journal, not the
> journal IO itself.  The journal is a shared global resource that all
> cgroups use, so throttling journal IO inappropriately will affect
> the performance of all cgroups, not just the one that is "hogging"
> it.

Agreed.

> 
> In XFS, you could probably do this at the transaction reservation
> stage where log space is reserved. We know everything about the
> transaction at this point in time, and we throttle here already when
> the journal is full. Adding cgroup transaction limits to this point
> would be the place to do it, but the control parameter for it would
> be very XFS specific (i.e. number of transactions/s). Concurrency is
> not an issue - the XFS transaction subsystem is only limited in
> concurrency by the space available in the journal for reservations
> (hundred to thousands of concurrent transactions).

Instead of transaction per second, can we implement some kind of upper
limit of pending transactions per cgroup. And that limit does not have
to be user tunable to begin with. The effective transactions/sec rate
will automatically be determined by IO throttling rate of the cgroup
at the end nodes.

I think effectively what we need is that the notion of parallel
transactions so that transactions of one cgroup can make progress
independent of transactions of other cgroup. So if a process does
an fsync and it is throttled then it should block transaction of 
only that cgroup and not other cgroups.

You mentioned that concurrency is not an issue in XFS and hundreds of
thousands of concurrent trasactions can progress depending on log space
available. If that's the case, I think to begin with we might not have
to do anything at all. Processes can still get blocked but as long as
we have enough log space, this might not be a frequent event. I will
do some testing with XFS and see can I livelock the system with very
low IO limits.

> 
> FWIW, this would even allow per-bdi-flusher thread transaction
> throttling parameters to be set, so writeback triggered metadata IO
> could possibly be limited as well.

How does writeback trigger metadata IO?

In the first step I was looking to not throttle meta data IO as that
will require even more changes in file system layer. I was thinking
that if we provide throttling only for data and do changes in filesystems
so that concurrent transactions can exist and make progress and file
system IO does not serialize behind slow throttled cgroup.

This leads to weaker isolation but atleast we don't run into livelocking
or filesystem scalability issues. Once that's resolved, we can handle the
case of throttling meta data IO also.

In fact if metadata is dependent on data (in ordered mode) and if we are
throttling data, then we automatically throttle meata for select cases.

> 
> I'm not sure whether this is possible with other filesystems, and
> ext3/4 would still have the issue of ordered writeback causing much
> more writeback than expected at times (e.g. fsync), but I suspect
> there is nothing that can really be done about this.

Can't this be modified so that multiple per cgroup transactions can make
progress. So if one fsync is blocked, then processes in other cgroup
should still be able to do IO using a separate transaction and be able
to commit it.

> 
> > I will try to summarize the options I have thought about so far.
> > 
> > - Keep throttling at device level. Do not use it with host filesystems
> >   especially with ordered mode. So this is primarily useful in case of
> >   virtualization.
> > 
> >   Or recommend user to not configure too low limits on each cgroup. So
> >   once in a while file systems in ordered mode will get serialized and
> >   it will impact scalability but will not livelock the system.
> > 
> > - Move all write throttling in balance_dirty_pages(). This avoids ordering
> >   issues but introduce the issue of flusher writting at high speed also
> >   people have been looking for limiting traffic from a host coming to
> >   shared storage. It does not work very well there as we limit the IO
> >   rate coming into page cache and not going out of device. So there
> >   will be lot of bursts.
> > 
> > - Keep throttling at device level and do something magical in file systems
> >   journalling code so that it is more parallel and cgroup aware.
> 
> I think the third approach is the best long term approach.

I also like the third approach. It is complex but more sustabinable in
long term.

> 
> FWIW, if you really want cgroups integrated properly into XFS, then
> they need to be integrated into the allocator as well so we can push
> isolateed cgroups into different, non-contending regions of the
> filesystem (similar to filestreams containers). I started on an
> general allocation policy framework for XFS a few years ago, but
> never had more than a POC prototype. I always intended this
> framework to implement (at the time) a cpuset aware policy, so I'm
> pretty sure such an approach would work for cgroups, too. Maybe it's
> time to dust off that patch set....

So having separate allocation areas/groups for separate group is useful
from locking perspective? Is it useful even if we do not throttle
meta data?

I will be willing to test these patches if you decide to dust off old patches.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-19 14:17                               ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Wu Fengguang
@ 2011-04-19 14:34                                 ` Vivek Goyal
  2011-04-19 14:48                                   ` Jan Kara
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-19 14:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dave Chinner, Greg Thelen, James Bottomley, lsf, linux-fsdevel

On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> [snip]
> > > > > For throttling case, apart from metadata, I found that with simple
> > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > not be done at device level instead try to do it in higher layers,
> > > > > possibly balance_dirty_pages() and throttle process early.
> > > > 
> > > > The problem with doing it at the page cache entry level is that
> > > > cache hits then get throttled. It's not really a an IO controller at
> > > > that point, and the impact on application performance could be huge
> > > > (i.e. MB/s instead of GB/s).
> > > 
> > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > if page being asked for is in cache or not and charge for IO accordingly.
> > 
> > You'd need hooks in find_or_create_page(), though you have no
> > context of whether a read or a write is in progress at that point.
> 
> I'm confused.  Where is the throttling at cache hits?
> 
> The balance_dirty_pages() throttling kicks in at write() syscall and
> page fault time. For example, generic_perform_write(), do_wp_page()
> and __do_fault() will explicitly call
> balance_dirty_pages_ratelimited() to do the write throttling.

This comment was in the context of what if we move block IO controller read
throttling also in higher layers. Then we don't want to throttle reads
which are already in cache.

Currently throttling hook is in generic_make_request() and it kicks in
only if data is not present in page cache and actual disk IO is initiated.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF))
  2011-04-19 14:30                                         ` Vivek Goyal
@ 2011-04-19 14:45                                           ` Jan Kara
  2011-04-19 17:17                                           ` Vivek Goyal
  2011-04-21  0:29                                           ` Dave Chinner
  2 siblings, 0 replies; 166+ messages in thread
From: Jan Kara @ 2011-04-19 14:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dave Chinner, Jan Kara, Greg Thelen, James Bottomley, lsf,
	linux-fsdevel, linux kernel mailing list

On Tue 19-04-11 10:30:22, Vivek Goyal wrote:
> On Tue, Apr 19, 2011 at 10:33:39AM +1000, Dave Chinner wrote:
> > If you want to throttle journal operations, then we probably need to
> > throttle metadata operations that commit to the journal, not the
> > journal IO itself.  The journal is a shared global resource that all
> > cgroups use, so throttling journal IO inappropriately will affect
> > the performance of all cgroups, not just the one that is "hogging"
> > it.
> 
> Agreed.
> 
> > 
> > In XFS, you could probably do this at the transaction reservation
> > stage where log space is reserved. We know everything about the
> > transaction at this point in time, and we throttle here already when
> > the journal is full. Adding cgroup transaction limits to this point
> > would be the place to do it, but the control parameter for it would
> > be very XFS specific (i.e. number of transactions/s). Concurrency is
> > not an issue - the XFS transaction subsystem is only limited in
> > concurrency by the space available in the journal for reservations
> > (hundred to thousands of concurrent transactions).
> 
> Instead of transaction per second, can we implement some kind of upper
> limit of pending transactions per cgroup. And that limit does not have
> to be user tunable to begin with. The effective transactions/sec rate
> will automatically be determined by IO throttling rate of the cgroup
> at the end nodes.
> 
> I think effectively what we need is that the notion of parallel
> transactions so that transactions of one cgroup can make progress
> independent of transactions of other cgroup. So if a process does
> an fsync and it is throttled then it should block transaction of 
> only that cgroup and not other cgroups.
> 
> You mentioned that concurrency is not an issue in XFS and hundreds of
> thousands of concurrent trasactions can progress depending on log space
> available. If that's the case, I think to begin with we might not have
> to do anything at all. Processes can still get blocked but as long as
> we have enough log space, this might not be a frequent event. I will
> do some testing with XFS and see can I livelock the system with very
> low IO limits.
> 
> > 
> > FWIW, this would even allow per-bdi-flusher thread transaction
> > throttling parameters to be set, so writeback triggered metadata IO
> > could possibly be limited as well.
> 
> How does writeback trigger metadata IO?
  Because by writing data, you may need to do block allocation or mark
blocks as written on disk, or similar changes to metadata...

> In the first step I was looking to not throttle meta data IO as that
> will require even more changes in file system layer. I was thinking
> that if we provide throttling only for data and do changes in filesystems
> so that concurrent transactions can exist and make progress and file
> system IO does not serialize behind slow throttled cgroup.
  Yes, I think not throttling metadata is a good start.

> This leads to weaker isolation but atleast we don't run into livelocking
> or filesystem scalability issues. Once that's resolved, we can handle the
> case of throttling meta data IO also.
> 
> In fact if metadata is dependent on data (in ordered mode) and if we are
> throttling data, then we automatically throttle meata for select cases.
> 
> > 
> > I'm not sure whether this is possible with other filesystems, and
> > ext3/4 would still have the issue of ordered writeback causing much
> > more writeback than expected at times (e.g. fsync), but I suspect
> > there is nothing that can really be done about this.
> 
> Can't this be modified so that multiple per cgroup transactions can make
> progress. So if one fsync is blocked, then processes in other cgroup
> should still be able to do IO using a separate transaction and be able
> to commit it.
  Not really. Ext3/4 has always a single running transaction and all
metadata updates from all threads are recorded in it. When the transaction
grows large/old enough, we commit it and start a new transaction. The fact
that there is always just one running transaction is heavily used in the
journaling code so it would need serious rewrite of JBD2...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-19 14:34                                 ` Vivek Goyal
@ 2011-04-19 14:48                                   ` Jan Kara
  2011-04-19 15:11                                     ` Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Jan Kara @ 2011-04-19 14:48 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > [snip]
> > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > 
> > > > > The problem with doing it at the page cache entry level is that
> > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > that point, and the impact on application performance could be huge
> > > > > (i.e. MB/s instead of GB/s).
> > > > 
> > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > 
> > > You'd need hooks in find_or_create_page(), though you have no
> > > context of whether a read or a write is in progress at that point.
> > 
> > I'm confused.  Where is the throttling at cache hits?
> > 
> > The balance_dirty_pages() throttling kicks in at write() syscall and
> > page fault time. For example, generic_perform_write(), do_wp_page()
> > and __do_fault() will explicitly call
> > balance_dirty_pages_ratelimited() to do the write throttling.
> 
> This comment was in the context of what if we move block IO controller read
> throttling also in higher layers. Then we don't want to throttle reads
> which are already in cache.
> 
> Currently throttling hook is in generic_make_request() and it kicks in
> only if data is not present in page cache and actual disk IO is initiated.
  You can always throttle in readpage(). It's not much higher than
generic_make_request() but basically as high as it can get I suspect
(otherwise you'd have to deal with lots of different code paths like page
faults, splice, read, ...).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-19 14:48                                   ` Jan Kara
@ 2011-04-19 15:11                                     ` Vivek Goyal
  2011-04-19 15:22                                       ` Wu Fengguang
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-19 15:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > [snip]
> > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > 
> > > > > > The problem with doing it at the page cache entry level is that
> > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > that point, and the impact on application performance could be huge
> > > > > > (i.e. MB/s instead of GB/s).
> > > > > 
> > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > 
> > > > You'd need hooks in find_or_create_page(), though you have no
> > > > context of whether a read or a write is in progress at that point.
> > > 
> > > I'm confused.  Where is the throttling at cache hits?
> > > 
> > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > and __do_fault() will explicitly call
> > > balance_dirty_pages_ratelimited() to do the write throttling.
> > 
> > This comment was in the context of what if we move block IO controller read
> > throttling also in higher layers. Then we don't want to throttle reads
> > which are already in cache.
> > 
> > Currently throttling hook is in generic_make_request() and it kicks in
> > only if data is not present in page cache and actual disk IO is initiated.
>   You can always throttle in readpage(). It's not much higher than
> generic_make_request() but basically as high as it can get I suspect
> (otherwise you'd have to deal with lots of different code paths like page
> faults, splice, read, ...).

Yep, I was thinking that what do I gain by moving READ throttling up. 
The only thing generic_make_request() does not catch is network file
systems. I think for that I can introduce another hook say in NFS and
I might be all set.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-19 15:11                                     ` Vivek Goyal
@ 2011-04-19 15:22                                       ` Wu Fengguang
  2011-04-19 15:31                                         ` Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Wu Fengguang @ 2011-04-19 15:22 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jan Kara, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > [snip]
> > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > 
> > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > that point, and the impact on application performance could be huge
> > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > 
> > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > 
> > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > context of whether a read or a write is in progress at that point.
> > > > 
> > > > I'm confused.  Where is the throttling at cache hits?
> > > > 
> > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > and __do_fault() will explicitly call
> > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > 
> > > This comment was in the context of what if we move block IO controller read
> > > throttling also in higher layers. Then we don't want to throttle reads
> > > which are already in cache.
> > > 
> > > Currently throttling hook is in generic_make_request() and it kicks in
> > > only if data is not present in page cache and actual disk IO is initiated.
> >   You can always throttle in readpage(). It's not much higher than
> > generic_make_request() but basically as high as it can get I suspect
> > (otherwise you'd have to deal with lots of different code paths like page
> > faults, splice, read, ...).
> 
> Yep, I was thinking that what do I gain by moving READ throttling up. 
> The only thing generic_make_request() does not catch is network file
> systems. I think for that I can introduce another hook say in NFS and
> I might be all set.

Basically all data reads go through the readahead layer, and the
__do_page_cache_readahead() function.

Just one more option for your tradeoffs :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-19 15:22                                       ` Wu Fengguang
@ 2011-04-19 15:31                                         ` Vivek Goyal
  2011-04-19 16:58                                           ` Wu Fengguang
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-19 15:31 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > [snip]
> > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > 
> > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > 
> > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > 
> > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > context of whether a read or a write is in progress at that point.
> > > > > 
> > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > 
> > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > and __do_fault() will explicitly call
> > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > 
> > > > This comment was in the context of what if we move block IO controller read
> > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > which are already in cache.
> > > > 
> > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > only if data is not present in page cache and actual disk IO is initiated.
> > >   You can always throttle in readpage(). It's not much higher than
> > > generic_make_request() but basically as high as it can get I suspect
> > > (otherwise you'd have to deal with lots of different code paths like page
> > > faults, splice, read, ...).
> > 
> > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > The only thing generic_make_request() does not catch is network file
> > systems. I think for that I can introduce another hook say in NFS and
> > I might be all set.
> 
> Basically all data reads go through the readahead layer, and the
> __do_page_cache_readahead() function.
> 
> Just one more option for your tradeoffs :)

But this does not cover direct IO?

But I guess if I split the hook into two parts (one in direct IO path
and one in __do_page_cache_readahead()), then filesystems don't have
to mark meta data READS. I will look into it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-19 15:31                                         ` Vivek Goyal
@ 2011-04-19 16:58                                           ` Wu Fengguang
  2011-04-19 17:05                                             ` Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Wu Fengguang @ 2011-04-19 16:58 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jan Kara, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > [snip]
> > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > 
> > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > 
> > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > 
> > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > 
> > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > 
> > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > and __do_fault() will explicitly call
> > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > 
> > > > > This comment was in the context of what if we move block IO controller read
> > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > which are already in cache.
> > > > > 
> > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > >   You can always throttle in readpage(). It's not much higher than
> > > > generic_make_request() but basically as high as it can get I suspect
> > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > faults, splice, read, ...).
> > > 
> > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > The only thing generic_make_request() does not catch is network file
> > > systems. I think for that I can introduce another hook say in NFS and
> > > I might be all set.
> > 
> > Basically all data reads go through the readahead layer, and the
> > __do_page_cache_readahead() function.
> > 
> > Just one more option for your tradeoffs :)
> 
> But this does not cover direct IO?

Yes, sorry!

> But I guess if I split the hook into two parts (one in direct IO path
> and one in __do_page_cache_readahead()), then filesystems don't have
> to mark meta data READS. I will look into it.

Right, and the hooks should be trivial to add.

The readahead code is typically invoked in three ways:

- sync readahead, on page cache miss, => page_cache_sync_readahead()

- async readahead, on hitting PG_readahead (tagged on one page per readahead window),
  => page_cache_async_readahead()

- user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()

ext3/4 also call into readahead on readdir().

The readahead window size is typically 128K, but much larger for
software raid, btrfs and NFS, typically multiple MB and even more.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-19 16:58                                           ` Wu Fengguang
@ 2011-04-19 17:05                                             ` Vivek Goyal
  2011-04-19 20:58                                               ` Jan Kara
  2011-04-20  1:16                                               ` Wu Fengguang
  0 siblings, 2 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-19 17:05 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > [snip]
> > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > 
> > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > 
> > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > 
> > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > 
> > > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > > 
> > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > and __do_fault() will explicitly call
> > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > 
> > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > which are already in cache.
> > > > > > 
> > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > >   You can always throttle in readpage(). It's not much higher than
> > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > faults, splice, read, ...).
> > > > 
> > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > The only thing generic_make_request() does not catch is network file
> > > > systems. I think for that I can introduce another hook say in NFS and
> > > > I might be all set.
> > > 
> > > Basically all data reads go through the readahead layer, and the
> > > __do_page_cache_readahead() function.
> > > 
> > > Just one more option for your tradeoffs :)
> > 
> > But this does not cover direct IO?
> 
> Yes, sorry!
> 
> > But I guess if I split the hook into two parts (one in direct IO path
> > and one in __do_page_cache_readahead()), then filesystems don't have
> > to mark meta data READS. I will look into it.
> 
> Right, and the hooks should be trivial to add.
> 
> The readahead code is typically invoked in three ways:
> 
> - sync readahead, on page cache miss, => page_cache_sync_readahead()
> 
> - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
>   => page_cache_async_readahead()
> 
> - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> 
> ext3/4 also call into readahead on readdir().

So this will be called for even meta data READS. Then there is no
advantage of moving the throttle hook out of generic_make_request()?

Instead what I will need is that ask file systems to mark meta data
IO so that I can avoid throttling.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF))
  2011-04-19 14:30                                         ` Vivek Goyal
  2011-04-19 14:45                                           ` Jan Kara
@ 2011-04-19 17:17                                           ` Vivek Goyal
  2011-04-19 18:30                                             ` Vivek Goyal
  2011-04-21  0:29                                           ` Dave Chinner
  2 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-19 17:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel,
	linux kernel mailing list

On Tue, Apr 19, 2011 at 10:30:22AM -0400, Vivek Goyal wrote:

[..]
> > 
> > In XFS, you could probably do this at the transaction reservation
> > stage where log space is reserved. We know everything about the
> > transaction at this point in time, and we throttle here already when
> > the journal is full. Adding cgroup transaction limits to this point
> > would be the place to do it, but the control parameter for it would
> > be very XFS specific (i.e. number of transactions/s). Concurrency is
> > not an issue - the XFS transaction subsystem is only limited in
> > concurrency by the space available in the journal for reservations
> > (hundred to thousands of concurrent transactions).
> 
> Instead of transaction per second, can we implement some kind of upper
> limit of pending transactions per cgroup. And that limit does not have
> to be user tunable to begin with. The effective transactions/sec rate
> will automatically be determined by IO throttling rate of the cgroup
> at the end nodes.
> 
> I think effectively what we need is that the notion of parallel
> transactions so that transactions of one cgroup can make progress
> independent of transactions of other cgroup. So if a process does
> an fsync and it is throttled then it should block transaction of 
> only that cgroup and not other cgroups.
> 
> You mentioned that concurrency is not an issue in XFS and hundreds of
> thousands of concurrent trasactions can progress depending on log space
> available. If that's the case, I think to begin with we might not have
> to do anything at all. Processes can still get blocked but as long as
> we have enough log space, this might not be a frequent event. I will
> do some testing with XFS and see can I livelock the system with very
> low IO limits.

Wow, XFS seems to be doing pretty good here. I created a group of
1 bytes/sec limit and wrote few bytes in a file and write quit it (vim).
That led to an fsync and process got blocked. From a different cgroup, in the
same directory I seem to be able to do all other regular operations like ls,
opening a new file, editing it etc.

ext4 will lockup immediately. So concurrent transactions do seem to work in
XFS.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF))
  2011-04-19 17:17                                           ` Vivek Goyal
@ 2011-04-19 18:30                                             ` Vivek Goyal
  2011-04-21  0:32                                               ` Dave Chinner
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-19 18:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel,
	linux kernel mailing list

On Tue, Apr 19, 2011 at 01:17:23PM -0400, Vivek Goyal wrote:
> On Tue, Apr 19, 2011 at 10:30:22AM -0400, Vivek Goyal wrote:
> 
> [..]
> > > 
> > > In XFS, you could probably do this at the transaction reservation
> > > stage where log space is reserved. We know everything about the
> > > transaction at this point in time, and we throttle here already when
> > > the journal is full. Adding cgroup transaction limits to this point
> > > would be the place to do it, but the control parameter for it would
> > > be very XFS specific (i.e. number of transactions/s). Concurrency is
> > > not an issue - the XFS transaction subsystem is only limited in
> > > concurrency by the space available in the journal for reservations
> > > (hundred to thousands of concurrent transactions).
> > 
> > Instead of transaction per second, can we implement some kind of upper
> > limit of pending transactions per cgroup. And that limit does not have
> > to be user tunable to begin with. The effective transactions/sec rate
> > will automatically be determined by IO throttling rate of the cgroup
> > at the end nodes.
> > 
> > I think effectively what we need is that the notion of parallel
> > transactions so that transactions of one cgroup can make progress
> > independent of transactions of other cgroup. So if a process does
> > an fsync and it is throttled then it should block transaction of 
> > only that cgroup and not other cgroups.
> > 
> > You mentioned that concurrency is not an issue in XFS and hundreds of
> > thousands of concurrent trasactions can progress depending on log space
> > available. If that's the case, I think to begin with we might not have
> > to do anything at all. Processes can still get blocked but as long as
> > we have enough log space, this might not be a frequent event. I will
> > do some testing with XFS and see can I livelock the system with very
> > low IO limits.
> 
> Wow, XFS seems to be doing pretty good here. I created a group of
> 1 bytes/sec limit and wrote few bytes in a file and write quit it (vim).
> That led to an fsync and process got blocked. From a different cgroup, in the
> same directory I seem to be able to do all other regular operations like ls,
> opening a new file, editing it etc.
> 
> ext4 will lockup immediately. So concurrent transactions do seem to work in
> XFS.

Well, I used tedso's fsync tester test case which wrote a file of 1MB
and then did fsync. I launched this test case in two cgroups. One is 
throttled and other is not. Looks like unthrottled one gets blocked
somewhere and can't make progress. So there are dependencies somewhere
even with XFS.

Thanks
Vivek
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-19 17:05                                             ` Vivek Goyal
@ 2011-04-19 20:58                                               ` Jan Kara
  2011-04-20  1:21                                                 ` Wu Fengguang
  2011-04-20  1:16                                               ` Wu Fengguang
  1 sibling, 1 reply; 166+ messages in thread
From: Jan Kara @ 2011-04-19 20:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, Jan Kara, James Bottomley, lsf, linux-fsdevel,
	Dave Chinner

On Tue 19-04-11 13:05:43, Vivek Goyal wrote:
> On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > [snip]
> > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > > 
> > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > 
> > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > > 
> > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > 
> > > > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > > > 
> > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > and __do_fault() will explicitly call
> > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > 
> > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > > which are already in cache.
> > > > > > > 
> > > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > > >   You can always throttle in readpage(). It's not much higher than
> > > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > faults, splice, read, ...).
> > > > > 
> > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > The only thing generic_make_request() does not catch is network file
> > > > > systems. I think for that I can introduce another hook say in NFS and
> > > > > I might be all set.
> > > > 
> > > > Basically all data reads go through the readahead layer, and the
> > > > __do_page_cache_readahead() function.
> > > > 
> > > > Just one more option for your tradeoffs :)
> > > 
> > > But this does not cover direct IO?
> > 
> > Yes, sorry!
> > 
> > > But I guess if I split the hook into two parts (one in direct IO path
> > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > to mark meta data READS. I will look into it.
> > 
> > Right, and the hooks should be trivial to add.
> > 
> > The readahead code is typically invoked in three ways:
> > 
> > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > 
> > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> >   => page_cache_async_readahead()
> > 
> > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > 
> > ext3/4 also call into readahead on readdir().
> 
> So this will be called for even meta data READS. Then there is no
> advantage of moving the throttle hook out of generic_make_request()?
  No, generally it won't. I think Fengguang was wrong - only ext2 carries
directories in page cache and thus uses readahead code. All other
filesystems handle directories specially and don't use readpage for them.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-19 17:05                                             ` Vivek Goyal
  2011-04-19 20:58                                               ` Jan Kara
@ 2011-04-20  1:16                                               ` Wu Fengguang
  2011-04-20 18:44                                                 ` Vivek Goyal
  1 sibling, 1 reply; 166+ messages in thread
From: Wu Fengguang @ 2011-04-20  1:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jan Kara, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Wed, Apr 20, 2011 at 01:05:43AM +0800, Vivek Goyal wrote:
> On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > [snip]
> > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > > 
> > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > 
> > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > > 
> > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > 
> > > > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > > > 
> > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > and __do_fault() will explicitly call
> > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > 
> > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > > which are already in cache.
> > > > > > > 
> > > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > > >   You can always throttle in readpage(). It's not much higher than
> > > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > faults, splice, read, ...).
> > > > > 
> > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > The only thing generic_make_request() does not catch is network file
> > > > > systems. I think for that I can introduce another hook say in NFS and
> > > > > I might be all set.
> > > > 
> > > > Basically all data reads go through the readahead layer, and the
> > > > __do_page_cache_readahead() function.
> > > > 
> > > > Just one more option for your tradeoffs :)
> > > 
> > > But this does not cover direct IO?
> > 
> > Yes, sorry!
> > 
> > > But I guess if I split the hook into two parts (one in direct IO path
> > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > to mark meta data READS. I will look into it.
> > 
> > Right, and the hooks should be trivial to add.
> > 
> > The readahead code is typically invoked in three ways:
> > 
> > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > 
> > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> >   => page_cache_async_readahead()
> > 
> > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > 
> > ext3/4 also call into readahead on readdir().
> 
> So this will be called for even meta data READS. Then there is no
> advantage of moving the throttle hook out of generic_make_request()?
> Instead what I will need is that ask file systems to mark meta data
> IO so that I can avoid throttling.

Do you want to avoid meta data itself, or to avoid overall performance
being impacted as a result of meta data read throttling?

Either way, you have the freedom to test whether the passed filp is a
normal file or a directory "file", and do conditional throttling.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-19 20:58                                               ` Jan Kara
@ 2011-04-20  1:21                                                 ` Wu Fengguang
  2011-04-20 10:56                                                   ` Jan Kara
  0 siblings, 1 reply; 166+ messages in thread
From: Wu Fengguang @ 2011-04-20  1:21 UTC (permalink / raw)
  To: Jan Kara; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Wed, Apr 20, 2011 at 04:58:21AM +0800, Jan Kara wrote:
> On Tue 19-04-11 13:05:43, Vivek Goyal wrote:
> > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > > [snip]
> > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > > > 
> > > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > > 
> > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > > > 
> > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > > 
> > > > > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > > > > 
> > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > > and __do_fault() will explicitly call
> > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > > 
> > > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > > > which are already in cache.
> > > > > > > > 
> > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > > > >   You can always throttle in readpage(). It's not much higher than
> > > > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > > faults, splice, read, ...).
> > > > > > 
> > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > > The only thing generic_make_request() does not catch is network file
> > > > > > systems. I think for that I can introduce another hook say in NFS and
> > > > > > I might be all set.
> > > > > 
> > > > > Basically all data reads go through the readahead layer, and the
> > > > > __do_page_cache_readahead() function.
> > > > > 
> > > > > Just one more option for your tradeoffs :)
> > > > 
> > > > But this does not cover direct IO?
> > > 
> > > Yes, sorry!
> > > 
> > > > But I guess if I split the hook into two parts (one in direct IO path
> > > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > > to mark meta data READS. I will look into it.
> > > 
> > > Right, and the hooks should be trivial to add.
> > > 
> > > The readahead code is typically invoked in three ways:
> > > 
> > > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > > 
> > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> > >   => page_cache_async_readahead()
> > > 
> > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > > 
> > > ext3/4 also call into readahead on readdir().
> > 
> > So this will be called for even meta data READS. Then there is no
> > advantage of moving the throttle hook out of generic_make_request()?
>   No, generally it won't. I think Fengguang was wrong - only ext2 carries
> directories in page cache and thus uses readahead code. All other
> filesystems handle directories specially and don't use readpage for them.

So ext2 is implicitly using readahead? ext3/4 behave different in that
ext4_readdir() has an explicit call to page_cache_sync_readahead(),
passing the blockdev mapping as the page cache container.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-20  1:21                                                 ` Wu Fengguang
@ 2011-04-20 10:56                                                   ` Jan Kara
  2011-04-20 11:19                                                     ` Wu Fengguang
  0 siblings, 1 reply; 166+ messages in thread
From: Jan Kara @ 2011-04-20 10:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Vivek Goyal, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Wed 20-04-11 09:21:31, Wu Fengguang wrote:
> On Wed, Apr 20, 2011 at 04:58:21AM +0800, Jan Kara wrote:
> > On Tue 19-04-11 13:05:43, Vivek Goyal wrote:
> > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > > > [snip]
> > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > > > 
> > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > > > > 
> > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > > > 
> > > > > > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > > > > > 
> > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > > > and __do_fault() will explicitly call
> > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > > > 
> > > > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > > > > which are already in cache.
> > > > > > > > > 
> > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > > > > >   You can always throttle in readpage(). It's not much higher than
> > > > > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > > > faults, splice, read, ...).
> > > > > > > 
> > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > > > The only thing generic_make_request() does not catch is network file
> > > > > > > systems. I think for that I can introduce another hook say in NFS and
> > > > > > > I might be all set.
> > > > > > 
> > > > > > Basically all data reads go through the readahead layer, and the
> > > > > > __do_page_cache_readahead() function.
> > > > > > 
> > > > > > Just one more option for your tradeoffs :)
> > > > > 
> > > > > But this does not cover direct IO?
> > > > 
> > > > Yes, sorry!
> > > > 
> > > > > But I guess if I split the hook into two parts (one in direct IO path
> > > > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > > > to mark meta data READS. I will look into it.
> > > > 
> > > > Right, and the hooks should be trivial to add.
> > > > 
> > > > The readahead code is typically invoked in three ways:
> > > > 
> > > > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > > > 
> > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> > > >   => page_cache_async_readahead()
> > > > 
> > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > > > 
> > > > ext3/4 also call into readahead on readdir().
> > > 
> > > So this will be called for even meta data READS. Then there is no
> > > advantage of moving the throttle hook out of generic_make_request()?
> >   No, generally it won't. I think Fengguang was wrong - only ext2 carries
> > directories in page cache and thus uses readahead code. All other
> > filesystems handle directories specially and don't use readpage for them.
> 
> So ext2 is implicitly using readahead? ext3/4 behave different in that
> ext4_readdir() has an explicit call to page_cache_sync_readahead(),
> passing the blockdev mapping as the page cache container.
  Yes, ext2 uses implicitely readahead because it uses read_mapping_page()
for directory inodes. I forgot that ext3/4 call
page_cache_sync_readahead() so you were right that they actually use it for
the device inode. I'm sorry for the noise.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-20 10:56                                                   ` Jan Kara
@ 2011-04-20 11:19                                                     ` Wu Fengguang
  2011-04-20 14:42                                                       ` Jan Kara
  0 siblings, 1 reply; 166+ messages in thread
From: Wu Fengguang @ 2011-04-20 11:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Wed, Apr 20, 2011 at 06:56:06PM +0800, Jan Kara wrote:
> On Wed 20-04-11 09:21:31, Wu Fengguang wrote:
> > On Wed, Apr 20, 2011 at 04:58:21AM +0800, Jan Kara wrote:
> > > On Tue 19-04-11 13:05:43, Vivek Goyal wrote:
> > > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > > > > [snip]
> > > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > > > > > 
> > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > > > > 
> > > > > > > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > > > > > > 
> > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > > > > and __do_fault() will explicitly call
> > > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > > > > 
> > > > > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > > > > > which are already in cache.
> > > > > > > > > > 
> > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > > > > > >   You can always throttle in readpage(). It's not much higher than
> > > > > > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > > > > faults, splice, read, ...).
> > > > > > > > 
> > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > > > > The only thing generic_make_request() does not catch is network file
> > > > > > > > systems. I think for that I can introduce another hook say in NFS and
> > > > > > > > I might be all set.
> > > > > > > 
> > > > > > > Basically all data reads go through the readahead layer, and the
> > > > > > > __do_page_cache_readahead() function.
> > > > > > > 
> > > > > > > Just one more option for your tradeoffs :)
> > > > > > 
> > > > > > But this does not cover direct IO?
> > > > > 
> > > > > Yes, sorry!
> > > > > 
> > > > > > But I guess if I split the hook into two parts (one in direct IO path
> > > > > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > > > > to mark meta data READS. I will look into it.
> > > > > 
> > > > > Right, and the hooks should be trivial to add.
> > > > > 
> > > > > The readahead code is typically invoked in three ways:
> > > > > 
> > > > > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > > > > 
> > > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> > > > >   => page_cache_async_readahead()
> > > > > 
> > > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > > > > 
> > > > > ext3/4 also call into readahead on readdir().
> > > > 
> > > > So this will be called for even meta data READS. Then there is no
> > > > advantage of moving the throttle hook out of generic_make_request()?
> > >   No, generally it won't. I think Fengguang was wrong - only ext2 carries
> > > directories in page cache and thus uses readahead code. All other
> > > filesystems handle directories specially and don't use readpage for them.
> > 
> > So ext2 is implicitly using readahead? ext3/4 behave different in that
> > ext4_readdir() has an explicit call to page_cache_sync_readahead(),
> > passing the blockdev mapping as the page cache container.
>   Yes, ext2 uses implicitely readahead because it uses read_mapping_page()
> for directory inodes. I forgot that ext3/4 call
> page_cache_sync_readahead() so you were right that they actually use it for
> the device inode. I'm sorry for the noise.

Never mind.  However I cannot find readahead calls in the
read_mapping_page() call chain. ext2 readdir() may not be doing
readahead at all...

        read_mapping_page()
          read_cache_page()
            read_cache_page_async()
              do_read_cache_page()
                __read_cache_page()

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-20 11:19                                                     ` Wu Fengguang
@ 2011-04-20 14:42                                                       ` Jan Kara
  0 siblings, 0 replies; 166+ messages in thread
From: Jan Kara @ 2011-04-20 14:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Vivek Goyal, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Wed 20-04-11 19:19:57, Wu Fengguang wrote:
> On Wed, Apr 20, 2011 at 06:56:06PM +0800, Jan Kara wrote:
> > On Wed 20-04-11 09:21:31, Wu Fengguang wrote:
> > > On Wed, Apr 20, 2011 at 04:58:21AM +0800, Jan Kara wrote:
> > > > On Tue 19-04-11 13:05:43, Vivek Goyal wrote:
> > > > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > > > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > > > > > [snip]
> > > > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > > > > > 
> > > > > > > > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > > > > > > > 
> > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > > > > > and __do_fault() will explicitly call
> > > > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > > > > > 
> > > > > > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > > > > > > which are already in cache.
> > > > > > > > > > > 
> > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > > > > > > >   You can always throttle in readpage(). It's not much higher than
> > > > > > > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > > > > > faults, splice, read, ...).
> > > > > > > > > 
> > > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > > > > > The only thing generic_make_request() does not catch is network file
> > > > > > > > > systems. I think for that I can introduce another hook say in NFS and
> > > > > > > > > I might be all set.
> > > > > > > > 
> > > > > > > > Basically all data reads go through the readahead layer, and the
> > > > > > > > __do_page_cache_readahead() function.
> > > > > > > > 
> > > > > > > > Just one more option for your tradeoffs :)
> > > > > > > 
> > > > > > > But this does not cover direct IO?
> > > > > > 
> > > > > > Yes, sorry!
> > > > > > 
> > > > > > > But I guess if I split the hook into two parts (one in direct IO path
> > > > > > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > > > > > to mark meta data READS. I will look into it.
> > > > > > 
> > > > > > Right, and the hooks should be trivial to add.
> > > > > > 
> > > > > > The readahead code is typically invoked in three ways:
> > > > > > 
> > > > > > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > > > > > 
> > > > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> > > > > >   => page_cache_async_readahead()
> > > > > > 
> > > > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > > > > > 
> > > > > > ext3/4 also call into readahead on readdir().
> > > > > 
> > > > > So this will be called for even meta data READS. Then there is no
> > > > > advantage of moving the throttle hook out of generic_make_request()?
> > > >   No, generally it won't. I think Fengguang was wrong - only ext2 carries
> > > > directories in page cache and thus uses readahead code. All other
> > > > filesystems handle directories specially and don't use readpage for them.
> > > 
> > > So ext2 is implicitly using readahead? ext3/4 behave different in that
> > > ext4_readdir() has an explicit call to page_cache_sync_readahead(),
> > > passing the blockdev mapping as the page cache container.
> >   Yes, ext2 uses implicitely readahead because it uses read_mapping_page()
> > for directory inodes. I forgot that ext3/4 call
> > page_cache_sync_readahead() so you were right that they actually use it for
> > the device inode. I'm sorry for the noise.
> 
> Never mind.  However I cannot find readahead calls in the
> read_mapping_page() call chain. ext2 readdir() may not be doing
> readahead at all...
> 
>         read_mapping_page()
>           read_cache_page()
>             read_cache_page_async()
>               do_read_cache_page()
>                 __read_cache_page()
  Right, I've now checked the real code and it would have to use
read_cache_pages() to have some readahead. I'm not completely sure where
did I get from that ext2 performs directory readahead - some papers about
ext2 I found in the Internet say so and I believe Andrew mentioned it as
well. But I cannot find a kernel where this would happen... So thanks for
correcting me :).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-20  1:16                                               ` Wu Fengguang
@ 2011-04-20 18:44                                                 ` Vivek Goyal
  2011-04-20 19:16                                                   ` Jan Kara
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-20 18:44 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Wed, Apr 20, 2011 at 09:16:38AM +0800, Wu Fengguang wrote:
> On Wed, Apr 20, 2011 at 01:05:43AM +0800, Vivek Goyal wrote:
> > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > > [snip]
> > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > > > 
> > > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > > 
> > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > > > 
> > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > > 
> > > > > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > > > > 
> > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > > and __do_fault() will explicitly call
> > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > > 
> > > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > > > which are already in cache.
> > > > > > > > 
> > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > > > >   You can always throttle in readpage(). It's not much higher than
> > > > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > > faults, splice, read, ...).
> > > > > > 
> > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > > The only thing generic_make_request() does not catch is network file
> > > > > > systems. I think for that I can introduce another hook say in NFS and
> > > > > > I might be all set.
> > > > > 
> > > > > Basically all data reads go through the readahead layer, and the
> > > > > __do_page_cache_readahead() function.
> > > > > 
> > > > > Just one more option for your tradeoffs :)
> > > > 
> > > > But this does not cover direct IO?
> > > 
> > > Yes, sorry!
> > > 
> > > > But I guess if I split the hook into two parts (one in direct IO path
> > > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > > to mark meta data READS. I will look into it.
> > > 
> > > Right, and the hooks should be trivial to add.
> > > 
> > > The readahead code is typically invoked in three ways:
> > > 
> > > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > > 
> > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> > >   => page_cache_async_readahead()
> > > 
> > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > > 
> > > ext3/4 also call into readahead on readdir().
> > 
> > So this will be called for even meta data READS. Then there is no
> > advantage of moving the throttle hook out of generic_make_request()?
> > Instead what I will need is that ask file systems to mark meta data
> > IO so that I can avoid throttling.
> 
> Do you want to avoid meta data itself, or to avoid overall performance
> being impacted as a result of meta data read throttling?

I wanted to avoid throttling metadata beacause it might lead to reduced
overall performance due to dependencies in file system layer.

> 
> Either way, you have the freedom to test whether the passed filp is a
> normal file or a directory "file", and do conditional throttling.

Ok, will look into it. That will probably take care of READS. What 
about WRITES and meta data. Is it safe to assume that any meta data
write will come in some jounalling thread context and not in user 
process context?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-20 18:44                                                 ` Vivek Goyal
@ 2011-04-20 19:16                                                   ` Jan Kara
  2011-04-21  0:17                                                   ` Dave Chinner
  2011-04-21 15:06                                                   ` Wu Fengguang
  2 siblings, 0 replies; 166+ messages in thread
From: Jan Kara @ 2011-04-20 19:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, Jan Kara, James Bottomley, lsf, linux-fsdevel,
	Dave Chinner

On Wed 20-04-11 14:44:33, Vivek Goyal wrote:
> Ok, will look into it. That will probably take care of READS. What 
> about WRITES and meta data. Is it safe to assume that any meta data
> write will come in some jounalling thread context and not in user 
> process context?
  For ext3/4 it is a journal thread context or flusher thread context
because after metadata is written to journal by journal thread, they are
left as dirty buffers in page cache of the block device. So flusher thread
can come and write them - and these writes will hold buffer lock and thus
also block any manipulation with the metadata.

I don't know about other filesystems.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-20 18:44                                                 ` Vivek Goyal
  2011-04-20 19:16                                                   ` Jan Kara
@ 2011-04-21  0:17                                                   ` Dave Chinner
  2011-04-21 15:06                                                   ` Wu Fengguang
  2 siblings, 0 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-21  0:17 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Wu Fengguang, Jan Kara, James Bottomley, lsf, linux-fsdevel

On Wed, Apr 20, 2011 at 02:44:33PM -0400, Vivek Goyal wrote:
> On Wed, Apr 20, 2011 at 09:16:38AM +0800, Wu Fengguang wrote:
> > On Wed, Apr 20, 2011 at 01:05:43AM +0800, Vivek Goyal wrote:
> > > So this will be called for even meta data READS. Then there is no
> > > advantage of moving the throttle hook out of generic_make_request()?
> > > Instead what I will need is that ask file systems to mark meta data
> > > IO so that I can avoid throttling.
> > 
> > Do you want to avoid meta data itself, or to avoid overall performance
> > being impacted as a result of meta data read throttling?
> 
> I wanted to avoid throttling metadata beacause it might lead to reduced
> overall performance due to dependencies in file system layer.
> 
> > 
> > Either way, you have the freedom to test whether the passed filp is a
> > normal file or a directory "file", and do conditional throttling.
> 
> Ok, will look into it. That will probably take care of READS. What 
> about WRITES and meta data. Is it safe to assume that any meta data
> write will come in some jounalling thread context and not in user 
> process context?

No.

Journal writes in XFS come from the context that forces them to
occur, whether it be user, bdi-flusher or background kernel thread
context. Indeed, we can even have journal writes coming from
workqueues and there is a possibility that they will always come
from a workqueue context in the next release or so. 

As for metadata buffer writes themselves, they currently come from
background kernel threads or workqueues in most normal operational
cases.  However, in certain situations (e.g. sync(1), filesystem
freeze and unmount) we can issue write IO on metadata buffers
directly from the user process context....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF))
  2011-04-19 14:30                                         ` Vivek Goyal
  2011-04-19 14:45                                           ` Jan Kara
  2011-04-19 17:17                                           ` Vivek Goyal
@ 2011-04-21  0:29                                           ` Dave Chinner
  2 siblings, 0 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-21  0:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel,
	linux kernel mailing list

On Tue, Apr 19, 2011 at 10:30:22AM -0400, Vivek Goyal wrote:
> On Tue, Apr 19, 2011 at 10:33:39AM +1000, Dave Chinner wrote:
> > On Mon, Apr 18, 2011 at 06:51:18PM -0400, Vivek Goyal wrote:
> > > On Mon, Apr 18, 2011 at 11:58:44PM +0200, Jan Kara wrote:
> > > > On Fri 15-04-11 23:06:02, Vivek Goyal wrote:
> > > > > On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote:
> > > > > > How about doing throttling at two layers. All the data throttling is
> > > > > > done in higher layers and then also retain the mechanism of throttling
> > > > > > at end device. That way an admin can put a overall limit on such 
> > > > > > common write traffic. (XFS meta data coming from workqueues, flusher
> > > > > > thread, kswapd etc).
> > > > > > 
> > > > > > Anyway, we can't attribute this IO to per process context/group otherwise
> > > > > > most likely something will get serialized in higher layers.
> > > > > >  
> > > > > > Right now I am speaking purely from IO throttling point of view and not
> > > > > > even thinking about CFQ and IO tracking stuff.
> > > > > > 
> > > > > > This increases the complexity in IO cgroup interface as now we see to have
> > > > > > four combinations.
> > > > > > 
> > > > > >   Global Throttling
> > > > > >   	Throttling at lower layers
> > > > > >   	Throttling at higher layers.
> > > > > > 
> > > > > >   Per device throttling
> > > > > >  	 Throttling at lower layers
> > > > > >   	Throttling at higher layers.
> > > > > 
> > > > > Dave, 
> > > > > 
> > > > > I wrote above but I myself am not fond of coming up with 4 combinations.
> > > > > Want to limit it two. Per device throttling or global throttling. Here
> > > > > are some more thoughts in general about both throttling policy and
> > > > > proportional policy of IO controller. For throttling policy, I am 
> > > > > primarily concerned with how to avoid file system serialization issues.
> > > > > 
> > > > > Proportional IO (CFQ)
> > > > > ---------------------
> > > > > - Make writeback cgroup aware and kernel threads (flusher) which are
> > > > >   cgroup aware can be marked with a task flag (GROUP_AWARE). If a 
> > > > >   cgroup aware kernel threads throws IO at CFQ, then IO is accounted
> > > > >   to cgroup of task who originally dirtied the page. Otherwise we use
> > > > >   task context to account the IO to.
> > > > > 
> > > > >   So any IO submitted by flusher threads will go to respective cgroups
> > > > >   and higher weight cgroup should be able to do more WRITES.
> > > > > 
> > > > >   IO submitted by other kernel threads like kjournald, XFS async metadata
> > > > >   submission, kswapd etc all goes to thread context and that is root
> > > > >   group.
> > > > > 
> > > > > - If kswapd is a concern then either make kswapd cgroup aware or let
> > > > >   kswapd use cgroup aware flusher to do IO (Dave Chinner's idea).
> > > > > 
> > > > > Open Issues
> > > > > -----------
> > > > > - We do not get isolation for meta data IO. In virtualized setup, to
> > > > >   achieve stronger isolation do not use host filesystem. Export block
> > > > >   devices into guests.
> > > > > 
> > > > > IO throttling
> > > > > ------------
> > > > > 
> > > > > READS
> > > > > -----
> > > > > - Do not throttle meta data IO. Filesystem needs to mark READ metadata
> > > > >   IO so that we can avoid throttling it. This way ordered filesystems
> > > > >   will not get serialized behind a throttled read in slow group.
> > > > > 
> > > > >   May be one can account meta data read to a group and try to use that
> > > > >   to throttle data IO in same cgroup as a compensation.
> > > > >  
> > > > > WRITES
> > > > > ------
> > > > > - Throttle tasks. Do not throttle bios. That means that when a task
> > > > >   submits direct write, let it go to disk. Do the accounting and if task
> > > > >   is exceeding the IO rate make it sleep. Something similar to
> > > > >   balance_dirty_pages().
> > > > > 
> > > > >   That way, any direct WRITES should not run into any serialization issues
> > > > >   in ordered mode. We can continue to use blkio_throtle_bio() hook in
> > > > >   generic_make request().
> > > > > 
> > > > > - For buffered WRITES, design a throttling hook similar to
> > > > >   balance_drity_pages() and throttle tasks according to rules while they
> > > > >   are dirtying page cache.
> > > > > 
> > > > > - Do not throttle buffered writes again at the end device as these have
> > > > >   been throttled already while writting to page cache. Also throttling
> > > > >   WRITES at end device will lead to serialization issues with file systems
> > > > >   in ordered mode.
> > > > > 
> > > > > - Cgroup of a IO is always attributed to submitting thread. That way all
> > > > >   meta data writes will go in root cgroup and remain unthrottled. If one
> > > > >   is too concerned with lots of meta data IO, then probably one can
> > > > >   put a throttling rule in root cgroup.
> > > >   But I think the above scheme basically allows agressive buffered writer
> > > > to occupy as much of disk throughput as throttling at page dirty time
> > > > allows. So either you'd have to seriously limit the speed of page dirtying
> > > > for each cgroup (effectively giving each write properties like direct write)
> > > > or you'd have to live with cgroup taking your whole disk throughput. Neither
> > > > of which seems very appealing. Grumble, not that I have a good solution to
> > > > this problem...
> > > 
> > > [CCing lkml]
> > > 
> > > Hi Jan,
> > > 
> > > I agree that if we do throttling in balance_dirty_pages() to solve the
> > > issue of file system ordered mode, then we allow flusher threads to
> > > write data at high rate which is bad. Keeping write throttling at device
> > > level runs into issues of file system ordered mode write.
> > > 
> > > I think problem is that file systems are not cgroup aware (/me runs for
> > > cover) and we are just trying to work around that hence none of the proposed
> > > problem solution is not satisfying.
> > > 
> > > To get cgroup thing right, we shall have to make whole stack cgroup aware.
> > > In this case because file system journaling is not cgroup aware and is
> > > essentially a serialized operation and life becomes hard. Throttling is
> > > in higher layer is not a good solution and throttling in lower layer
> > > is not a good solution either.
> > > 
> > > Ideally, throttling in generic_make_request() is good as long as all the
> > > layers sitting above it (file systems, flusher writeback, page cache share)
> > > can be made cgroup aware. So that if a cgroup is throttled, others cgroup
> > > are more or less not impacted by throttled cgroup. We have talked about
> > > making flusher cgroup aware and per cgroup dirty ratio thing, but making
> > > file system journalling cgroup aware seems to be out of question (I don't
> > > even know if it is possible to do and how much work does it involve).
> > 
> > If you want to throttle journal operations, then we probably need to
> > throttle metadata operations that commit to the journal, not the
> > journal IO itself.  The journal is a shared global resource that all
> > cgroups use, so throttling journal IO inappropriately will affect
> > the performance of all cgroups, not just the one that is "hogging"
> > it.
> 
> Agreed.
> 
> > 
> > In XFS, you could probably do this at the transaction reservation
> > stage where log space is reserved. We know everything about the
> > transaction at this point in time, and we throttle here already when
> > the journal is full. Adding cgroup transaction limits to this point
> > would be the place to do it, but the control parameter for it would
> > be very XFS specific (i.e. number of transactions/s). Concurrency is
> > not an issue - the XFS transaction subsystem is only limited in
> > concurrency by the space available in the journal for reservations
> > (hundred to thousands of concurrent transactions).
> 
> Instead of transaction per second, can we implement some kind of upper
> limit of pending transactions per cgroup. And that limit does not have
> to be user tunable to begin with. The effective transactions/sec rate
> will automatically be determined by IO throttling rate of the cgroup
> at the end nodes.

Sure - that's just another measure of the same thing, really.

> I think effectively what we need is that the notion of parallel
> transactions so that transactions of one cgroup can make progress
> independent of transactions of other cgroup. So if a process does
> an fsync and it is throttled then it should block transaction of 
> only that cgroup and not other cgroups.

Parallel transactions only get you so far - there's still the
serialisation of the transaction commit that occurs.

> You mentioned that concurrency is not an issue in XFS and hundreds of
> thousands of concurrent trasactions can progress depending on log space

"hundreds _to_ thousands of concurrent transactions". You read a
couple of orders of magnitude larger number there ;)

> > FWIW, this would even allow per-bdi-flusher thread transaction
> > throttling parameters to be set, so writeback triggered metadata IO
> > could possibly be limited as well.
> 
> How does writeback trigger metadata IO?

Allocation might need to read free space btree blocks, transaction
reservation can trigger a log tail push becuase there isn't enough
space in the log, transaction commit might cause journal writes....


> > I'm not sure whether this is possible with other filesystems, and
> > ext3/4 would still have the issue of ordered writeback causing much
> > more writeback than expected at times (e.g. fsync), but I suspect
> > there is nothing that can really be done about this.
> 
> Can't this be modified so that multiple per cgroup transactions can make
> progress. So if one fsync is blocked, then processes in other cgroup
> should still be able to do IO using a separate transaction and be able
> to commit it.

That would be for the ext4 guys to answer.

> > FWIW, if you really want cgroups integrated properly into XFS, then
> > they need to be integrated into the allocator as well so we can push
> > isolateed cgroups into different, non-contending regions of the
> > filesystem (similar to filestreams containers). I started on an
> > general allocation policy framework for XFS a few years ago, but
> > never had more than a POC prototype. I always intended this
> > framework to implement (at the time) a cpuset aware policy, so I'm
> > pretty sure such an approach would work for cgroups, too. Maybe it's
> > time to dust off that patch set....
> 
> So having separate allocation areas/groups for separate group is useful
> from locking perspective? Is it useful even if we do not throttle
> meta data?

Yes. Allocation groups have their own locking and can operate
completely in parallel. The only typical serialisation point between
allocation transactions in different AGs is the transaction
commit...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF))
  2011-04-19 18:30                                             ` Vivek Goyal
@ 2011-04-21  0:32                                               ` Dave Chinner
  0 siblings, 0 replies; 166+ messages in thread
From: Dave Chinner @ 2011-04-21  0:32 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel,
	linux kernel mailing list

On Tue, Apr 19, 2011 at 02:30:22PM -0400, Vivek Goyal wrote:
> On Tue, Apr 19, 2011 at 01:17:23PM -0400, Vivek Goyal wrote:
> > On Tue, Apr 19, 2011 at 10:30:22AM -0400, Vivek Goyal wrote:
> > 
> > [..]
> > > > 
> > > > In XFS, you could probably do this at the transaction reservation
> > > > stage where log space is reserved. We know everything about the
> > > > transaction at this point in time, and we throttle here already when
> > > > the journal is full. Adding cgroup transaction limits to this point
> > > > would be the place to do it, but the control parameter for it would
> > > > be very XFS specific (i.e. number of transactions/s). Concurrency is
> > > > not an issue - the XFS transaction subsystem is only limited in
> > > > concurrency by the space available in the journal for reservations
> > > > (hundred to thousands of concurrent transactions).
> > > 
> > > Instead of transaction per second, can we implement some kind of upper
> > > limit of pending transactions per cgroup. And that limit does not have
> > > to be user tunable to begin with. The effective transactions/sec rate
> > > will automatically be determined by IO throttling rate of the cgroup
> > > at the end nodes.
> > > 
> > > I think effectively what we need is that the notion of parallel
> > > transactions so that transactions of one cgroup can make progress
> > > independent of transactions of other cgroup. So if a process does
> > > an fsync and it is throttled then it should block transaction of 
> > > only that cgroup and not other cgroups.
> > > 
> > > You mentioned that concurrency is not an issue in XFS and hundreds of
> > > thousands of concurrent trasactions can progress depending on log space
> > > available. If that's the case, I think to begin with we might not have
> > > to do anything at all. Processes can still get blocked but as long as
> > > we have enough log space, this might not be a frequent event. I will
> > > do some testing with XFS and see can I livelock the system with very
> > > low IO limits.
> > 
> > Wow, XFS seems to be doing pretty good here. I created a group of
> > 1 bytes/sec limit and wrote few bytes in a file and write quit it (vim).
> > That led to an fsync and process got blocked. From a different cgroup, in the
> > same directory I seem to be able to do all other regular operations like ls,
> > opening a new file, editing it etc.
> > 
> > ext4 will lockup immediately. So concurrent transactions do seem to work in
> > XFS.
> 
> Well, I used tedso's fsync tester test case which wrote a file of 1MB
> and then did fsync. I launched this test case in two cgroups. One is 
> throttled and other is not. Looks like unthrottled one gets blocked
> somewhere and can't make progress. So there are dependencies somewhere
> even with XFS.

Yes, if you throttle the journal commit IO then other transaction
commits will stall when we run out of log buffers to write new
commits to disk. Like I said - the journal is a shared resource and
stalling it will eventually stop _everything_.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-20 18:44                                                 ` Vivek Goyal
  2011-04-20 19:16                                                   ` Jan Kara
  2011-04-21  0:17                                                   ` Dave Chinner
@ 2011-04-21 15:06                                                   ` Wu Fengguang
  2011-04-21 15:10                                                     ` Wu Fengguang
  2011-04-21 17:20                                                     ` Vivek Goyal
  2 siblings, 2 replies; 166+ messages in thread
From: Wu Fengguang @ 2011-04-21 15:06 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jan Kara, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Thu, Apr 21, 2011 at 02:44:33AM +0800, Vivek Goyal wrote:
> On Wed, Apr 20, 2011 at 09:16:38AM +0800, Wu Fengguang wrote:
> > On Wed, Apr 20, 2011 at 01:05:43AM +0800, Vivek Goyal wrote:
> > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > > > [snip]
> > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > > > 
> > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > > > > 
> > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > > > 
> > > > > > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > > > > > 
> > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > > > and __do_fault() will explicitly call
> > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > > > 
> > > > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > > > > which are already in cache.
> > > > > > > > > 
> > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > > > > >   You can always throttle in readpage(). It's not much higher than
> > > > > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > > > faults, splice, read, ...).
> > > > > > > 
> > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > > > The only thing generic_make_request() does not catch is network file
> > > > > > > systems. I think for that I can introduce another hook say in NFS and
> > > > > > > I might be all set.
> > > > > > 
> > > > > > Basically all data reads go through the readahead layer, and the
> > > > > > __do_page_cache_readahead() function.
> > > > > > 
> > > > > > Just one more option for your tradeoffs :)
> > > > > 
> > > > > But this does not cover direct IO?
> > > > 
> > > > Yes, sorry!
> > > > 
> > > > > But I guess if I split the hook into two parts (one in direct IO path
> > > > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > > > to mark meta data READS. I will look into it.
> > > > 
> > > > Right, and the hooks should be trivial to add.
> > > > 
> > > > The readahead code is typically invoked in three ways:
> > > > 
> > > > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > > > 
> > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> > > >   => page_cache_async_readahead()
> > > > 
> > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > > > 
> > > > ext3/4 also call into readahead on readdir().
> > > 
> > > So this will be called for even meta data READS. Then there is no
> > > advantage of moving the throttle hook out of generic_make_request()?
> > > Instead what I will need is that ask file systems to mark meta data
> > > IO so that I can avoid throttling.
> > 
> > Do you want to avoid meta data itself, or to avoid overall performance
> > being impacted as a result of meta data read throttling?
> 
> I wanted to avoid throttling metadata beacause it might lead to reduced
> overall performance due to dependencies in file system layer.

You can get meta data "throttling" and performance at the same time.
See below ideas.

> > 
> > Either way, you have the freedom to test whether the passed filp is a
> > normal file or a directory "file", and do conditional throttling.
> 
> Ok, will look into it. That will probably take care of READS. What 
> about WRITES and meta data. Is it safe to assume that any meta data
> write will come in some jounalling thread context and not in user 
> process context?

It's very possible to throttle meta data READS/WRITES, as long as they
can be attributed to the original task (assuming task oriented throttling
instead of bio/request oriented).

The trick is to separate the concepts of THROTTLING and ACCOUNTING.
You can ACCOUNT data and meta data reads/writes to the right task, and
only to THROTTLE the task when it's doing data reads/writes.

FYI I played the same trick for balance_dirty_pages_ratelimited() for
another reason: _accurate_ accounting of dirtied pages.

That trick should play well with most applications who do interleaved
data and meta data reads/writes. For the special case of "find" who
does pure meta data reads, we can still throttle it by playing another
trick: to THROTTLE meta data reads/writes with a much higher threshold
than that of data. So normal applications will be almost always be
throttled at data accesses while "find" will be throttled at meta data
accesses.

For a real example of how it works, you can check this patch (plus the
attached one)

writeback: IO-less balance_dirty_pages()
http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556

Where tsk->nr_dirtied does dirty ACCOUNTING and tsk->nr_dirtied_pause
is the threshold for THROTTLING. When

        tsk->nr_dirtied > tsk->nr_dirtied_pause

The task will voluntarily enter balance_dirty_pages() for taking a
nap (pause time will be proportional to tsk->nr_dirtied), and when
finished, start a new account-and-throttle period by resetting
tsk->nr_dirtied and possibly adjust tsk->nr_dirtied_pause for a more
reasonable pause time at next sleep.

BTW, I'd like to advocate balance_dirty_pages() based IO controller :)

As you may have noticed, it's not all that hard: the main functions
blkcg_update_bandwidth()/blkcg_update_dirty_ratelimit() can fit nicely
in one screen!

writeback: async write IO controllers
http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=1a58ad99ce1f6a9df6618a4b92fa4859cc3e7e90;hp=5b6fcb3125ea52ff04a2fad27a51307842deb1a0

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-01 12:56                       ` Christoph Hellwig
@ 2011-04-21 15:07                         ` Vivek Goyal
  0 siblings, 0 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-21 15:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, James Bottomley, lsf, linux-fsdevel, Jens Axboe

On Fri, Apr 01, 2011 at 08:56:54AM -0400, Christoph Hellwig wrote:
> On Fri, Apr 01, 2011 at 06:23:48PM +1100, Dave Chinner wrote:
> > Oh, I misread the code in _xfs_buf_read that fiddles with
> > _XBF_RUN_QUEUES. That flag is dead then, as is the XBF_LOG_BUFFER
> > code  which appears to have been superceded by the new XBF_ORDERED
> > code. Definitely needs cleaning up.
> 
> Yes, that's been on my todo list for a while, but I first want a sane
> defintion of REQ_META in the block layer.

Will splitting REQ_META in two will help. Say REQ_META_SYNC and
REQ_META_ASYNC. So meta requests which don't require any kind of priority
boost at CFQ can mark these REQ_META_ASYNC (XFS).

- So we retain the capability to mark metadata requests
- Priority boost only for selected meta data.
- Throttling can use this to avoid throttling meta data.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-21 15:06                                                   ` Wu Fengguang
@ 2011-04-21 15:10                                                     ` Wu Fengguang
  2011-04-21 17:20                                                     ` Vivek Goyal
  1 sibling, 0 replies; 166+ messages in thread
From: Wu Fengguang @ 2011-04-21 15:10 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jan Kara, James Bottomley, lsf, linux-fsdevel, Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 2294 bytes --]

Sorry, attached is the "separate ACCOUNTING from THROTTLING" patch.

> It's very possible to throttle meta data READS/WRITES, as long as they
> can be attributed to the original task (assuming task oriented throttling
> instead of bio/request oriented).
> 
> The trick is to separate the concepts of THROTTLING and ACCOUNTING.
> You can ACCOUNT data and meta data reads/writes to the right task, and
> only to THROTTLE the task when it's doing data reads/writes.
> 
> FYI I played the same trick for balance_dirty_pages_ratelimited() for
> another reason: _accurate_ accounting of dirtied pages.
> 
> That trick should play well with most applications who do interleaved
> data and meta data reads/writes. For the special case of "find" who
> does pure meta data reads, we can still throttle it by playing another
> trick: to THROTTLE meta data reads/writes with a much higher threshold
> than that of data. So normal applications will be almost always be
> throttled at data accesses while "find" will be throttled at meta data
> accesses.
> 
> For a real example of how it works, you can check this patch (plus the
> attached one)
> 
> writeback: IO-less balance_dirty_pages()
> http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556
> 
> Where tsk->nr_dirtied does dirty ACCOUNTING and tsk->nr_dirtied_pause
> is the threshold for THROTTLING. When
> 
>         tsk->nr_dirtied > tsk->nr_dirtied_pause
> 
> The task will voluntarily enter balance_dirty_pages() for taking a
> nap (pause time will be proportional to tsk->nr_dirtied), and when
> finished, start a new account-and-throttle period by resetting
> tsk->nr_dirtied and possibly adjust tsk->nr_dirtied_pause for a more
> reasonable pause time at next sleep.
> 
> BTW, I'd like to advocate balance_dirty_pages() based IO controller :)
> 
> As you may have noticed, it's not all that hard: the main functions
> blkcg_update_bandwidth()/blkcg_update_dirty_ratelimit() can fit nicely
> in one screen!
> 
> writeback: async write IO controllers
> http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=1a58ad99ce1f6a9df6618a4b92fa4859cc3e7e90;hp=5b6fcb3125ea52ff04a2fad27a51307842deb1a0
> 
> Thanks,
> Fengguang

[-- Attachment #2: writeback-accurate-task-dirtied.patch --]
[-- Type: text/x-diff, Size: 924 bytes --]

Subject: writeback: accurately account dirtied pages
Date: Thu Apr 14 07:52:37 CST 2011


Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-04-16 11:28:41.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-04-16 11:28:41.000000000 +0800
@@ -1352,8 +1352,6 @@ void balance_dirty_pages_ratelimited_nr(
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
-	current->nr_dirtied += nr_pages_dirtied;
-
 	if (dirty_exceeded_recently(bdi, MAX_PAUSE)) {
 		unsigned long max = current->nr_dirtied +
 						(128 >> (PAGE_SHIFT - 10));
@@ -1819,6 +1817,7 @@ void account_page_dirtied(struct page *p
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
+		current->nr_dirtied++;
 	}
 }
 EXPORT_SYMBOL(account_page_dirtied);

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-21 15:06                                                   ` Wu Fengguang
  2011-04-21 15:10                                                     ` Wu Fengguang
@ 2011-04-21 17:20                                                     ` Vivek Goyal
  2011-04-22  4:21                                                       ` Wu Fengguang
  1 sibling, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-21 17:20 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Thu, Apr 21, 2011 at 11:06:18PM +0800, Wu Fengguang wrote:

[..]
> 
> You can get meta data "throttling" and performance at the same time.
> See below ideas.
> 
> > > 
> > > Either way, you have the freedom to test whether the passed filp is a
> > > normal file or a directory "file", and do conditional throttling.
> > 
> > Ok, will look into it. That will probably take care of READS. What 
> > about WRITES and meta data. Is it safe to assume that any meta data
> > write will come in some jounalling thread context and not in user 
> > process context?
> 
> It's very possible to throttle meta data READS/WRITES, as long as they
> can be attributed to the original task (assuming task oriented throttling
> instead of bio/request oriented).

Even in bio oriented throttling we attribute the bio to a task and hence
to the group (atleast as of today). So from that perspective, it should
not make much difference.

> 
> The trick is to separate the concepts of THROTTLING and ACCOUNTING.
> You can ACCOUNT data and meta data reads/writes to the right task, and
> only to THROTTLE the task when it's doing data reads/writes.

Agreed. I too mentioned this idea in one of the mails that account meta data
but do not throttle meta data and use that meta data accounting to throttle
data for longer period of times.

For this to implement, I need to know whether an IO is regular IO or
metadata IO and looks like one of the ways will that filesystems mark
that info in bio for meta data requests.

> 
> FYI I played the same trick for balance_dirty_pages_ratelimited() for
> another reason: _accurate_ accounting of dirtied pages.
> 
> That trick should play well with most applications who do interleaved
> data and meta data reads/writes. For the special case of "find" who
> does pure meta data reads, we can still throttle it by playing another
> trick: to THROTTLE meta data reads/writes with a much higher threshold
> than that of data. So normal applications will be almost always be
> throttled at data accesses while "find" will be throttled at meta data
> accesses.

Ok, that makes sense. If an application is doing lots of meta data
transactions only then try to limit it after some high limit

I am not very sure if it will run into issues of some file system
dependencies and hence priority inversion.

> 
> For a real example of how it works, you can check this patch (plus the
> attached one)

Ok, I will go through the patches for more details.

> 
> writeback: IO-less balance_dirty_pages()
> http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556
> 
> Where tsk->nr_dirtied does dirty ACCOUNTING and tsk->nr_dirtied_pause
> is the threshold for THROTTLING. When
> 
>         tsk->nr_dirtied > tsk->nr_dirtied_pause
> 
> The task will voluntarily enter balance_dirty_pages() for taking a
> nap (pause time will be proportional to tsk->nr_dirtied), and when
> finished, start a new account-and-throttle period by resetting
> tsk->nr_dirtied and possibly adjust tsk->nr_dirtied_pause for a more
> reasonable pause time at next sleep.
> 
> BTW, I'd like to advocate balance_dirty_pages() based IO controller :)
> 

Actually implementing throttling in balance_dirty_pages() is not hard. I
think it has following issues.

- One controls the IO rate coming into the page cache and does not control
  the IO rate at the outgoing devices. So a flusher thread can still throw
  lots of writes at a device and completely disrupting read latencies.

  If buffered WRITES can disrupt READ latencies unexpectedly, then it kind
  of renders IO controller/throttling useless.

- For the application performance, I thought a better mechanism would be
  that we come up with per cgroup dirty ratio. This is equivalent to
  partitioning the page cache and coming up with cgroup's share. Now
  an application can write to this cache as fast as it want and is only
  throttled either by balance_dirty_pages() rules.

  All this IO must be going to some device and if an admin has put this cgroup
  in a low bandwidth group, then pages from this cgroup will be written
  slowly hence tasks in this group will be blocked for longer time.

 If we can make this work, then application can write to cache at higher
 rate at the same time not create a havoc at the end device.  

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-21 17:20                                                     ` Vivek Goyal
@ 2011-04-22  4:21                                                       ` Wu Fengguang
  2011-04-22 15:25                                                         ` Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Wu Fengguang @ 2011-04-22  4:21 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jan Kara, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Fri, Apr 22, 2011 at 01:20:40AM +0800, Vivek Goyal wrote:
> On Thu, Apr 21, 2011 at 11:06:18PM +0800, Wu Fengguang wrote:
> 
> [..]
> > 
> > You can get meta data "throttling" and performance at the same time.
> > See below ideas.
> > 
> > > > 
> > > > Either way, you have the freedom to test whether the passed filp is a
> > > > normal file or a directory "file", and do conditional throttling.
> > > 
> > > Ok, will look into it. That will probably take care of READS. What 
> > > about WRITES and meta data. Is it safe to assume that any meta data
> > > write will come in some jounalling thread context and not in user 
> > > process context?
> > 
> > It's very possible to throttle meta data READS/WRITES, as long as they
> > can be attributed to the original task (assuming task oriented throttling
> > instead of bio/request oriented).
> 
> Even in bio oriented throttling we attribute the bio to a task and hence
> to the group (atleast as of today). So from that perspective, it should
> not make much difference.

OK, good to learn about that :)

> > 
> > The trick is to separate the concepts of THROTTLING and ACCOUNTING.
> > You can ACCOUNT data and meta data reads/writes to the right task, and
> > only to THROTTLE the task when it's doing data reads/writes.
> 
> Agreed. I too mentioned this idea in one of the mails that account meta data
> but do not throttle meta data and use that meta data accounting to throttle
> data for longer period of times.

That's great.

> For this to implement, I need to know whether an IO is regular IO or
> metadata IO and looks like one of the ways will that filesystems mark
> that info in bio for meta data requests.

OK.

> > 
> > FYI I played the same trick for balance_dirty_pages_ratelimited() for
> > another reason: _accurate_ accounting of dirtied pages.
> > 
> > That trick should play well with most applications who do interleaved
> > data and meta data reads/writes. For the special case of "find" who
> > does pure meta data reads, we can still throttle it by playing another
> > trick: to THROTTLE meta data reads/writes with a much higher threshold
> > than that of data. So normal applications will be almost always be
> > throttled at data accesses while "find" will be throttled at meta data
> > accesses.
> 
> Ok, that makes sense. If an application is doing lots of meta data
> transactions only then try to limit it after some high limit
> 
> I am not very sure if it will run into issues of some file system
> dependencies and hence priority inversion.

It's safe at least for task-context reads? For meta data writes, we
may also differentiate task-context DIRTY, kernel-context DIRTY and
WRITEOUT. We should still be able to throttle task-context meta data
DIRTY, probably not for kernel-context DIRTY, and never for WRITEOUT.

> > For a real example of how it works, you can check this patch (plus the
> > attached one)
> 
> Ok, I will go through the patches for more details.

Thanks!
FYI this document describes the basic ideas in the first 14 pages.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf

> > 
> > writeback: IO-less balance_dirty_pages()
> > http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556
> > 
> > Where tsk->nr_dirtied does dirty ACCOUNTING and tsk->nr_dirtied_pause
> > is the threshold for THROTTLING. When
> > 
> >         tsk->nr_dirtied > tsk->nr_dirtied_pause
> > 
> > The task will voluntarily enter balance_dirty_pages() for taking a
> > nap (pause time will be proportional to tsk->nr_dirtied), and when
> > finished, start a new account-and-throttle period by resetting
> > tsk->nr_dirtied and possibly adjust tsk->nr_dirtied_pause for a more
> > reasonable pause time at next sleep.
> > 
> > BTW, I'd like to advocate balance_dirty_pages() based IO controller :)
> > 
> 
> Actually implementing throttling in balance_dirty_pages() is not hard. I
> think it has following issues.
> 
> - One controls the IO rate coming into the page cache and does not control
>   the IO rate at the outgoing devices. So a flusher thread can still throw
>   lots of writes at a device and completely disrupting read latencies.
> 
>   If buffered WRITES can disrupt READ latencies unexpectedly, then it kind
>   of renders IO controller/throttling useless.

Hmm..I doubt IO controller is the right solution to this problem at all.

It's such a fundamental problem that it would be Linux's failure to
recommend normal users to use IO controller for the sake of good read
latencies in the presence of heavy writes.

It actually helps reducing seeks when the flushers submit async write
requests in bursts (eg. 1 second). It will then kind of optimally
"work on this bdi area on behalf of this flusher for 1 second, and
then to the other area for 1 second...". The IO scheduler should have
similar optimizations, which should generally work better with more
clustered data supplies from the flushers. (Sorry I'm not tracking the
cfq code, so it's all general hypothesis and please correct me...)

The IO scheduler looks like the right owner to safeguard read latencies.
Where you already have the commit 365722bb917b08b7 ("cfq-iosched:
delay async IO dispatch, if sync IO was just done") and friends.
They do such a good job that if there are continual reads, the async
writes will be totally starved.

But yeah that still leaves sporadic reads at the mercy of heavy
writes, where the default policy will prefer write throughput to read
latencies.

And there is the "no heavy writes to saturate the disk in long term,
but still temporal heavy writes created by the bursty flushing" case.
In this case the device level throttling has the nice side effect of
smoothing writes out without performance penalties. However, if it's
so useful so that you regard it as an important target, why not build
some smoothing logic into the flushers? It has the great prospect of
benefiting _all_ users _by default_ :)

> - For the application performance, I thought a better mechanism would be
>   that we come up with per cgroup dirty ratio. This is equivalent to
>   partitioning the page cache and coming up with cgroup's share. Now
>   an application can write to this cache as fast as it want and is only
>   throttled either by balance_dirty_pages() rules.
> 
>   All this IO must be going to some device and if an admin has put this cgroup
>   in a low bandwidth group, then pages from this cgroup will be written
>   slowly hence tasks in this group will be blocked for longer time.
> 
>  If we can make this work, then application can write to cache at higher
>  rate at the same time not create a havoc at the end device.  

The memcg dirty ratio is fundamentally different from blkio
throttling. The former aims to eliminate excessive pageout()s when
reclaiming pages from the memcg LRU lists. It treats "dirty pages" as
throttle goal, and has the side effect throttling the task at the rate
the memcg's dirty inodes can be flushed to disk. Its complexity
originates from the correlation with "how the flusher selects the
inodes to writeout". Unfortunately the flusher by nature works in a
coarse way..

OTOH, blkio-cgroup don't need to care about inode selection at all.
It's enough to account and throttle tasks' dirty rate, and let the
flusher freely work on whatever dirtied inodes.

In this manner, blkio-cgroup dirty rate throttling is more user
oriented. While memcg dirty pages throttling looks like a complex
solution to some technical problems (if me understand it right).

The blkio-cgroup dirty throttling code can mainly go to
page-writeback.c, while the memcg code will mainly go to
fs-writeback.c (balance_dirty_pages() will also be involved, but
that's actually a more trivial part).

The correlations seem to be,

- you can get the page tagging functionality from memcg, if doing
  async write throttling at device level

- the side effect of rate limiting by memcg's dirty pages throttling,
  which is much less controllable than blkio-cgroup's rate limiting

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-22  4:21                                                       ` Wu Fengguang
@ 2011-04-22 15:25                                                         ` Vivek Goyal
  2011-04-22 16:28                                                           ` Andrea Arcangeli
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-22 15:25 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, James Bottomley, lsf, linux-fsdevel, Dave Chinner

On Fri, Apr 22, 2011 at 12:21:23PM +0800, Wu Fengguang wrote:

[..]
> > > BTW, I'd like to advocate balance_dirty_pages() based IO controller :)
> > > 
> > 
> > Actually implementing throttling in balance_dirty_pages() is not hard. I
> > think it has following issues.
> > 
> > - One controls the IO rate coming into the page cache and does not control
> >   the IO rate at the outgoing devices. So a flusher thread can still throw
> >   lots of writes at a device and completely disrupting read latencies.
> > 
> >   If buffered WRITES can disrupt READ latencies unexpectedly, then it kind
> >   of renders IO controller/throttling useless.
> 
> Hmm..I doubt IO controller is the right solution to this problem at all.
> 
> It's such a fundamental problem that it would be Linux's failure to
> recommend normal users to use IO controller for the sake of good read
> latencies in the presence of heavy writes.

It is and we have modified CFQ a lot to tackle that but still... 

Just do a "dd if=/dev/zero of=/zerofile bs=1M count=4K" on your root
disk and then try to launch firefox and browse few websites and see if
you are happy with the response of the firefox. It took me more than
a minute to launch firefox and be able to input and load first website.

But I agree that READ latencies in presence of WRITES can be a problem
independent of IO controller.

Also there is another case of cluster where IO is coming to storage from
multiple hosts and one does not probably want a flurry of WRITES from
one host so that IO of other hosts is not severely impacted. In that case
IO scheduler can't do much as it has the view of single system.  

Secondly, the whole thing with IO controller is that it provides user more
control of IO instead of living with a default system specific policy. For
example, an admin might want to just look for better latencies for READS
and is willing to give up on WRITE throughput. So if IO controller is
properly implemented, he might say that my WRITE intensive application
I am putting in a cgroup with WRITE limit of 20MB/s. Now the READ
latencies in root cgroup should be better and may be predictable too as
we know the WRITE rate to disk never exceedes 20MB/s. 

Also it is only CFQ which provides READS so much preferrence over WRITES.
deadline and noop do not which we typically use on faster storage. There
we might take a bigger hit on READ latencies depending on what storage
is and how effected it is with a burst of WRITES.

I guess it boils down to better system control and better predictability.

So I think throttling buffered writes in balance_dirty_pages() is better
than not providing any way to control buffered WRITES at all but controlling
it at end device provides much better control on IO and serves more use
cases.

> 
> It actually helps reducing seeks when the flushers submit async write
> requests in bursts (eg. 1 second). It will then kind of optimally
> "work on this bdi area on behalf of this flusher for 1 second, and
> then to the other area for 1 second...". The IO scheduler should have
> similar optimizations, which should generally work better with more
> clustered data supplies from the flushers. (Sorry I'm not tracking the
> cfq code, so it's all general hypothesis and please correct me...)
> 

Isolation and throughput are orthogonal. You go for better isolation
and you will esentially pay by reduced throughput. Now as a user one
can decide what are his priorities. I see it as a slider where on one
end it is 100% isolation and on other end it is 100% throughput. Now
a user can slide the slider and keep that somewhere in between depending
on his/her needs. One of the goals of IO controller is to provide that
fine grained control. By implementing throttling in balance_dirty_pages()
we really lose that capability.

Also flusher still will submit the requests in burst. flusher will still
pick one inode at a time so IO is as sequential as possible. We will still
do the IO-lesss throttling to reduce the seeks.  If we do IO throttling
below page cache, it also gives us the capability to control flusher
IO burst. Gives user a fine grained control which is lost if we do
the control while entering page cache. 

> The IO scheduler looks like the right owner to safeguard read latencies.
> Where you already have the commit 365722bb917b08b7 ("cfq-iosched:
> delay async IO dispatch, if sync IO was just done") and friends.
> They do such a good job that if there are continual reads, the async
> writes will be totally starved.
> 
> But yeah that still leaves sporadic reads at the mercy of heavy
> writes, where the default policy will prefer write throughput to read
> latencies.

Well, there is no default policy as such. CFQ tries to prioritize READs
as much as it can.  Deadline does not as much. So as I said previously,
we really are not controlling the burst. We are leaving it to IO scheduler
to handle it as per its policy and lose isolation between the groups which
is primary purpose of IO controller.

IOW, doing throttling below page cache allows us much better/smoother
control of IO.

> 
> And there is the "no heavy writes to saturate the disk in long term,
> but still temporal heavy writes created by the bursty flushing" case.
> In this case the device level throttling has the nice side effect of
> smoothing writes out without performance penalties. However, if it's
> so useful so that you regard it as an important target, why not build
> some smoothing logic into the flushers? It has the great prospect of
> benefiting _all_ users _by default_ :)

We already have implemented the control at lower layers. So we really
don't have to build secondary control now. Just that rest of the
subsystems have to be aware of cgroups and play nicely.

At high level smoothing logic is just another throttling technique.
Whether to throttle process abruptly or try to put more complex technique
to smooth out the traffic. It is just a knob. The key question here
is where to put the knob in stack for maximum degree of control.

flusher logic is already complicated. I am not sure what we will gain
by training flushers about the IO rate and throttling it based on user
policies. We can let lower layers do it as long as we can make sure
flusher is aware of cgroups and can select inodes to flush in such a
manner that it does not get blocked behind slow cgroups and can keep
all the cgroups busy.

The challenge I am facing here is the file system dependencies on IO.
One example is that if I throttle fsync IO, then it leads to issues
with journalling and other IO in filesystem seems to be stopping.

> 
> > - For the application performance, I thought a better mechanism would be
> >   that we come up with per cgroup dirty ratio. This is equivalent to
> >   partitioning the page cache and coming up with cgroup's share. Now
> >   an application can write to this cache as fast as it want and is only
> >   throttled either by balance_dirty_pages() rules.
> > 
> >   All this IO must be going to some device and if an admin has put this cgroup
> >   in a low bandwidth group, then pages from this cgroup will be written
> >   slowly hence tasks in this group will be blocked for longer time.
> > 
> >  If we can make this work, then application can write to cache at higher
> >  rate at the same time not create a havoc at the end device.  
> 
> The memcg dirty ratio is fundamentally different from blkio
> throttling. The former aims to eliminate excessive pageout()s when
> reclaiming pages from the memcg LRU lists. It treats "dirty pages" as
> throttle goal, and has the side effect throttling the task at the rate
> the memcg's dirty inodes can be flushed to disk. Its complexity
> originates from the correlation with "how the flusher selects the
> inodes to writeout". Unfortunately the flusher by nature works in a
> coarse way..

memcg dirty ratio is a different problem but it needs to work with IO
controller to solve the whole issue. If IO was just direct IO, and no
page cache in picture we don't need memcg. But the momemnt, page cache
comes into the picture, immediately comes the notion of logically
dividing that page cache among cgroups. And comes the notion of
dirty ratio per cgroup so that even if the overall cache usage is
less but this cgroups has consumed its share of dirty pages and now we
need to throttle it and ask flusher to send IO to underlying devices.

IO controller is sitting below page cache. So we need to make sure
that memcg is enhanced to support per cgroup dirt ratio, and train flusher
threads so that they are aware of cgroup presence and can do writeout
in per memcg aware manner. Greg Thelen is working on putting these two
pieces together.

So memcg dirty ratio is a different problem but is required to make IO
controller work for buffered WRITES.

> 
> OTOH, blkio-cgroup don't need to care about inode selection at all.
> It's enough to account and throttle tasks' dirty rate, and let the
> flusher freely work on whatever dirtied inodes.

That goes back to the model of putting the knob in balance_dirty_pages().
Yes it simplifies the implementation but also takes away the capability
of better control. One would still see the burst of WRITES at end devices.

> 
> In this manner, blkio-cgroup dirty rate throttling is more user
> oriented. While memcg dirty pages throttling looks like a complex
> solution to some technical problems (if me understand it right).

If we implement IO throttling in balance_dirty_pages(), then we don't 
require memcg dirty ratio thing for it to work. But we will still reuire
memcg dirty ratio for other reasons.

- Prportional IO control for CFQ
- memcg's own problems of starting to write out pages from a cgroup
  earlier.

> 
> The blkio-cgroup dirty throttling code can mainly go to
> page-writeback.c, while the memcg code will mainly go to
> fs-writeback.c (balance_dirty_pages() will also be involved, but
> that's actually a more trivial part).
> 
> The correlations seem to be,
> 
> - you can get the page tagging functionality from memcg, if doing
>   async write throttling at device level
> 
> - the side effect of rate limiting by memcg's dirty pages throttling,
>   which is much less controllable than blkio-cgroup's rate limiting

Well, I thought memcg's per cgroup ratio and IO controller's rate limit
will work together. memcgroup will keep track of per cgroup share of 
page cache and when caches usage is more than certain %, it will ask
flusher to send IO to device and then IO controller will throttle that
IO. Now if rate limit of the cgroup is less, then task of that cgroup
will be throttled for longer in balance_dirty_pages(). 

So throttling is happening at two layers. One throttling is in
balance_dirty_pages() which is actually not dependent on user inputted
parameters. It is more dependent on what's the page cache share of 
this cgroup and what's the effecitve IO rate this cgroup is getting.
The real IO throttling is happning at device level which is dependent
on parameters inputted by user and which in-turn indirectly should decide
how tasks are throttled in balance_dirty_pages().

I have yet to look at your implementation of throttling but keep in 
mind that once IO controller comes into picture the throttling/smoothing
mechanism also needs to be able to take into account direct writes and
we should be able to use same algorithms for throttling READS.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-22 15:25                                                         ` Vivek Goyal
@ 2011-04-22 16:28                                                           ` Andrea Arcangeli
  2011-04-25 18:19                                                             ` Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Andrea Arcangeli @ 2011-04-22 16:28 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, James Bottomley, lsf, Dave Chinner, linux-fsdevel

On Fri, Apr 22, 2011 at 11:25:31AM -0400, Vivek Goyal wrote:
> It is and we have modified CFQ a lot to tackle that but still... 
> 
> Just do a "dd if=/dev/zero of=/zerofile bs=1M count=4K" on your root
> disk and then try to launch firefox and browse few websites and see if
> you are happy with the response of the firefox. It took me more than
> a minute to launch firefox and be able to input and load first website.
> 
> But I agree that READ latencies in presence of WRITES can be a problem
> independent of IO controller.

Reading this I've some dejavu, this is literally a decade old problem,
so old that when I first worked on it the elevator had no notion of
latency and it would potentially infinitely starve any I/O (regardless
of read/write) at the end of the disk if any I/O before the end would
keep coming in ;).

We're orders of magnitude better these days, but one thing I didn't
see mentioned is that according to memories, a lot of it had to do
with the way the DMA command size can grow to the maximum allowed by
the sg table for writes, but reads (especially metadata and small
files where readahead is less effective) won't grow to the maximum or
even if it grows to the maximum the readahead may not be useful
(userland will seek again not reading into the readahead) and even if
synchronous metadata reads aren't involved it'll submit another
physical readahead after having satisfied only a little userland read.

So even if you have a totally unfair io scheduler that places the next
read request always at the top of the queue (ignoring any fairness
requirement), you're still going to have the synchronous small read
dma waiting at the top of the queue for the large dma write to
complete.

The time I got the dd if=/dev/zero working best is when I broke the
throughput by massively reducing the dma size (by error or intentional
frankly I don't remember). SATA requires ~64k large dma to run at peak
speed, and I expect if you reduce it to 4k it'll behave a lot better
than current 256k. Some very old scsi device I had performed best at
512k dma (much faster than 64k). The max sector size is still 512k
today, probably 256k (or only 128k) for SATA but likely above 64k (as
it saves CPU even if throughput can be maxed out at ~64k dma as far as
the platter is concerned).

> Also it is only CFQ which provides READS so much preferrence over WRITES.
> deadline and noop do not which we typically use on faster storage. There
> we might take a bigger hit on READ latencies depending on what storage
> is and how effected it is with a burst of WRITES.
> 
> I guess it boils down to better system control and better predictability.

I tend to think to get even better read latency and predictability,
the IO scheduler could dynamically and temporarily reduce the max
sector size of the write dma (and also ensure any read readahead is
also reduced to the dynamic reduced sector size or it'd be detrimental
on the number of read DMA issued for each userland read).

Maybe with tagged queuing things are better and the dma size doesn't
make a difference anymore, I don't know. Surely Jens knows this best
and can tell me if I'm wrong.

Anyway it should be real easy to test, just a two liner reducing the
max sector size to scsi_lib and the max readahead, should allow you to
see how fast firefox starts with cfq when dd if=/dev/zero is running
and if there's any difference at all.

I've seen huge work on cfq but still the max merging remains at top
and it doesn't decrease dynamically and I doubt you can get
real unnoticeable writeback to reads, without such a chance, no matter
how the IO scheduler is otherwise implemented.

I'm unsure if this will ever be really viable in single user
environment (often absolute throughput is more important and that is
clearly higher - at least for the writeback - by keeping the max
sector fixed to the max), but if cgroup wants to make a dd
if=/dev/zero of=zero bs=10M oflag=direct from one group unnoticeable
to the other cgroups that are reading, it's worth researching if this
still an actual issue with todays hardware. I guess SSD won't change
it much, as it's a DMA duration issue, not seeks, in fact it may be
way more noticeable on SSD as seeks will be less costly leaving the
duration effect more visible.

> So throttling is happening at two layers. One throttling is in
> balance_dirty_pages() which is actually not dependent on user inputted
> parameters. It is more dependent on what's the page cache share of 
> this cgroup and what's the effecitve IO rate this cgroup is getting.
> The real IO throttling is happning at device level which is dependent
> on parameters inputted by user and which in-turn indirectly should decide
> how tasks are throttled in balance_dirty_pages().

This sounds a fine design to me.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-22 16:28                                                           ` Andrea Arcangeli
@ 2011-04-25 18:19                                                             ` Vivek Goyal
  2011-04-26 14:37                                                               ` Vivek Goyal
  0 siblings, 1 reply; 166+ messages in thread
From: Vivek Goyal @ 2011-04-25 18:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Wu Fengguang, James Bottomley, lsf, Dave Chinner, linux-fsdevel,
	Jens Axboe

On Fri, Apr 22, 2011 at 06:28:29PM +0200, Andrea Arcangeli wrote:

[..]
> > Also it is only CFQ which provides READS so much preferrence over WRITES.
> > deadline and noop do not which we typically use on faster storage. There
> > we might take a bigger hit on READ latencies depending on what storage
> > is and how effected it is with a burst of WRITES.
> > 
> > I guess it boils down to better system control and better predictability.
> 
> I tend to think to get even better read latency and predictability,
> the IO scheduler could dynamically and temporarily reduce the max
> sector size of the write dma (and also ensure any read readahead is
> also reduced to the dynamic reduced sector size or it'd be detrimental
> on the number of read DMA issued for each userland read).
> 
> Maybe with tagged queuing things are better and the dma size doesn't
> make a difference anymore, I don't know. Surely Jens knows this best
> and can tell me if I'm wrong.
> 
> Anyway it should be real easy to test, just a two liner reducing the
> max sector size to scsi_lib and the max readahead, should allow you to
> see how fast firefox starts with cfq when dd if=/dev/zero is running
> and if there's any difference at all.

I did some quick runs.

- Default queue depth is 31 on my SATA disk. Reducing queue depth to 1
  helps a bit.

  In CFQ we already try to reduce the queue depth of WRITES if READS
  are going on.

- I reduced /sys/block/sda/queue/max_sector_kb to 16. That seemed to
  help with firefox launch time.

There are couple of interesting observations though.

- Even after I reduced max_sector_kb to 16, I saw requests of 1024 sector
  size coming from flusher threads. 

- Firefox launch time reduced by reducing the max_sector_kb but it did
  not help much when I tried to launch first website "lwn.net". It still
  took me little more than 1 minute, to be able to select lwn.net from
  cached entries and then be able to really load and display the page.

I will spend more time figuring out what's happening here.

But in general, reducing the max request size dynamically sounds
interesting. I am not sure how upper layers are impacted because
of this (dm etc).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-25 18:19                                                             ` Vivek Goyal
@ 2011-04-26 14:37                                                               ` Vivek Goyal
  0 siblings, 0 replies; 166+ messages in thread
From: Vivek Goyal @ 2011-04-26 14:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Wu Fengguang, James Bottomley, lsf, Dave Chinner, linux-fsdevel,
	Jens Axboe

On Mon, Apr 25, 2011 at 02:19:54PM -0400, Vivek Goyal wrote:
> On Fri, Apr 22, 2011 at 06:28:29PM +0200, Andrea Arcangeli wrote:
> 
> [..]
> > > Also it is only CFQ which provides READS so much preferrence over WRITES.
> > > deadline and noop do not which we typically use on faster storage. There
> > > we might take a bigger hit on READ latencies depending on what storage
> > > is and how effected it is with a burst of WRITES.
> > > 
> > > I guess it boils down to better system control and better predictability.
> > 
> > I tend to think to get even better read latency and predictability,
> > the IO scheduler could dynamically and temporarily reduce the max
> > sector size of the write dma (and also ensure any read readahead is
> > also reduced to the dynamic reduced sector size or it'd be detrimental
> > on the number of read DMA issued for each userland read).
> > 
> > Maybe with tagged queuing things are better and the dma size doesn't
> > make a difference anymore, I don't know. Surely Jens knows this best
> > and can tell me if I'm wrong.
> > 
> > Anyway it should be real easy to test, just a two liner reducing the
> > max sector size to scsi_lib and the max readahead, should allow you to
> > see how fast firefox starts with cfq when dd if=/dev/zero is running
> > and if there's any difference at all.
> 
> I did some quick runs.
> 
> - Default queue depth is 31 on my SATA disk. Reducing queue depth to 1
>   helps a bit.
> 
>   In CFQ we already try to reduce the queue depth of WRITES if READS
>   are going on.
> 
> - I reduced /sys/block/sda/queue/max_sector_kb to 16. That seemed to
>   help with firefox launch time.
> 
> There are couple of interesting observations though.
> 
> - Even after I reduced max_sector_kb to 16, I saw requests of 1024 sector
>   size coming from flusher threads. 
> 

I realized that I had a dm device sitting on top of sda and I was
changing max_sector_kb only on sda and not on dm device hence request
size was still 1024 sector each.

I changed max_sector_kb to 16 and that seems to help. Launching and
loading first website time comes down from 1minute to roughly 30 seconds.

At dm layer no IO scheduler is running so IO scheduler really can't
do much in controlling the request size dynamically depending on what's
happening on the device.

I am not sure if one can break the requests in smaller pieces in IO
scheduler if reads are going on.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 166+ messages in thread

end of thread, other threads:[~2011-04-26 14:37 UTC | newest]

Thread overview: 166+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1301373398.2590.20.camel@mulgrave.site>
2011-03-29  5:14 ` [Lsf] Preliminary Agenda and Activities for LSF Amir Goldstein
2011-03-29 11:16 ` Ric Wheeler
2011-03-29 11:22   ` Matthew Wilcox
2011-03-29 12:17     ` Jens Axboe
2011-03-29 13:09       ` Martin K. Petersen
2011-03-29 13:09         ` Martin K. Petersen
2011-03-29 13:12         ` Ric Wheeler
2011-03-29 13:38         ` James Bottomley
2011-03-29 17:20   ` Shyam_Iyer
2011-03-29 17:20     ` Shyam_Iyer
2011-03-29 17:33     ` Vivek Goyal
2011-03-29 18:10       ` Shyam_Iyer
2011-03-29 18:10         ` Shyam_Iyer
2011-03-29 18:45         ` Vivek Goyal
2011-03-29 19:13           ` Shyam_Iyer
2011-03-29 19:13             ` Shyam_Iyer
2011-03-29 19:57             ` Vivek Goyal
2011-03-29 19:59             ` Mike Snitzer
2011-03-29 20:12               ` Shyam_Iyer
2011-03-29 20:12                 ` Shyam_Iyer
2011-03-29 20:23                 ` Mike Snitzer
2011-03-29 23:09                   ` Shyam_Iyer
2011-03-29 23:09                     ` Shyam_Iyer
2011-03-30  5:58                     ` [Lsf] " Hannes Reinecke
2011-03-30 14:02                       ` James Bottomley
2011-03-30 14:10                         ` Hannes Reinecke
2011-03-30 14:26                           ` James Bottomley
2011-03-30 14:55                             ` Hannes Reinecke
2011-03-30 15:33                               ` James Bottomley
2011-03-30 15:46                                 ` Shyam_Iyer
2011-03-30 15:46                                   ` Shyam_Iyer
2011-03-30 20:32                                 ` Giridhar Malavali
2011-03-30 20:45                                   ` James Bottomley
2011-03-29 19:47   ` Nicholas A. Bellinger
2011-03-29 20:29   ` Jan Kara
2011-03-29 20:31     ` Ric Wheeler
2011-03-30  0:33   ` Mingming Cao
2011-03-30  2:17     ` Dave Chinner
2011-03-30 11:13       ` Theodore Tso
2011-03-30 11:28         ` Ric Wheeler
2011-03-30 14:07           ` Chris Mason
2011-04-01 15:19           ` Ted Ts'o
2011-04-01 16:30             ` Amir Goldstein
2011-04-01 21:46               ` Joel Becker
2011-04-02  3:26                 ` Amir Goldstein
2011-04-01 21:43             ` Joel Becker
2011-04-01 21:43             ` Joel Becker
2011-04-01 21:43             ` Joel Becker
2011-03-30 21:49       ` Mingming Cao
2011-03-31  0:05         ` Matthew Wilcox
2011-03-31  1:00         ` Joel Becker
2011-04-01 21:34           ` Mingming Cao
2011-04-01 21:49             ` Joel Becker
2011-03-29 15:35 ` [LSF][MM] page allocation & direct reclaim latency Rik van Riel
2011-03-29 19:05   ` [Lsf] " Andrea Arcangeli
2011-03-29 20:35     ` Ying Han
2011-03-29 20:39       ` Ying Han
2011-03-29 20:45       ` Andrea Arcangeli
2011-03-29 20:53         ` Ying Han
2011-03-29 21:22     ` Rik van Riel
2011-03-29 22:38       ` Andrea Arcangeli
2011-03-29 22:13     ` Minchan Kim
2011-03-29 23:12       ` Andrea Arcangeli
2011-03-30 16:17       ` Mel Gorman
2011-03-30 16:49         ` Andrea Arcangeli
2011-03-31  0:42           ` Hugh Dickins
2011-03-31 15:15             ` Andrea Arcangeli
2011-03-31  9:30           ` Mel Gorman
2011-03-31 16:36             ` Andrea Arcangeli
2011-03-30 16:59         ` Dan Magenheimer
2011-03-29 17:35 ` [Lsf] Preliminary Agenda and Activities for LSF Chad Talbott
2011-03-29 19:09   ` Vivek Goyal
2011-03-29 20:14     ` Chad Talbott
2011-03-29 20:35     ` Jan Kara
2011-03-29 21:08       ` Greg Thelen
2011-03-30  4:18   ` Dave Chinner
2011-03-30 15:37     ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) Vivek Goyal
2011-03-30 22:20       ` Dave Chinner
2011-03-30 22:49         ` Chad Talbott
2011-03-31  3:00           ` Dave Chinner
2011-03-31 14:16         ` Vivek Goyal
2011-03-31 14:34           ` Chris Mason
2011-03-31 22:14             ` Dave Chinner
2011-03-31 23:43               ` Chris Mason
2011-04-01  0:55                 ` Dave Chinner
2011-04-01  1:34               ` Vivek Goyal
2011-04-01  4:36                 ` Dave Chinner
2011-04-01  6:32                   ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Christoph Hellwig
2011-04-01  7:23                     ` Dave Chinner
2011-04-01 12:56                       ` Christoph Hellwig
2011-04-21 15:07                         ` Vivek Goyal
2011-04-01 14:49                   ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] " Vivek Goyal
2011-03-31 22:25             ` Vivek Goyal
2011-03-31 14:50           ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Greg Thelen
2011-03-31 22:27             ` Dave Chinner
2011-04-01 17:18               ` Vivek Goyal
2011-04-01 19:57                 ` [LSF]: fc_rport attributes to further populate HBAAPIv2 Giridhar Malavali
2011-04-01 21:49                 ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner
2011-04-02  7:33                   ` Greg Thelen
2011-04-02  7:34                     ` Greg Thelen
2011-04-05 13:13                   ` Vivek Goyal
2011-04-05 22:56                     ` Dave Chinner
2011-04-06 14:49                       ` Curt Wohlgemuth
2011-04-06 15:39                         ` Vivek Goyal
2011-04-06 19:49                           ` Greg Thelen
2011-04-06 23:07                           ` [Lsf] IO less throttling and cgroup aware writeback Greg Thelen
2011-04-06 23:36                             ` Dave Chinner
2011-04-07 19:24                               ` Vivek Goyal
2011-04-07 20:33                                 ` Christoph Hellwig
2011-04-07 21:34                                   ` Vivek Goyal
2011-04-07 23:42                                 ` Dave Chinner
2011-04-08  0:59                                   ` Greg Thelen
2011-04-08  1:25                                     ` Dave Chinner
2011-04-08  1:25                                       ` Dave Chinner
2011-04-12  3:17                                       ` KAMEZAWA Hiroyuki
2011-04-08 13:43                                   ` Vivek Goyal
2011-04-06 23:08                         ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner
2011-04-07 20:04                           ` Vivek Goyal
2011-04-07 23:47                             ` Dave Chinner
2011-04-08 13:50                               ` Vivek Goyal
2011-04-11  1:05                                 ` Dave Chinner
2011-04-06 15:37                       ` Vivek Goyal
2011-04-06 16:08                         ` Vivek Goyal
2011-04-06 17:10                           ` Jan Kara
2011-04-06 17:14                             ` Curt Wohlgemuth
2011-04-08  1:58                             ` Dave Chinner
2011-04-19 14:26                               ` Wu Fengguang
2011-04-06 23:50                         ` Dave Chinner
2011-04-07 17:55                           ` Vivek Goyal
2011-04-11  1:36                             ` Dave Chinner
2011-04-15 21:07                               ` Vivek Goyal
2011-04-16  3:06                                 ` Vivek Goyal
2011-04-18 21:58                                   ` Jan Kara
2011-04-18 22:51                                     ` cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) Vivek Goyal
2011-04-19  0:33                                       ` Dave Chinner
2011-04-19 14:30                                         ` Vivek Goyal
2011-04-19 14:45                                           ` Jan Kara
2011-04-19 17:17                                           ` Vivek Goyal
2011-04-19 18:30                                             ` Vivek Goyal
2011-04-21  0:32                                               ` Dave Chinner
2011-04-21  0:29                                           ` Dave Chinner
2011-04-19 14:17                               ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Wu Fengguang
2011-04-19 14:34                                 ` Vivek Goyal
2011-04-19 14:48                                   ` Jan Kara
2011-04-19 15:11                                     ` Vivek Goyal
2011-04-19 15:22                                       ` Wu Fengguang
2011-04-19 15:31                                         ` Vivek Goyal
2011-04-19 16:58                                           ` Wu Fengguang
2011-04-19 17:05                                             ` Vivek Goyal
2011-04-19 20:58                                               ` Jan Kara
2011-04-20  1:21                                                 ` Wu Fengguang
2011-04-20 10:56                                                   ` Jan Kara
2011-04-20 11:19                                                     ` Wu Fengguang
2011-04-20 14:42                                                       ` Jan Kara
2011-04-20  1:16                                               ` Wu Fengguang
2011-04-20 18:44                                                 ` Vivek Goyal
2011-04-20 19:16                                                   ` Jan Kara
2011-04-21  0:17                                                   ` Dave Chinner
2011-04-21 15:06                                                   ` Wu Fengguang
2011-04-21 15:10                                                     ` Wu Fengguang
2011-04-21 17:20                                                     ` Vivek Goyal
2011-04-22  4:21                                                       ` Wu Fengguang
2011-04-22 15:25                                                         ` Vivek Goyal
2011-04-22 16:28                                                           ` Andrea Arcangeli
2011-04-25 18:19                                                             ` Vivek Goyal
2011-04-26 14:37                                                               ` Vivek Goyal

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.