* [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
@ 2013-12-30 21:36 Chris Mason
  2013-12-31  8:49 ` Zheng Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Mason @ 2013-12-30 21:36 UTC (permalink / raw)
  To: lsf-pc, linux-fsdevel

Hi everyone,

I'd like to attend the LSF/MM conference this year.  My current
discussion points include:

All things Btrfs!

Adding cgroups for more filesystem resources, especially to limit the
speed at which dirty pages are created.

I'm trying to collect a short summary of how Linux sucks from my new
employer.  My inbox isn't exactly overflowing with complaints yet, but
I'm hoping for a few unsolved warts we can all argue over.

-chris


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2013-12-30 21:36 [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook Chris Mason
@ 2013-12-31  8:49 ` Zheng Liu
  2013-12-31  9:36   ` Jeff Liu
  2013-12-31 12:45   ` [Lsf-pc] " Jan Kara
  0 siblings, 2 replies; 21+ messages in thread
From: Zheng Liu @ 2013-12-31  8:49 UTC (permalink / raw)
  To: Chris Mason; +Cc: lsf-pc, linux-fsdevel

Hi Chris,

On Mon, Dec 30, 2013 at 09:36:20PM +0000, Chris Mason wrote:
> Hi everyone,
> 
> I'd like to attend the LSF/MM conference this year.  My current
> discussion points include:
> 
> All things Btrfs!
> 
> Adding cgroups for more filesystem resources, especially to limit the
> speed dirty pages are created.

Interesting.  If I remember correctly, IO-less dirty throttling has been
merged into the upstream kernel, which can limit the speed at which dirty
pages are created.  Does it have any defects?

Out of curiosity, which file system resource do you want to control?

Regards,
                                                - Zheng

> 
> I'm trying to collect a short summary of how Linux sucks from my new
> employer.  My inbox isn't exactly overflowing with complaints yet, but
> I'm hoping for a few unsolved warts we can all argue over.
> 
> -chris
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2013-12-31  8:49 ` Zheng Liu
@ 2013-12-31  9:36   ` Jeff Liu
  2013-12-31 12:45   ` [Lsf-pc] " Jan Kara
  1 sibling, 0 replies; 21+ messages in thread
From: Jeff Liu @ 2013-12-31  9:36 UTC (permalink / raw)
  To: Zheng Liu; +Cc: clm, lsf-pc, linux-fsdevel

On 2013-12-31 16:49, Zheng Liu wrote:
> Hi Chris,
> 
> On Mon, Dec 30, 2013 at 09:36:20PM +0000, Chris Mason wrote:
>> Hi everyone,
>>
>> I'd like to attend the LSF/MM conference this year.  My current
>> discussion points include:
>>
>> All things Btrfs!
>>
>> Adding cgroups for more filesystem resources, especially to limit the
>> speed dirty pages are created.
> 
> Interesting.  If I remember correctly, IO-less dirty throttling has been
> applied into upstream kernel, which can limit the speed that dirty pages
> are created.  Does it has any defect?
Have you actually experimented with cgroups + IO-less throttling?

AFAICS, IO-less throttling aims to eliminate foreground writeback and provide
low-latency pauses in balance_dirty_pages(), but it is independent of cgroups.
It seems that cgroup-aware throttling is still under discussion at:
http://markmail.org/message/kdda272qiah77np4.

Thanks,
-Jeff
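
[ Illustration (editorial): to make the IO-less throttling mechanism above
concrete, here is a rough userspace model -- not kernel code, and all
structures and numbers are made up.  Instead of doing foreground writeback,
the dirtying task sleeps for a time derived from the bdi's measured
writeback bandwidth and how close we are to the dirty limit. ]

/* Userspace model of IO-less dirty throttling (illustrative only). */
#include <stdio.h>

/* Assumed inputs: measured writeback bandwidth and current dirty state. */
struct bdi_model {
	double write_bw_pages_s;   /* measured writeback bandwidth (pages/s) */
	long   nr_dirty;           /* pages currently dirty against this bdi */
	long   dirty_limit;        /* global dirty threshold (pages) */
};

/*
 * The dirtier is allowed to dirty at roughly the writeback bandwidth,
 * scaled down as we approach the dirty limit.  Instead of writing pages
 * itself, it sleeps long enough that dirtying matches cleaning.
 */
static double dirty_pause_ms(const struct bdi_model *bdi, long pages_dirtied)
{
	double pos_ratio = 1.0 - (double)bdi->nr_dirty / bdi->dirty_limit;

	if (pos_ratio < 0.01)
		pos_ratio = 0.01;                     /* never a zero ratelimit */

	double ratelimit = bdi->write_bw_pages_s * pos_ratio;  /* pages/s */
	return 1000.0 * pages_dirtied / ratelimit;             /* ms to sleep */
}

int main(void)
{
	struct bdi_model bdi = { 25600.0 /* ~100 MB/s */, 40000, 50000 };

	/* A task just dirtied 32 pages; how long should it pause? */
	printf("pause %.2f ms\n", dirty_pause_ms(&bdi, 32));
	return 0;
}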

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2013-12-31  8:49 ` Zheng Liu
  2013-12-31  9:36   ` Jeff Liu
@ 2013-12-31 12:45   ` Jan Kara
  2013-12-31 13:19     ` Chris Mason
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Kara @ 2013-12-31 12:45 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Chris Mason, linux-fsdevel, lsf-pc

On Tue 31-12-13 16:49:27, Zheng Liu wrote:
> Hi Chris,
> 
> On Mon, Dec 30, 2013 at 09:36:20PM +0000, Chris Mason wrote:
> > Hi everyone,
> > 
> > I'd like to attend the LSF/MM conference this year.  My current
> > discussion points include:
> > 
> > All things Btrfs!
> > 
> > Adding cgroups for more filesystem resources, especially to limit the
> > speed dirty pages are created.
> 
> Interesting.  If I remember correctly, IO-less dirty throttling has been
> applied into upstream kernel, which can limit the speed that dirty pages
> are created.  Does it has any defect?
  It works as it should. But as Jeff points out, the throttling isn't
cgroup aware. So it can happen that one memcg is full of dirty pages and
reclaim has trouble reclaiming pages for it. I guess what Chris is asking
for is that we watch the number of dirty pages in each memcg and throttle
processes creating dirty pages in any memcg which is close to its dirty
page limit.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
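
[ Illustration (editorial): a minimal sketch of the per-memcg variant Jan
describes above.  The per-memcg dirty limit is a hypothetical knob and none
of this is existing kernel code; the idea is simply to ramp up throttling
for a dirtier as its own memcg fills with dirty pages, independent of the
global dirty state. ]

/* Illustrative model: throttle based on the dirtying task's memcg. */
#include <stdio.h>

struct memcg_model {
	long nr_dirty;      /* dirty pages charged to this memcg */
	long dirty_limit;   /* per-memcg dirty page limit (hypothetical knob) */
};

/*
 * Return how aggressively to throttle a dirtier in this memcg:
 * 0 = not at all, 1 = block until writeback makes progress.
 */
static double memcg_dirty_pressure(const struct memcg_model *cg)
{
	double fill = (double)cg->nr_dirty / cg->dirty_limit;

	if (fill < 0.75)
		return 0.0;            /* plenty of headroom, don't throttle */
	if (fill >= 1.0)
		return 1.0;            /* at the limit, full throttling */
	return (fill - 0.75) / 0.25;   /* ramp up as we approach the limit */
}

int main(void)
{
	struct memcg_model cg = { .nr_dirty = 9000, .dirty_limit = 10000 };
	printf("throttle factor %.2f\n", memcg_dirty_pressure(&cg));
	return 0;
}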

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2013-12-31 12:45   ` [Lsf-pc] " Jan Kara
@ 2013-12-31 13:19     ` Chris Mason
  2013-12-31 14:22       ` Tao Ma
  2014-01-08 15:04       ` Mel Gorman
  0 siblings, 2 replies; 21+ messages in thread
From: Chris Mason @ 2013-12-31 13:19 UTC (permalink / raw)
  To: jack; +Cc: gnehzuil.liu, lsf-pc, linux-fsdevel

On Tue, 2013-12-31 at 13:45 +0100, Jan Kara wrote:
> On Tue 31-12-13 16:49:27, Zheng Liu wrote:
> > Hi Chris,
> > 
> > On Mon, Dec 30, 2013 at 09:36:20PM +0000, Chris Mason wrote:
> > > Hi everyone,
> > > 
> > > I'd like to attend the LSF/MM conference this year.  My current
> > > discussion points include:
> > > 
> > > All things Btrfs!
> > > 
> > > Adding cgroups for more filesystem resources, especially to limit the
> > > speed dirty pages are created.
> > 
> > Interesting.  If I remember correctly, IO-less dirty throttling has been
> > applied into upstream kernel, which can limit the speed that dirty pages
> > are created.  Does it has any defect?
>   It works as it should. But as Jeff points out, the throttling isn't
> cgroup aware. So it can happen that one memcg is full of dirty pages and
> reclaim has problems with reclaiming pages for it. I guess what Chris asks
> for is that we watch number of dirty pages in each memcg and throttle
> processes creating dirty pages in memcg which is close to its limit on
> dirty pages.

Right, the ioless dirty throttling is fantastic, but it's based on the
BDI and you only get one of those per device.

The current cgroup IO controller happens after we've decided to start
sending pages down.  From a buffered write point of view, this is
already too late.  If we delay the buffered IOs, the higher priority
tasks will just wait in balance_dirty_pages instead of waiting on the
drive.

So I'd like to throttle the rate at which dirty pages are created,
preferably based on the rates the BDI already calculates for how
quickly the device is doing IO.  This way we can limit dirty creation to
a percentage of the device's capacity under the current workload
(regardless of random vs buffered).

-chris
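
[ Illustration (editorial): a sketch of what Chris is asking for, with a
made-up per-cgroup percentage knob.  The allowed dirtying rate is expressed
as a share of the write bandwidth the BDI currently measures, so it tracks
what the device can actually sustain under the present workload rather
than a fixed number. ]

/* Sketch: cap a cgroup's dirtying rate at a % of measured BDI bandwidth. */
#include <stdio.h>

struct cgroup_dirty_limit {
	double bdi_write_bw;   /* bytes/s the BDI currently measures */
	double percent;        /* hypothetical per-cgroup knob, e.g. 20.0 */
	double bytes_dirtied;  /* bytes this cgroup dirtied in the window */
	double window_s;       /* length of the accounting window */
};

/* How long should a dirtier in this cgroup sleep to stay under its share? */
static double cgroup_dirty_pause_s(const struct cgroup_dirty_limit *c)
{
	double allowed  = c->bdi_write_bw * c->percent / 100.0;  /* bytes/s */
	double earliest = c->bytes_dirtied / allowed;  /* when quota is earned */
	double pause    = earliest - c->window_s;

	return pause > 0.0 ? pause : 0.0;
}

int main(void)
{
	/* 200 MB/s device, cgroup capped at 20%, dirtied 16 MB in 0.25 s. */
	struct cgroup_dirty_limit c = {
		.bdi_write_bw = 200e6, .percent = 20.0,
		.bytes_dirtied = 16e6, .window_s = 0.25,
	};
	printf("pause %.3f s\n", cgroup_dirty_pause_s(&c));
	return 0;
}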

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2013-12-31 13:19     ` Chris Mason
@ 2013-12-31 14:22       ` Tao Ma
  2013-12-31 15:34         ` Chris Mason
  2014-01-08 15:04       ` Mel Gorman
  1 sibling, 1 reply; 21+ messages in thread
From: Tao Ma @ 2013-12-31 14:22 UTC (permalink / raw)
  To: Chris Mason, jack; +Cc: gnehzuil.liu, lsf-pc, linux-fsdevel

Hi Chris,
On 12/31/2013 09:19 PM, Chris Mason wrote:
> On Tue, 2013-12-31 at 13:45 +0100, Jan Kara wrote:
>> On Tue 31-12-13 16:49:27, Zheng Liu wrote:
>>> Hi Chris,
>>>
>>> On Mon, Dec 30, 2013 at 09:36:20PM +0000, Chris Mason wrote:
>>>> Hi everyone,
>>>>
>>>> I'd like to attend the LSF/MM conference this year.  My current
>>>> discussion points include:
>>>>
>>>> All things Btrfs!
>>>>
>>>> Adding cgroups for more filesystem resources, especially to limit the
>>>> speed dirty pages are created.
>>>
>>> Interesting.  If I remember correctly, IO-less dirty throttling has been
>>> applied into upstream kernel, which can limit the speed that dirty pages
>>> are created.  Does it has any defect?
>>   It works as it should. But as Jeff points out, the throttling isn't
>> cgroup aware. So it can happen that one memcg is full of dirty pages and
>> reclaim has problems with reclaiming pages for it. I guess what Chris asks
>> for is that we watch number of dirty pages in each memcg and throttle
>> processes creating dirty pages in memcg which is close to its limit on
>> dirty pages.
> 
> Right, the ioless dirty throttling is fantastic, but it's based on the
> BDI and you only get one of those per device.
> 
> The current cgroup IO controller happens after we've decided to start
> sending pages down.  From a buffered write point of view, this is
> already too late.  If we delay the buffered IOs, the higher priority
> tasks will just wait in balance_dirty_pages instead of waiting on the
> drive.
> 
> So I'd like to throttle the rate at which dirty pages are created,
> preferably based on the rates currently calculated in the BDI of how
> quickly the device is doing IO.  This way we can limit dirty creation to
> a percentage of the disk capacity during the current workload
> (regardless of random vs buffered).
Fengguang has already done some work on this, but it seems that the
community doesn't have a consensus on where this control file should go.
You can look at this link: https://lkml.org/lkml/2011/4/4/205

Thanks,
Tao
> 
> -chris
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2013-12-31 14:22       ` Tao Ma
@ 2013-12-31 15:34         ` Chris Mason
  2014-01-02  6:46           ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Mason @ 2013-12-31 15:34 UTC (permalink / raw)
  To: tm; +Cc: jack, gnehzuil.liu, lsf-pc, linux-fsdevel

On Tue, 2013-12-31 at 22:22 +0800, Tao Ma wrote:
> Hi Chris,
> On 12/31/2013 09:19 PM, Chris Mason wrote:
>  
> > So I'd like to throttle the rate at which dirty pages are created,
> > preferably based on the rates currently calculated in the BDI of how
> > quickly the device is doing IO.  This way we can limit dirty creation to
> > a percentage of the disk capacity during the current workload
> > (regardless of random vs buffered).
> Fengguang had already done some work on this, but it seems that the
> community does't have a consensus on where this control file should go.
>  You can look at this link: https://lkml.org/lkml/2011/4/4/205

I had forgotten about Wu's patches here; they're very close to the starting
point I was hoping for.

-chris


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2013-12-31 15:34         ` Chris Mason
@ 2014-01-02  6:46           ` Jan Kara
  2014-01-02 15:21             ` Chris Mason
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2014-01-02  6:46 UTC (permalink / raw)
  To: Chris Mason; +Cc: tm, jack, gnehzuil.liu, lsf-pc, linux-fsdevel

On Tue 31-12-13 15:34:40, Chris Mason wrote:
> On Tue, 2013-12-31 at 22:22 +0800, Tao Ma wrote:
> > Hi Chris,
> > On 12/31/2013 09:19 PM, Chris Mason wrote:
> >  
> > > So I'd like to throttle the rate at which dirty pages are created,
> > > preferably based on the rates currently calculated in the BDI of how
> > > quickly the device is doing IO.  This way we can limit dirty creation to
> > > a percentage of the disk capacity during the current workload
> > > (regardless of random vs buffered).
> > Fengguang had already done some work on this, but it seems that the
> > community does't have a consensus on where this control file should go.
> >  You can look at this link: https://lkml.org/lkml/2011/4/4/205
> 
> I had forgotten Wu's patches here, it's very close to the starting point
> I was hoping for.
  I specifically don't like those patches because throttling the pagecache
dirty rate is IMHO a rather poor interface. What people want to do is to
limit IO from a container. That means reads & writes, buffered & direct IO.
So the dirty rate is just one of several things that contribute to the total
IO rate. When you have both direct IO & buffered IO happening in the
container they influence each other, so a dirty rate of 50 MB/s may be fine
when nothing else is going on in the container but far too much for the
system if there are heavy direct IO reads happening as well.

So you really need to tune the limit on the dirty rate depending on how
fast the writeback can happen (which is what the current IO-less throttling
does), not based on some hard throughput number like
50 MB/s (which is what Fengguang's patches did if I remember right).

What could work a tad bit better (and that seems to be something you are
proposing) is to have a weight for each memcg, and each memcg would be
allowed to dirty at a rate proportional to its weight * writeback
throughput. But this still has a couple of problems:
1) This doesn't take into account the local situation in a memcg - for a
   memcg full of dirty pages you want to throttle dirtying much more than
   for a memcg which has no dirty pages.
2) The flusher thread (or workqueue these days) doesn't know anything about
   memcgs. So it can happily flush a memcg which is relatively OK for a
   rather long time while some other memcg is full of dirty pages and
   struggling to make any progress.
3) This will be somewhat unfair since the total IO allowed to happen from a
   container will depend on whether you are doing only reads (or DIO), only
   writes, or both reads & writes.

In an ideal world you could compute the writeback throughput for each memcg
(and writeback from a memcg would be accounted in a proper blkcg - we would
need a unified memcg & blkcg hierarchy for that), take into account the
number of dirty pages in each memcg, and compute the dirty rate according to
these two numbers. But whether this can work in practice heavily depends on
the memcg size and how smooth / fair the writeback from different memcgs can
be, so that we don't have excessive stalls and throughput estimation
errors...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
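
[ Illustration (editorial): a sketch of the weight-proportional scheme,
including a crude answer to Jan's point 1) -- each memcg gets a share of the
measured writeback throughput proportional to its weight, scaled down by
how full of dirty pages that memcg already is.  All structures and numbers
are hypothetical. ]

/* Sketch: weight-proportional per-memcg dirty rates, scaled by fullness. */
#include <stdio.h>

struct memcg {
	double weight;       /* configured weight of this memcg */
	long   nr_dirty;     /* dirty pages currently charged to it */
	long   memory_limit; /* memcg memory limit in pages */
};

/*
 * rate_i = (weight_i / total_weight) * writeback_bw * (1 - dirty_fill_i)
 * The last factor addresses Jan's point 1): a memcg already full of dirty
 * pages gets throttled much harder than one with no dirty pages.
 */
static double memcg_dirty_rate(const struct memcg *cg, double total_weight,
			       double writeback_bw_pages_s)
{
	double share = cg->weight / total_weight;
	double fill  = (double)cg->nr_dirty / cg->memory_limit;

	if (fill > 1.0)
		fill = 1.0;
	return share * writeback_bw_pages_s * (1.0 - fill);
}

int main(void)
{
	struct memcg a = { .weight = 100, .nr_dirty = 100,   .memory_limit = 25600 };
	struct memcg b = { .weight = 100, .nr_dirty = 20000, .memory_limit = 25600 };
	double total_weight = a.weight + b.weight;
	double bw = 25600.0;   /* ~100 MB/s of writeback, in pages/s */

	printf("a: %.0f pages/s, b: %.0f pages/s\n",
	       memcg_dirty_rate(&a, total_weight, bw),
	       memcg_dirty_rate(&b, total_weight, bw));
	return 0;
}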

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-02  6:46           ` Jan Kara
@ 2014-01-02 15:21             ` Chris Mason
  2014-01-02 16:01               ` tj
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Mason @ 2014-01-02 15:21 UTC (permalink / raw)
  To: jack; +Cc: vgoyal, tj, lizefan, gnehzuil.liu, tm, lsf-pc, linux-fsdevel

On Thu, 2014-01-02 at 07:46 +0100, Jan Kara wrote:
> On Tue 31-12-13 15:34:40, Chris Mason wrote:
> > On Tue, 2013-12-31 at 22:22 +0800, Tao Ma wrote:
> > > Hi Chris,
> > > On 12/31/2013 09:19 PM, Chris Mason wrote:
> > >  
> > > > So I'd like to throttle the rate at which dirty pages are created,
> > > > preferably based on the rates currently calculated in the BDI of how
> > > > quickly the device is doing IO.  This way we can limit dirty creation to
> > > > a percentage of the disk capacity during the current workload
> > > > (regardless of random vs buffered).
> > > Fengguang had already done some work on this, but it seems that the
> > > community does't have a consensus on where this control file should go.
> > >  You can look at this link: https://lkml.org/lkml/2011/4/4/205
> > 
> > I had forgotten Wu's patches here, it's very close to the starting point
> > I was hoping for.
>   I specifically don't like those patches because throttling pagecache
> dirty rate is IMHO rather poor interface. What people want to do is to
> limit IO from a container. That means reads & writes, buffered & direct IO.
> So dirty rate is just a one of several things which contributes to total IO
> rate. When you have both direct IO & buffered IO happening in the container
> they influence each other so dirty rate 50 MB/s may be fine when nothing
> else is going on in the container but may be far to much for the system if
> there are heavy direct IO reads happening as well.
> 
> So you really need to tune the limit on the dirty rate depending on how
> fast the writeback can happen (which is what current IO-less throttling
> does), not based on some hard throughput number like
> 50 MB/s (which is what Fengguang's patches did if I remember right).
> 
> What could work a tad bit better (and that seems to be something you are
> proposing) is to have a weight for each memcg and each memcg would be
> allowed to dirty at a rate proportional to its weight * writeback
> throughput. But this still has a couple of problems:
> 1) This doesn't take into account local situation in a memcg - for memcg
>    full of dirty pages you want to throttle dirtying much more than for a
>    memcg which has no dirty pages.
> 2) Flusher thread (or workqueue these days) doesn't know anything about
>    memcgs. So it can happily flush a memcg which is relatively OK for a
>    rather long time while some other memcg is full of dirty pages and
>    struggling to do any progress.
> 3) This will be somewhat unfair since the total IO allowed to happen from a
>    container will depend on whether you are doing only reads (or DIO), only
>    writes or both reads & writes.
> 
> In an ideal world you could compute writeback throughput for each memcg
> (and writeback from a memcg would be accounted in a proper blkcg - we would
> need unified memcg & blkcg hieararchy for that), take into account number of
> dirty pages in each memcg, and compute dirty rate according to these two
> numbers. But whether this can work in practice heavily depends on the memcg
> size and how smooth / fair can the writeback from different memcgs be so
> that we don't have excessive stalls and throughput estimation errors...

[ Adding Tejun, Vivek and Li from another thread ]

I do agree that a basket of knobs is confusing and it doesn't really
help the admin.

My first idea was a complex system where the controller in the block
layer and the BDI flushers all communicated about current usage and
cooperated on a single set of reader/writer rates.  I think it could
work, but it'll be fragile.

But there are a limited number of non-pagecache methods to do IO.  Why
not just push the accounting and throttling for O_DIRECT into a new
BDI-based controller?  Tejun was just telling me how he'd rather fix the
existing controllers than add a new one, but I think we can have a much
better admin experience by having a single entry point based on BDIs.

-chris
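
[ Illustration (editorial): a sketch of the "single entry point" idea -- no
such interface exists, this is only a model.  Buffered dirtying and O_DIRECT
submission charge the same per-cgroup, per-BDI budget, so one configured
rate covers both paths instead of a basket of knobs. ]

/* Sketch: one per-cgroup/per-BDI budget charged by all IO entry points. */
#include <stdio.h>

struct bdi_cg_budget {
	double rate_bytes_s;   /* configured per-cgroup rate on this BDI */
	double charged;        /* bytes charged in the current window */
	double window_s;       /* accounting window length */
};

/* Common accounting hook; returns how long the caller should wait. */
static double bdi_cg_charge(struct bdi_cg_budget *b, double bytes)
{
	b->charged += bytes;
	double earliest = b->charged / b->rate_bytes_s;
	double wait = earliest - b->window_s;
	return wait > 0.0 ? wait : 0.0;
}

/* Buffered writes charge when pages are dirtied... */
static double account_dirty_pages(struct bdi_cg_budget *b, long pages)
{
	return bdi_cg_charge(b, pages * 4096.0);
}

/* ...and O_DIRECT charges at submission time, against the same budget. */
static double account_direct_io(struct bdi_cg_budget *b, double bytes)
{
	return bdi_cg_charge(b, bytes);
}

int main(void)
{
	struct bdi_cg_budget b = { .rate_bytes_s = 50e6, .window_s = 1.0 };
	printf("buffered wait %.3f s\n", account_dirty_pages(&b, 8192));
	printf("direct   wait %.3f s\n", account_direct_io(&b, 32e6));
	return 0;
}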


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-02 15:21             ` Chris Mason
@ 2014-01-02 16:01               ` tj
  2014-01-02 16:14                 ` tj
                                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: tj @ 2014-01-02 16:01 UTC (permalink / raw)
  To: Chris Mason
  Cc: jack, vgoyal, lizefan, gnehzuil.liu, tm, lsf-pc, linux-fsdevel

Hello, Chris, Jan.

On Thu, Jan 02, 2014 at 03:21:15PM +0000, Chris Mason wrote:
> On Thu, 2014-01-02 at 07:46 +0100, Jan Kara wrote:
> > In an ideal world you could compute writeback throughput for each memcg
> > (and writeback from a memcg would be accounted in a proper blkcg - we would
> > need unified memcg & blkcg hieararchy for that), take into account number of
> > dirty pages in each memcg, and compute dirty rate according to these two
> > numbers. But whether this can work in practice heavily depends on the memcg
> > size and how smooth / fair can the writeback from different memcgs be so
> > that we don't have excessive stalls and throughput estimation errors...
> 
> [ Adding Tejun, Vivek and Li from another thread ]
> 
> I do agree that a basket of knobs is confusing and it doesn't really
> help the admin.
> 
> My first idea was a complex system where the controller in the block
> layer and the BDI flushers all communicated about current usage and
> cooperated on a single set of reader/writer rates.  I think it could
> work, but it'll be fragile.

One thing I do agree on is that the bdi would have to play some role.

> But there are a limited number of non-pagecache methods to do IO.  Why
> not just push the accounting and throttling for O_DIRECT into a new BDI
> controller idea?  Tejun was just telling me how he'd rather fix the
> existing controllers than add a new one, but I think we can have a much
> better admin experience by having a having a single entry point based on
> BDIs.

But if we have to make bdis blkcg-aware, I think the better way to
do it is to split them per cgroup.  That's what's being done in the lower
layer anyway.  We split request queues into multiple queues according to
cgroup configuration.  Things which can affect request issue and
completion, such as request allocation, are also split, and each such
split queue is used for resource provisioning.

What we're missing is a way to make such a split visible in the upper
layers for writeback.  It seems rather clear to me that that's the
right way to approach the problem rather than implementing separate
control for writebacks and somehow coordinating that with the rest.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-02 16:01               ` tj
@ 2014-01-02 16:14                 ` tj
  2014-01-03  6:03                   ` Jan Kara
  2014-01-02 17:06                 ` Vivek Goyal
  2014-01-02 18:27                 ` James Bottomley
  2 siblings, 1 reply; 21+ messages in thread
From: tj @ 2014-01-02 16:14 UTC (permalink / raw)
  To: Chris Mason
  Cc: jack, vgoyal, lizefan, gnehzuil.liu, tm, lsf-pc, linux-fsdevel

Hey, again.

On Thu, Jan 02, 2014 at 11:01:02AM -0500, tj@kernel.org wrote:
> What we're missing is a way to make such split visible in the upper
> layers for writeback.  It seems rather clear to me that that's the
> right way to approach the problem rather than implementing separate
> control for writebacks and somehow coordinate that with the rest.

To clarify a bit: I think what we need to do is split bdis for
each active blkcg (at least the part which is relevant to propagating
IO pressure upwards).  I really don't think a scheme where we try to
somehow split a bandwidth number between two separate enforcing
mechanisms is something we should go after.

Thanks.

-- 
tejun
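
[ Illustration (editorial): conceptually, the split Tejun describes could
look like the toy data structure below -- this is not the actual kernel
layout.  The bdi keeps one writeback/congestion state per active blkcg
instead of a single shared one, so upper layers can see IO pressure for
their own cgroup. ]

/* Sketch: per-blkcg writeback state hung off one backing device. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_CGROUPS 4   /* toy bound; the real thing would be dynamic */

struct cg_writeback {            /* one per (bdi, blkcg) pair */
	int    blkcg_id;
	long   nr_dirty;         /* dirty pages owned by this cgroup */
	double write_bw;         /* bandwidth this cgroup is getting */
	bool   congested;        /* congestion state seen by this cgroup */
};

struct backing_dev {             /* the bdi itself */
	const char *name;
	struct cg_writeback wb[MAX_CGROUPS];
};

/* Upper layers ask about congestion for *their* cgroup, not the device. */
static bool bdi_cg_congested(const struct backing_dev *bdi, int blkcg_id)
{
	for (int i = 0; i < MAX_CGROUPS; i++)
		if (bdi->wb[i].blkcg_id == blkcg_id)
			return bdi->wb[i].congested;
	return false;
}

int main(void)
{
	struct backing_dev bdi = {
		.name = "sda",
		.wb = { { 1, 12000, 20e6, true }, { 2, 300, 80e6, false } },
	};
	printf("cg1 congested: %d, cg2 congested: %d\n",
	       bdi_cg_congested(&bdi, 1), bdi_cg_congested(&bdi, 2));
	return 0;
}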

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-02 16:01               ` tj
  2014-01-02 16:14                 ` tj
@ 2014-01-02 17:06                 ` Vivek Goyal
  2014-01-02 17:10                   ` tj
  2014-01-02 18:27                 ` James Bottomley
  2 siblings, 1 reply; 21+ messages in thread
From: Vivek Goyal @ 2014-01-02 17:06 UTC (permalink / raw)
  To: tj; +Cc: Chris Mason, jack, lizefan, gnehzuil.liu, tm, lsf-pc, linux-fsdevel

On Thu, Jan 02, 2014 at 11:01:02AM -0500, tj@kernel.org wrote:
> Hello, Chris, Jan.
> 
> On Thu, Jan 02, 2014 at 03:21:15PM +0000, Chris Mason wrote:
> > On Thu, 2014-01-02 at 07:46 +0100, Jan Kara wrote:
> > > In an ideal world you could compute writeback throughput for each memcg
> > > (and writeback from a memcg would be accounted in a proper blkcg - we would
> > > need unified memcg & blkcg hieararchy for that), take into account number of
> > > dirty pages in each memcg, and compute dirty rate according to these two
> > > numbers. But whether this can work in practice heavily depends on the memcg
> > > size and how smooth / fair can the writeback from different memcgs be so
> > > that we don't have excessive stalls and throughput estimation errors...
> > 
> > [ Adding Tejun, Vivek and Li from another thread ]
> > 
> > I do agree that a basket of knobs is confusing and it doesn't really
> > help the admin.
> > 
> > My first idea was a complex system where the controller in the block
> > layer and the BDI flushers all communicated about current usage and
> > cooperated on a single set of reader/writer rates.  I think it could
> > work, but it'll be fragile.
> 
> One thing I do agree is that bdi would have to play some role.

So is this a separate configuration which can be done per bdi as opposed
to per device? IOW, throttling offered per cgroup per bdi. This will
help with the case of throttling over NFS too, which some people have
been asking for.

> 
> > But there are a limited number of non-pagecache methods to do IO.  Why
> > not just push the accounting and throttling for O_DIRECT into a new BDI
> > controller idea?  Tejun was just telling me how he'd rather fix the
> > existing controllers than add a new one, but I think we can have a much
> > better admin experience by having a having a single entry point based on
> > BDIs.
> 
> But if we'll have to make bdis blkcg-aware, I think the better way to
> do is splitting it per cgroup.  That's what's being don in the lower
> layer anyway.  We split request queues to multiple queues according to
> cgroup configuration.  Things which can affect request issue and
> completion, such as request allocation, are also split and each such
> split queue is used for resource provisioning.
> 

So it sounds like re-implementing the throttling infrastructure at the bdi
level now (similar to what has been done at the device level)? Of course,
reuse as much code as possible. But IIUC, the proposal is that there can
effectively be two throttling controllers: one operating at the bdi level
and one operating below it at the device level?

And then writeback logic needs to be modified so that it can calculate
per bdi per cgroup throughput (as opposed to bdi throughput only) and
throttle writeback accordingly.

Chris, in the past multiple implementations were proposed for a separate
knob for writeback. A separate knob was frowned upon, so I also proposed
an implementation which throttles tasks in balance_dirty_pages()
based on the write limit configured on the device.

https://lkml.org/lkml/2011/6/28/243

This approach assumed, though, that there is an associated block device.

Building back pressure from the device/bdi and adjusting writeback accordingly
(by making it cgroup aware) is complicated, but once it works, I guess it
might turn out to be a reasonable approach in the long term.

Thanks
Vivek
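
[ Illustration (editorial): a model of the back-pressure idea in Vivek's
last paragraph -- purely illustrative.  Writeback completions for a cgroup
on a bdi feed a smoothed bandwidth estimate, and that estimate becomes the
rate at which the cgroup may dirty new pages, so throttling applied below
propagates upward automatically. ]

/* Sketch: writeback completions -> per-(bdi,cgroup) bandwidth estimate
 * -> dirty ratelimit for that cgroup.  Illustrative only. */
#include <stdio.h>

struct cg_bdi_stats {
	double est_bw;   /* smoothed writeback bandwidth (bytes/s) */
};

/* Called when 'bytes' of this cgroup's dirty data finished writeback
 * over 'elapsed' seconds; keeps an exponentially weighted estimate. */
static void writeback_completed(struct cg_bdi_stats *s, double bytes,
				double elapsed)
{
	double sample = bytes / elapsed;
	s->est_bw = s->est_bw ? (7 * s->est_bw + sample) / 8 : sample;
}

/* Dirtiers in this cgroup are paced to the rate writeback sustains. */
static double dirty_ratelimit(const struct cg_bdi_stats *s)
{
	return s->est_bw;
}

int main(void)
{
	struct cg_bdi_stats s = { 0 };

	/* The IO controller below lets ~10 MB/s of this cgroup through. */
	for (int i = 0; i < 16; i++)
		writeback_completed(&s, 10e6, 1.0);

	printf("dirty ratelimit ~%.1f MB/s\n", dirty_ratelimit(&s) / 1e6);
	return 0;
}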

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-02 17:06                 ` Vivek Goyal
@ 2014-01-02 17:10                   ` tj
  2014-01-02 19:11                     ` Chris Mason
  0 siblings, 1 reply; 21+ messages in thread
From: tj @ 2014-01-02 17:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Chris Mason, jack, lizefan, gnehzuil.liu, tm, lsf-pc, linux-fsdevel

Hey, Vivek.

On Thu, Jan 02, 2014 at 12:06:37PM -0500, Vivek Goyal wrote:
> So is this a separate configuration which can be done per bdi as opposed
> to per device? IOW throttling offered per per cgroup per bdi. This will
> help with the case of throttling over NFS too, which some people have
> been asking for.

Hah? No, bdi just being split per-cgroup on each device so that it can
properly propagate congestion upwards per-blkcg, just like how we
split request allocation per-cgroup in the block layer proper.

> So it sounds like re-implementing throttling infrastructure at bdi level
> now (Similar to what has been done at device level)? Of course use as
> much code as possible. But IIUC, proposal is that effectively there will
> can be two throttling controllers. One operating at bdi level and one
> operating below it at device level?

Not at all.  I was arguing explicitly against that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-02 16:01               ` tj
  2014-01-02 16:14                 ` tj
  2014-01-02 17:06                 ` Vivek Goyal
@ 2014-01-02 18:27                 ` James Bottomley
  2014-01-02 18:36                   ` tj
  2 siblings, 1 reply; 21+ messages in thread
From: James Bottomley @ 2014-01-02 18:27 UTC (permalink / raw)
  To: tj
  Cc: Chris Mason, jack, vgoyal, lizefan, gnehzuil.liu, tm, lsf-pc,
	linux-fsdevel

On Thu, 2014-01-02 at 11:01 -0500, tj@kernel.org wrote:
> On Thu, Jan 02, 2014 at 03:21:15PM +0000, Chris Mason wrote:
> > But there are a limited number of non-pagecache methods to do IO.  Why
> > not just push the accounting and throttling for O_DIRECT into a new BDI
> > controller idea?  Tejun was just telling me how he'd rather fix the
> > existing controllers than add a new one, but I think we can have a much
> > better admin experience by having a having a single entry point based on
> > BDIs.
> 
> But if we'll have to make bdis blkcg-aware, I think the better way to
> do is splitting it per cgroup.  That's what's being don in the lower
> layer anyway.  We split request queues to multiple queues according to
> cgroup configuration.  Things which can affect request issue and
> completion, such as request allocation, are also split and each such
> split queue is used for resource provisioning.

This seems like the better way to me as well: per-bdi cgroup queues
mirror fairly exactly the work we had to do in the zones for kernel
memory management.

> What we're missing is a way to make such split visible in the upper
> layers for writeback.  It seems rather clear to me that that's the
> right way to approach the problem rather than implementing separate
> control for writebacks and somehow coordinate that with the rest.

Well, you know, since bdis are block device tied, there's a natural
question if this can be a similar (or identical) control plane to the
one Oren is proposing for the device namespace.  I know you've never
really liked the idea, but this is pushing us down that path.

Perhaps what we should do is a half day on cgroups before the main LSF
(so in collab summit time, or just in the pub the night before) ... I'm
not sure all our audience are cgroup aware ...

James



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-02 18:27                 ` James Bottomley
@ 2014-01-02 18:36                   ` tj
  2014-01-03  7:44                     ` James Bottomley
  0 siblings, 1 reply; 21+ messages in thread
From: tj @ 2014-01-02 18:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: Chris Mason, jack, vgoyal, lizefan, gnehzuil.liu, tm, lsf-pc,
	linux-fsdevel

Hey, James.

On Thu, Jan 02, 2014 at 10:27:18AM -0800, James Bottomley wrote:
> Well, you know, since bdis are block device tied, there's a natural
> question if this can be a similar (or identical) control plane to the
> one Oren is proposing for the device namespace.  I know you've never
> really liked the idea, but this is pushing us down that path.

The reason I'm reluctant about Oren's proposal is not about where or
how it'll be implemented but about whether it's something we want to
have at all.  The proposed use case seemed exceedingly niche and
transient to me, which is not to say that the use case shouldn't be
supported but more that it probably should be implemented in a way
which is a lot less intrusive even if that means taking compromises
elsewhere (for example, userland basesystem might not experience full
transparency).

> Perhaps what we should do is a half day on cgroups before the main LSF
> (so in collab summit time, or just in the pub the night before) ... I'm
> not sure all our audience are cgroup aware ...

I think a single slot should suffice.  Talking longer doesn't
necessarily seem to lead to something actually useful.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-02 17:10                   ` tj
@ 2014-01-02 19:11                     ` Chris Mason
  2014-01-03  6:39                       ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Mason @ 2014-01-02 19:11 UTC (permalink / raw)
  To: tj; +Cc: vgoyal, lizefan, jack, gnehzuil.liu, tm, lsf-pc, linux-fsdevel

On Thu, 2014-01-02 at 12:10 -0500, tj@kernel.org wrote:
> Hey, Vivek.
> 
> On Thu, Jan 02, 2014 at 12:06:37PM -0500, Vivek Goyal wrote:
> > So is this a separate configuration which can be done per bdi as opposed
> > to per device? IOW throttling offered per per cgroup per bdi. This will
> > help with the case of throttling over NFS too, which some people have
> > been asking for.
> 
> Hah? No, bdi just being split per-cgroup on each device so that it can
> properly propagate congestion upwards per-blkcg, just like how we
> split request allocation per-cgroup in the block layer proper.
> 

I'm not entirely sure how well this will fit with the filesystems (which
already expect a single BDI), but it's worth trying.  I'm definitely
worried about having too many blkcgs and over-committing the dirty
memory limits.  The BDI-per-device setup we have now already has that
problem, but I'm not sure it's a good idea to make it worse.

> > So it sounds like re-implementing throttling infrastructure at bdi level
> > now (Similar to what has been done at device level)? Of course use as
> > much code as possible. But IIUC, proposal is that effectively there will
> > can be two throttling controllers. One operating at bdi level and one
> > operating below it at device level?
> 
> Not at all.  I was arguing explicitly against that.

Keep in mind that we do already have throttling at the BDI level.  I was
definitely hoping we could consolidate some of that since it has grown
some bits of an elevator already.  I'm not saying what we have in the
bdi throttling isn't reasonable, but it is definitely replicating some
of the infrastructure down below.

We (Josef, me, perhaps others at FB) will do some experiments before
LSF.  The current plan is to throw some code at the wall and see what
sticks.

-chris


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-02 16:14                 ` tj
@ 2014-01-03  6:03                   ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2014-01-03  6:03 UTC (permalink / raw)
  To: tj
  Cc: Chris Mason, jack, vgoyal, lizefan, gnehzuil.liu, tm, lsf-pc,
	linux-fsdevel

On Thu 02-01-14 11:14:06, tj@kernel.org wrote:
> Hey, again.
> 
> On Thu, Jan 02, 2014 at 11:01:02AM -0500, tj@kernel.org wrote:
> > What we're missing is a way to make such split visible in the upper
> > layers for writeback.  It seems rather clear to me that that's the
> > right way to approach the problem rather than implementing separate
> > control for writebacks and somehow coordinate that with the rest.
> 
> To clarify a bit.  I think what we need to do is splitting bdi's for
> each active blkcg (at least the part which is relevant to propagating
> io pressure upwards).
  Well, the pressure is currently propagated by the fact that we block on
request allocation while writing back stuff... So this would mean splitting
bdi flusher workqueue per blkcg.

> I really don't think a scheme where we try to somehow split bandwidth
> number between two separate enforcing mechanisms is something we should
> go after.
  Agreed.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-02 19:11                     ` Chris Mason
@ 2014-01-03  6:39                       ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2014-01-03  6:39 UTC (permalink / raw)
  To: Chris Mason
  Cc: tj, vgoyal, lizefan, jack, gnehzuil.liu, tm, lsf-pc, linux-fsdevel

On Thu 02-01-14 19:11:44, Chris Mason wrote:
> On Thu, 2014-01-02 at 12:10 -0500, tj@kernel.org wrote:
> > Hey, Vivek.
> > 
> > On Thu, Jan 02, 2014 at 12:06:37PM -0500, Vivek Goyal wrote:
> > > So is this a separate configuration which can be done per bdi as opposed
> > > to per device? IOW throttling offered per per cgroup per bdi. This will
> > > help with the case of throttling over NFS too, which some people have
> > > been asking for.
> > 
> > Hah? No, bdi just being split per-cgroup on each device so that it can
> > properly propagate congestion upwards per-blkcg, just like how we
> > split request allocation per-cgroup in the block layer proper.
> > 
> 
> I'm not entirely sure how well this will fit with the filesystems (who
> already expect a single BDI), but it's worth trying.  I'm definitely
> worried about having too many blkcgs and over-committing the dirty memory
> limits.  BDI-per device setup we have now already has that problem, but
> I'm not sure its a good idea to make it worse.
  Well, we already have BDIs not tied to any particular device - e.g. in
NFS or in btrfs (I guess I don't have to tell you ;). In more complex
storage situations we seem to follow the rule of one BDI per filesystem
instance, as that's what makes sense for writeback and most of the other
stuff in the BDI. And we definitely want to split this "high-level" BDI
because that's the only thing which makes sense for writeback (you cannot
really tell whether e.g. a dirty inode belongs to sda or sdb in a RAID0
setting).

> > > So it sounds like re-implementing throttling infrastructure at bdi level
> > > now (Similar to what has been done at device level)? Of course use as
> > > much code as possible. But IIUC, proposal is that effectively there will
> > > can be two throttling controllers. One operating at bdi level and one
> > > operating below it at device level?
> > 
> > Not at all.  I was arguing explicitly against that.
> 
> Keep in mind that we do already have throttling at the BDI level.  I was
> definitely hoping we could consolidate some of that since it has grown
> some bits of an elevator already.  I'm not saying what we have in the
> bdi throttling isn't reasonable, but it is definitely replicating some
> of the infrastructure down below.
  What exactly are you talking about? Dirty throttling or something else?

> We'll (Josef, me, perhaps others at FB) will do some experiments before
> LSF.  The current plan is to throw some code at the wall and see what
> sticks.
  That would be great. People have been talking about this on and off for
at least three years, and some patches were posted, but so far no one has
been persistent enough to get anything merged (to be fair, at least we
have explored some dead ends where we decided that it's not worth the
hassle after we've seen the code / considered the API).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-02 18:36                   ` tj
@ 2014-01-03  7:44                     ` James Bottomley
  0 siblings, 0 replies; 21+ messages in thread
From: James Bottomley @ 2014-01-03  7:44 UTC (permalink / raw)
  To: tj
  Cc: jack, gnehzuil.liu, Chris Mason, lizefan, linux-fsdevel, tm,
	lsf-pc, vgoyal

On Thu, 2014-01-02 at 13:36 -0500, tj@kernel.org wrote:
> Hey, James.
> 
> On Thu, Jan 02, 2014 at 10:27:18AM -0800, James Bottomley wrote:
> > Well, you know, since bdis are block device tied, there's a natural
> > question if this can be a similar (or identical) control plane to the
> > one Oren is proposing for the device namespace.  I know you've never
> > really liked the idea, but this is pushing us down that path.
> 
> The reason I'm reluctant about Oren's proposal is not about where or
> how it'll be implemented but about whether it's something we want to
> have at all.  The proposed use case seemed exceedingly niche and
> transient to me, which is not to say that the use case shouldn't be
> supported but more that it probably should be implemented in a way
> which is a lot less intrusive even if that means taking compromises
> elsewhere (for example, userland basesystem might not experience full
> transparency).

Don't disagree.  My thought was to go from the use cases ... i.e. the
control planes we would need for device namespaces and per-bdi cgroups
and see if anything consistent emerges.  If it does, we still don't have
to implement it, but we know how to if the need arises.

> > Perhaps what we should do is a half day on cgroups before the main LSF
> > (so in collab summit time, or just in the pub the night before) ... I'm
> > not sure all our audience are cgroup aware ...
> 
> I think a single slot should suffice.  Talking longer doesn't
> necessarily seem to lead to something actually useful.

Probably OK ... it would have to be a plenary, though, since this covers
fs and io and most of the cgroup interested people are in mm.

James



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2013-12-31 13:19     ` Chris Mason
  2013-12-31 14:22       ` Tao Ma
@ 2014-01-08 15:04       ` Mel Gorman
  2014-01-08 16:14         ` Chris Mason
  1 sibling, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2014-01-08 15:04 UTC (permalink / raw)
  To: Chris Mason; +Cc: jack, linux-fsdevel, lsf-pc, gnehzuil.liu

On Tue, Dec 31, 2013 at 01:19:15PM +0000, Chris Mason wrote:
> On Tue, 2013-12-31 at 13:45 +0100, Jan Kara wrote:
> > On Tue 31-12-13 16:49:27, Zheng Liu wrote:
> > > Hi Chris,
> > > 
> > > On Mon, Dec 30, 2013 at 09:36:20PM +0000, Chris Mason wrote:
> > > > Hi everyone,
> > > > 
> > > > I'd like to attend the LSF/MM conference this year.  My current
> > > > discussion points include:
> > > > 
> > > > All things Btrfs!
> > > > 
> > > > Adding cgroups for more filesystem resources, especially to limit the
> > > > speed dirty pages are created.
> > > 
> > > Interesting.  If I remember correctly, IO-less dirty throttling has been
> > > applied into upstream kernel, which can limit the speed that dirty pages
> > > are created.  Does it has any defect?
> >   It works as it should. But as Jeff points out, the throttling isn't
> > cgroup aware. So it can happen that one memcg is full of dirty pages and
> > reclaim has problems with reclaiming pages for it. I guess what Chris asks
> > for is that we watch number of dirty pages in each memcg and throttle
> > processes creating dirty pages in memcg which is close to its limit on
> > dirty pages.
> 
> Right, the ioless dirty throttling is fantastic, but it's based on the
> BDI and you only get one of those per device.
> 

It's only partially related, but we'll also need to keep in mind that
even with IO-less dirty throttling, dirty_ratio and dirty_bytes have
been showing their age for a long time.  dirty_ratio was fine when 20%
of memory was still a few seconds of IO, but that has not been the case
in a long time. dirty_bytes is also not a great interface because it
ignores the speed of the underlying device. While proposals to fix it
have been raised in the past, no one (including me) has put themselves on
the firing line to replace that interface with something like dirty_time
-- do not dirty more pages than it takes N seconds to write back. When/if
someone clears their table sufficiently to tackle that problem they are
likely to collide with any IO controller work.

-- 
Mel Gorman
SUSE Labs
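
[ Illustration (editorial): Mel's hypothetical dirty_time knob amounts to a
calculation like the one below -- no such sysctl exists.  The dirty limit
is sized from the measured writeback bandwidth so that dirty data never
represents more than N seconds of IO. ]

/* Sketch of a "dirty_time" style limit: N seconds worth of writeback. */
#include <stdio.h>

#define PAGE_SIZE 4096

/* Allow at most 'dirty_time_s' seconds of dirty data at the current
 * measured writeback bandwidth (bytes/s). */
static long dirty_limit_pages(double write_bw_bytes_s, double dirty_time_s)
{
	return (long)(write_bw_bytes_s * dirty_time_s / PAGE_SIZE);
}

int main(void)
{
	/* 20% of 64 GB is ~13 GB of dirty data -- minutes of IO on a disk
	 * doing 120 MB/s.  A 3-second dirty_time gives a very different
	 * number that tracks the device instead of RAM size. */
	printf("dirty_ratio-ish limit: %ld pages\n",
	       (long)(0.20 * 64e9 / PAGE_SIZE));
	printf("dirty_time=3s  limit: %ld pages\n",
	       dirty_limit_pages(120e6, 3.0));
	return 0;
}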

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
  2014-01-08 15:04       ` Mel Gorman
@ 2014-01-08 16:14         ` Chris Mason
  0 siblings, 0 replies; 21+ messages in thread
From: Chris Mason @ 2014-01-08 16:14 UTC (permalink / raw)
  To: mgorman; +Cc: jack, gnehzuil.liu, lsf-pc, linux-fsdevel

On Wed, 2014-01-08 at 15:04 +0000, Mel Gorman wrote:
> On Tue, Dec 31, 2013 at 01:19:15PM +0000, Chris Mason wrote:
> > On Tue, 2013-12-31 at 13:45 +0100, Jan Kara wrote:
> > > On Tue 31-12-13 16:49:27, Zheng Liu wrote:
> > > > Hi Chris,
> > > > 
> > > > On Mon, Dec 30, 2013 at 09:36:20PM +0000, Chris Mason wrote:
> > > > > Hi everyone,
> > > > > 
> > > > > I'd like to attend the LSF/MM conference this year.  My current
> > > > > discussion points include:
> > > > > 
> > > > > All things Btrfs!
> > > > > 
> > > > > Adding cgroups for more filesystem resources, especially to limit the
> > > > > speed dirty pages are created.
> > > > 
> > > > Interesting.  If I remember correctly, IO-less dirty throttling has been
> > > > applied into upstream kernel, which can limit the speed that dirty pages
> > > > are created.  Does it has any defect?
> > >   It works as it should. But as Jeff points out, the throttling isn't
> > > cgroup aware. So it can happen that one memcg is full of dirty pages and
> > > reclaim has problems with reclaiming pages for it. I guess what Chris asks
> > > for is that we watch number of dirty pages in each memcg and throttle
> > > processes creating dirty pages in memcg which is close to its limit on
> > > dirty pages.
> > 
> > Right, the ioless dirty throttling is fantastic, but it's based on the
> > BDI and you only get one of those per device.
> > 
> 
> It's only partially related but we'll also need to keep in mind that
> even with ioless dirty throttling that dirty_ratio and dirty_bytes have
> been showing their age for a long time.  dirty_ratio was fine when 20%
> of memory was still a few seconds of IO but it has not been the case in a
> long time. dirty_bytes is also not a great interface because it ignores the
> speed of the underlying device. While proposals to fix it have been raised
> in the past, no one (including me) has put themselves on the firing line
> to replace that interface with something like dirty_time -- do not dirty
> more pages than it takes N seconds to writeback. When/if someone clears
> their table sufficiently to tackle that problem they are likely to collide
> with any IO controller work.
> 

I'm really hoping to combine these ideas a bit.  There's a ton of
overlap between the BDI throttling and the IO controller.  So I'll
definitely poke at it with a stick and we'll see what falls out.

-chris

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread

Thread overview: 21+ messages
2013-12-30 21:36 [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook Chris Mason
2013-12-31  8:49 ` Zheng Liu
2013-12-31  9:36   ` Jeff Liu
2013-12-31 12:45   ` [Lsf-pc] " Jan Kara
2013-12-31 13:19     ` Chris Mason
2013-12-31 14:22       ` Tao Ma
2013-12-31 15:34         ` Chris Mason
2014-01-02  6:46           ` Jan Kara
2014-01-02 15:21             ` Chris Mason
2014-01-02 16:01               ` tj
2014-01-02 16:14                 ` tj
2014-01-03  6:03                   ` Jan Kara
2014-01-02 17:06                 ` Vivek Goyal
2014-01-02 17:10                   ` tj
2014-01-02 19:11                     ` Chris Mason
2014-01-03  6:39                       ` Jan Kara
2014-01-02 18:27                 ` James Bottomley
2014-01-02 18:36                   ` tj
2014-01-03  7:44                     ` James Bottomley
2014-01-08 15:04       ` Mel Gorman
2014-01-08 16:14         ` Chris Mason
