linux-fsdevel.vger.kernel.org archive mirror
* [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools
@ 2018-01-24 22:02 Chris Mason
  2018-01-25  9:48 ` Jan Kara
  2018-02-07 21:44 ` Randy Dunlap
  0 siblings, 2 replies; 10+ messages in thread
From: Chris Mason @ 2018-01-24 22:02 UTC (permalink / raw)
  To: lsf-pc, linux-fsdevel

Hi everyone,

I'm really looking forward to LSF/MM this year.  I can bring along a 
fair amount of data from production about benchmarking and stability.

We've been expanding our btrfs rollout, and we're also fixing up 
priority inversions when cgroup IO controllers are put in place.  I 
think we have btrfs fixed up, but ext4 seems to be incompatible with IO 
controllers due to data=ordered IO.  We haven't tried XFS with the 
controllers yet but I don't think there will be any major blockers there.

I'm also hoping filesystem slab shrinking gets into the agenda, since we 
have a few ugly hacks there to keep production happy.

Thanks,
Chris


* Re: [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools
  2018-01-24 22:02 [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools Chris Mason
@ 2018-01-25  9:48 ` Jan Kara
  2018-01-25 13:41   ` Chris Mason
  2018-02-07 21:44 ` Randy Dunlap
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Kara @ 2018-01-25  9:48 UTC (permalink / raw)
  To: Chris Mason; +Cc: lsf-pc, linux-fsdevel

Hi Chris,

On Wed 24-01-18 17:02:47, Chris Mason wrote:
> I'm really looking forward to LSF/MM this year.  I can bring along a fair
> amount of data from production about benchmarking and stability.
> 
> We've been expanding our btrfs rollout, and we're also fixing up priority
> inversions when cgroup IO controllers are put in place.  I think we have
> btrfs fixed up, but ext4 seems to be incompatible with IO controllers due to
> data=ordered IO.

Yeah, I suspect I know what you hit but still I'd be interested in hearing
more details about your use case and the problems you see. Maybe it could be
helped.

> We haven't tried XFS with the controllers yet but I don't think there
> will be any major blockers there.
> 
> I'm also hoping filesystem slab shrinking gets into the agenda, since we
> have a few ugly hacks there to keep production happy.

I'm interested in this as well but I suspect we need someone willing to
spend a long time with slab reclaim guts to improve this. And so far nobody
was bothered enough...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools
  2018-01-25  9:48 ` Jan Kara
@ 2018-01-25 13:41   ` Chris Mason
  2018-02-07 10:32     ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Mason @ 2018-01-25 13:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: lsf-pc, linux-fsdevel


Hi Jan,

On 01/25/2018 04:48 AM, Jan Kara wrote:
> Hi Chris,
> 
> On Wed 24-01-18 17:02:47, Chris Mason wrote:
>> I'm really looking forward to LSF/MM this year.  I can bring along a fair
>> amount of data from production about benchmarking and stability.
>>
>> We've been expanding our btrfs rollout, and we're also fixing up priority
>> inversions when cgroup IO controllers are put in place.  I think we have
>> btrfs fixed up, but ext4 seems to be incompatible with IO controllers due to
>> data=ordered IO.
> 
> Yeah, I suspect I know what you hit but still I'd be interested in hearing
> more details about your usecase and the problems you see. Maybe it could be
> helped.
> 

Both btrfs and ext4 are root drive filesystems for us.  The IO 
controller is basically making sure the root drive isn't saturated by 
lower priority tasks, which might be anything from system updates to log 
files to actually part of the workload.

With ext4, the data=ordered IO done during transaction commits makes 
priority inversions that I don't see a way around.  It's dramatically 
better than ext3, but still happens enough that we can't enforce IO 
limits at all.  It really only takes one low prio IO to sneak into 
kjournald's list to wreck everything.
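
[A toy model of the inversion described above, as a sketch only.  It
assumes the commit has to wait for every data IO attached to the
transaction, and that each IO runs at the bandwidth limit of the cgroup
that submitted it.  The function names and numbers are made up for
illustration; none of them are kernel interfaces.]

```python
# Toy model, not kernel code: a data=ordered commit cannot finish until
# every data IO attached to the transaction has completed.

def io_time(size_mb, bandwidth_mb_s):
    """Seconds needed to write size_mb at the submitting cgroup's limit."""
    return size_mb / bandwidth_mb_s

def commit_latency(ordered_ios):
    """The commit waits for the slowest ordered IO, whoever submitted it."""
    return max(io_time(size, bw) for size, bw in ordered_ios)

# (size in MB, bandwidth limit in MB/s); the figures are illustrative only.
high_prio_only = [(8, 200), (4, 200)]
with_one_low_prio = high_prio_only + [(8, 1)]  # one IO throttled to 1 MB/s

print(commit_latency(high_prio_only))
print(commit_latency(with_one_low_prio))
```

[In this model a single throttled IO sets the latency of the whole
commit, which is the "one low prio IO" effect.]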

>> We haven't tried XFS with the controllers yet but I don't think there
>> will be any major blockers there.
>>
>> I'm also hoping filesystem slab shrinking gets into the agenda, since we
>> have a few ugly hacks there to keep production happy.
> 
> I'm interested in this as well but I suspect we need someone willing to
> spend a long time with slab reclaim guts to improve this. And so far nobody
> was bothered enough...

Yeah, we'll keep chipping away at this one, but it's going to take a while.

-chris


* Re: [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools
  2018-01-25 13:41   ` Chris Mason
@ 2018-02-07 10:32     ` Jan Kara
  2018-02-07 14:51       ` Chris Mason
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2018-02-07 10:32 UTC (permalink / raw)
  To: Chris Mason; +Cc: Jan Kara, lsf-pc, linux-fsdevel

Hi Chris,

On Thu 25-01-18 08:41:58, Chris Mason wrote:
> On 01/25/2018 04:48 AM, Jan Kara wrote:
> > Hi Chris,
> > 
> > On Wed 24-01-18 17:02:47, Chris Mason wrote:
> > > I'm really looking forward to LSF/MM this year.  I can bring along a fair
> > > amount of data from production about benchmarking and stability.
> > > 
> > > We've been expanding our btrfs rollout, and we're also fixing up priority
> > > inversions when cgroup IO controllers are put in place.  I think we have
> > > btrfs fixed up, but ext4 seems to be incompatible with IO controllers due to
> > > data=ordered IO.
> > 
> > Yeah, I suspect I know what you hit but still I'd be interested in hearing
> > more details about your usecase and the problems you see. Maybe it could be
> > helped.
> > 
> 
> Both btrfs and ext4 are root drive filesystems for us.  The IO controller is
> basically making sure the root drive isn't saturated by lower priority
> tasks, which might be anything from system updates to log files to actually
> part of the workload.
> 
> With ext4, the data=ordered IO done during transaction commits makes
> priority inversions that I don't see a way around.  It's dramatically better
> than ext3, but still happens enough that we can't enforce IO limits at all.
> It really only takes one low prio IO to sneak into kjournald's list to wreck
> everything.

AFAIU we could do something similar to what Tejun implemented for btrfs
metadata, where the submitter can override the blkcg to which the IO is
accounted. In ext4's case, if kjournald is doing the writeback, it would get
accounted to the root blkcg. That will allow containers to somewhat violate
the bounds set for their blkcg, but the priority inversion should be rarer.
Sadly, we cannot easily make it go away completely: if the original process
not only attaches the inode to the transaction but also submits the data
blocks with low priority, the transaction commit still has to wait for this
IO to complete, so the whole commit will still be blocked.
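
[A sketch of the two cases above, under the assumption that an IO's
speed is set by the limit of the blkcg it is charged to.  The override
helper and the cgroup names are hypothetical; this is not how the kernel
expresses it.]

```python
# Toy model, not kernel code: writeback issued by the journal thread is
# charged to the root blkcg; IO already submitted by the task is not.

ROOT = "root"

def effective_cgroup(owner_cgroup, submitted_by_journal):
    """Hypothetical override: the journal thread's writeback goes to root."""
    return ROOT if submitted_by_journal else owner_cgroup

def commit_wait(ios, limits_mb_s):
    """ios: (size_mb, owner_cgroup, submitted_by_journal) tuples.
    The commit waits for the slowest of them."""
    return max(size / limits_mb_s[effective_cgroup(owner, by_journal)]
               for size, owner, by_journal in ios)

limits = {ROOT: 200.0, "lowprio": 1.0}   # MB/s, illustrative numbers

# The journal thread writes back the low-prio inode's pages itself.
flushed_by_journal = [(8, "lowprio", True)]
# The low-prio task already submitted the pages under its own limit.
already_in_flight = [(8, "lowprio", False)]

print(commit_wait(flushed_by_journal, limits))
print(commit_wait(already_in_flight, limits))
```

[The first case escapes the throttle at the cost of bypassing the
container's limit; the second still stalls the commit.]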

So probably a better fix would be to introduce another data journalling
mode for ext4 where we'd unconditionally use unwritten extents for data
writeback. We actually have it implemented in ext4, hidden behind the
'dioread_nolock' mount option, but it needs more polishing and possibly
testing.
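
[A minimal sketch of the unwritten-extent idea, assuming only the
documented property that an unwritten extent reads back as zeros until
it is converted.  The class and method names are invented for the
example and do not correspond to ext4 functions.]

```python
# Toy model, not ext4 code: an extent allocated as "unwritten" reads back
# as zeros until the data IO completes and the extent is converted.

class Extent:
    def __init__(self, stale_contents):
        self.blocks = stale_contents   # whatever was on disk before allocation
        self.written = False           # allocated in the unwritten state

    def read(self):
        # Unwritten extents never expose the underlying blocks.
        return self.blocks if self.written else bytes(len(self.blocks))

    def data_io(self, data):
        # Data writeback lands on disk, possibly well after the commit.
        assert len(data) == len(self.blocks)
        self.blocks = data

    def convert_on_io_completion(self):
        # Metadata update performed once the data IO has finished.
        self.written = True


ext = Extent(stale_contents=b"old secret")
# The allocating transaction could commit here without waiting for data:
# a crash at this point would show zeros rather than stale blocks.
print(ext.read())
ext.data_io(b"new data!!")
ext.convert_on_io_completion()
print(ext.read())                 # b'new data!!'
```

[Because stale blocks are never visible, the commit in this model has no
reason to wait for the data IO; the price is the extra conversion step.]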

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools
  2018-02-07 10:32     ` Jan Kara
@ 2018-02-07 14:51       ` Chris Mason
  2018-02-07 16:37         ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Mason @ 2018-02-07 14:51 UTC (permalink / raw)
  To: Jan Kara; +Cc: lsf-pc, linux-fsdevel

On 7 Feb 2018, at 5:32, Jan Kara wrote:

> Hi Chris,
>
> On Thu 25-01-18 08:41:58, Chris Mason wrote:
>>
>> With ext4, the data=ordered IO done during transaction commits makes
>> priority inversions that I don't see a way around.  It's dramatically better
>> than ext3, but still happens enough that we can't enforce IO limits at all.
>> It really only takes one low prio IO to sneak into kjournald's list to wreck
>> everything.
>
> AFAIU we could do a similar thing like what Tejun implemented for btrfs
> metadata where the submitter can override blkcg to which the IO is
> accounted. In ext4's case if kjournald is doing the writeback, it would get
> accounted to the root blkcg. It will allow containers to somewhat violate
> the bounds set to their blkcg but the priority inversion should be rarer -
> sadly we cannot easily make it completely go away as if the original process
> not only attaches the inode to the transaction but also submits the data
> blocks with low priority, transaction commit still has to wait for this IO to
> complete so the whole commit will be still blocked.

Yeah, I think this was the problem we hit.  balance_dirty_pages and 
friends will trigger low priority write back, and if kjournald ends up 
waiting on that, we're out of luck.

>
> So probably a better fix would be to introduce another data journalling
> mode for ext4 where we'd unconditionally use unwritten extents for data
> writeback. We actually have it implemented in ext4 hidden behind
> 'dioread_nolock' mount option but it needs more polishing and possibly
> testing.

I wonder how that compares in performance to my old data=guarded idea.  
I think a better step one might be to add tracepoints when blocks are 
added to the ordered list, so we can better understand if we're adding 
them in error.  It felt like it was happening more often than it should.

On the FB side, we found one more prio inversion in btrfs from the free 
space cache (IO going down as data instead of metadata) and we're 
testing the fix for that.  It should hopefully be the last one, and then 
we can compare how effective the different options are.

-chris


* Re: [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools
  2018-02-07 14:51       ` Chris Mason
@ 2018-02-07 16:37         ` Jan Kara
  2018-02-07 21:29           ` Chris Mason
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2018-02-07 16:37 UTC (permalink / raw)
  To: Chris Mason; +Cc: Jan Kara, lsf-pc, linux-fsdevel

On Wed 07-02-18 09:51:02, Chris Mason wrote:
> On 7 Feb 2018, at 5:32, Jan Kara wrote:
> > On Thu 25-01-18 08:41:58, Chris Mason wrote:
> > > 
> > > With ext4, the data=ordered IO done during transaction commits makes
> > > priority inversions that I don't see a way around.  It's dramatically
> > > better than ext3, but still happens enough that we can't enforce IO
> > > limits at all.  It really only takes one low prio IO to sneak into
> > > kjournald's list to wreck everything.
> > 
> > AFAIU we could do a similar thing like what Tejun implemented for btrfs
> > metadata where the submitter can override blkcg to which the IO is
> > accounted. In ext4's case if kjournald is doing the writeback, it would
> > get accounted to the root blkcg. It will allow containers to somewhat
> > violate the bounds set to their blkcg but the priority inversion should be
> > rarer - sadly we cannot easily make it completely go away as if the
> > original process not only attaches the inode to the transaction but also
> > submits the data blocks with low priority, transaction commit still has to
> > wait for this IO to complete so the whole commit will be still blocked.
> 
> Yeah, I think this was the problem we hit.  balance_dirty_pages and friends
> will trigger low priority write back, and if kjournald ends up waiting on
> that, we're out of luck.
> 
> > 
> > So probably a better fix would be to introduce another data journalling
> > mode for ext4 where we'd unconditionally use unwritten extents for data
> > writeback. We actually have it implemented in ext4 hidden behind
> > 'dioread_nolock' mount option but it needs more polishing and possibly
> > testing.
> 
> I wonder how that compares in performance to my old data=guarded idea.  I
> think a better step one might be to add tracepoints when blocks are added to
> the ordered list, so we can better understand if we're adding them in error.
> It felt like it was happening more often than it should.

In ext4 / jbd2 this mechanism is actually different from ext3. We don't
track individual blocks in the ordered list anymore; we just track inodes and
flush all mapped & dirty pages from those inodes when doing a transaction
commit. That potentially implies more work, but the locking imposed by the
original jbd scheme was incompatible with reasonably efficient writeback,
and it was generally a very special data writeback path that caused various
troubles.
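
[A rough comparison of the two tracking schemes, as a sketch under the
assumption that the only difference is what gets written at commit time.
The data layout and function names are invented for illustration and are
not jbd or jbd2 structures.]

```python
# Toy model, not jbd2 code: compare per-block ordered tracking (ext3 style)
# with per-inode tracking (ext4 / jbd2 style) at commit time.

def flush_block_tracked(ordered_blocks):
    """ext3 style: write exactly the blocks put on the ordered list."""
    return sorted(ordered_blocks)

def flush_inode_tracked(ordered_inodes, pages):
    """jbd2 style: write every mapped and dirty page of each tracked inode."""
    flushed = []
    for ino in sorted(ordered_inodes):
        for index, state in sorted(pages[ino].items()):
            if state["dirty"] and state["mapped"]:
                flushed.append((ino, index))
    return flushed

# Inode 12: page 0 was overwritten in place, page 1 got a newly allocated
# block, page 2 is dirty but still delayed-allocated (not mapped yet).
pages = {12: {0: {"dirty": True, "mapped": True},
              1: {"dirty": True, "mapped": True},
              2: {"dirty": True, "mapped": False}}}

print(flush_block_tracked({(12, 1)}))     # [(12, 1)]
print(flush_inode_tracked({12}, pages))   # [(12, 0), (12, 1)]
```

[Per-inode tracking writes page 0 as well, which is the "more work"
mentioned above, while the unmapped page is left for later writeback.]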

We do take care to add an inode to a transaction only when we allocate a new
block for it, so in this sense we shouldn't ever be adding them "in error".
You are right that we could optimize this with special handling for
writeback of blocks beyond end of file (which is what data=guarded was
about, if I remember right). However, with delayed allocation you then have
to be careful not to flush the new inode size to disk before flushing the
data blocks, so it gets more complicated.
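
[A sketch of the ordering constraint for appends, assuming the guarded
approach means the on-disk size only advances after the data IO for the
new range completes.  The class is hypothetical, and it further assumes
IOs complete in submission order, which real devices do not guarantee.]

```python
# Toy model, not filesystem code: the on-disk size only advances once the
# data IO covering the new range has completed.

class GuardedFile:
    def __init__(self):
        self.in_memory_size = 0
        self.on_disk_size = 0
        self.in_flight = []        # end offsets of appends not yet on disk

    def append(self, nbytes):
        self.in_memory_size += nbytes
        self.in_flight.append(self.in_memory_size)

    def data_io_complete(self):
        # Simplification: completions arrive in submission order.
        self.on_disk_size = self.in_flight.pop(0)

    def size_after_crash(self):
        # Only the size that reached disk survives a crash.
        return self.on_disk_size


f = GuardedFile()
f.append(4096)
f.append(4096)
print(f.size_after_crash())   # 0
f.data_io_complete()
print(f.size_after_crash())   # 4096
```

[A crash never exposes blocks past the on-disk size, so the commit does
not need to wait for the appended data in this model.]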

But creating unwritten extents and converting them to written on IO
completion certainly has non-negligible cost (that's why this is not the
default yet) so possibly the complication is worth it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools
  2018-02-07 16:37         ` Jan Kara
@ 2018-02-07 21:29           ` Chris Mason
  0 siblings, 0 replies; 10+ messages in thread
From: Chris Mason @ 2018-02-07 21:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: lsf-pc, linux-fsdevel

On 7 Feb 2018, at 11:37, Jan Kara wrote:

> On Wed 07-02-18 09:51:02, Chris Mason wrote:
>> On 7 Feb 2018, at 5:32, Jan Kara wrote:
>>> On Thu 25-01-18 08:41:58, Chris Mason wrote:
>>>>
>>>> With ext4, the data=ordered IO done during transaction commits makes
>>>> priority inversions that I don't see a way around.  It's dramatically
>>>> better than ext3, but still happens enough that we can't enforce IO
>>>> limits at all.  It really only takes one low prio IO to sneak into
>>>> kjournald's list to wreck everything.
>>>
>>> AFAIU we could do a similar thing like what Tejun implemented for btrfs
>>> metadata where the submitter can override blkcg to which the IO is
>>> accounted. In ext4's case if kjournald is doing the writeback, it would
>>> get accounted to the root blkcg. It will allow containers to somewhat
>>> violate the bounds set to their blkcg but the priority inversion should be
>>> rarer - sadly we cannot easily make it completely go away as if the
>>> original process not only attaches the inode to the transaction but also
>>> submits the data blocks with low priority, transaction commit still has to
>>> wait for this IO to complete so the whole commit will be still blocked.
>>
>> Yeah, I think this was the problem we hit.  balance_dirty_pages and friends
>> will trigger low priority write back, and if kjournald ends up waiting on
>> that, we're out of luck.
>>
>>>
>>> So probably a better fix would be to introduce another data journalling
>>> mode for ext4 where we'd unconditionally use unwritten extents for data
>>> writeback. We actually have it implemented in ext4 hidden behind
>>> 'dioread_nolock' mount option but it needs more polishing and possibly
>>> testing.
>>
>> I wonder how that compares in performance to my old data=guarded idea.  I
>> think a better step one might be to add tracepoints when blocks are added to
>> the ordered list, so we can better understand if we're adding them in error.
>> It felt like it was happening more often than it should.
>
> In ext4 / jbd2 this mechanism is actually different from ext3. We don't
> track individual blocks in ordered list anymore, we just track inodes and
> flush all mapped & dirty pages from those inodes when doing transaction
> commit. It does potentially imply more work but the locking imposed by the
> original jbd scheme was incompatible with reasonably efficient writeback
> and generally very special data writeback path causing various troubles.
>
> We do take care to add inode to a transaction only when we allocate new
> block for it so in this sense we shouldn't ever be adding them "in error".
> You are right that we could optimize this by doing a special handling for
> writeback of blocks beyond end of file (which is what data=guarded was
> about if I remember right) however with delayed allocation you then have to
> be careful not to flush new inode size to disk before flushing the data
> blocks so it gets more complicated.
>
> But creating unwritten extents and converting them to written on IO
> completion certainly has non-negligible cost (that's why this is not the
> default yet) so possibly the complication is worth it.
>

This isn't too far from what btrfs does, where we just update the 
metadata to point to the new blocks after the IO is done.  data=guarded 
was similar but much more of a hack ;)

-chris


* Re: [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools
  2018-01-24 22:02 [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools Chris Mason
  2018-01-25  9:48 ` Jan Kara
@ 2018-02-07 21:44 ` Randy Dunlap
  2018-02-07 22:43   ` Chris Mason
  1 sibling, 1 reply; 10+ messages in thread
From: Randy Dunlap @ 2018-02-07 21:44 UTC (permalink / raw)
  To: Chris Mason, lsf-pc, linux-fsdevel

On 01/24/2018 02:02 PM, Chris Mason wrote:
> Hi everyone,
> 
> I'm really looking forward to LSF/MM this year.  I can bring along a fair amount of data from production about benchmarking and stability.
> 
> We've been expanding our btrfs rollout, and we're also fixing up priority inversions when cgroup IO controllers are put in place.  I think we have btrfs fixed up, but ext4 seems to be incompatible with IO controllers due to data=ordered IO.  We haven't tried XFS with the controllers yet but I don't think there will be any major blockers there.
> 
> I'm also hoping filesystem slab shrinking gets into the agenda, since we have a few ugly hacks there to keep production happy.

Hi,
What about the "debugging tools" part of $Subject?

thanks,
-- 
~Randy


* Re: [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools
  2018-02-07 21:44 ` Randy Dunlap
@ 2018-02-07 22:43   ` Chris Mason
  2018-02-07 23:26     ` Randy Dunlap
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Mason @ 2018-02-07 22:43 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: lsf-pc, linux-fsdevel

On 7 Feb 2018, at 16:44, Randy Dunlap wrote:

> On 01/24/2018 02:02 PM, Chris Mason wrote:
>> Hi everyone,
>>
>> I'm really looking forward to LSF/MM this year.  I can bring along a 
>> fair amount of data from production about benchmarking and stability.
>>
>> We've been expanding our btrfs rollout, and we're also fixing up 
>> priority inversions when cgroup IO controllers are put in place.  I 
>> think we have btrfs fixed up, but ext4 seems to be incompatible with 
>> IO controllers due to data=ordered IO.  We haven't tried XFS with 
>> the controllers yet but I don't think there will be any major 
>> blockers there.
>>
>> I'm also hoping filesystem slab shrinking gets into the agenda, since 
>> we have a few ugly hacks there to keep production happy.
>
> Hi,
> What about the "debugging tools" part of $Subject?
>

We're mostly adapting and extending bpf as we find fun ways to use it 
for tracing, error injection etc.  I'm more a consumer of other people's 
work here, but I'm happy to join in discussions in this area and talk 
about the tools we're using most often.

Today I debugged an early ENOSPC problem without even adding a printk.  
The future feels weird.

-chris


* Re: [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools
  2018-02-07 22:43   ` Chris Mason
@ 2018-02-07 23:26     ` Randy Dunlap
  0 siblings, 0 replies; 10+ messages in thread
From: Randy Dunlap @ 2018-02-07 23:26 UTC (permalink / raw)
  To: Chris Mason; +Cc: lsf-pc, linux-fsdevel

On 02/07/2018 02:43 PM, Chris Mason wrote:
> On 7 Feb 2018, at 16:44, Randy Dunlap wrote:
> 
>> On 01/24/2018 02:02 PM, Chris Mason wrote:
>>> Hi everyone,
>>>
>>> I'm really looking forward to LSF/MM this year.  I can bring along a fair amount of data from production about benchmarking and stability.
>>>
>>> We've been expanding our btrfs rollout, and we're also fixing up priority inversions when cgroup IO controllers are put in place.  I think we have btrfs fixed up, but ext4 seems to be incompatible with IO controllers due to data=ordered IO.  We haven't tried XFS with the controllers yet but I don't think there will be any major blockers there.
>>>
>>> I'm also hoping filesystem slab shrinking gets into the agenda, since we have a few ugly hacks there to keep production happy.
>>
>> Hi,
>> What about the "debugging tools" part of $Subject?
>>
> 
> We're mostly adapting and extending bpf as we find fun ways to use it for tracing, error injection etc.  I'm more a consumer of other people's work here, but I'm happy to join in discussions in this area and talk about the tools we're using most often.
> 
> Today I debugged an early ENOSPC problem without even adding a printk.  The future feels weird.

That would be interesting (not that I'll be there).

thanks,
-- 
~Randy

