* ceph on non-btrfs file systems
@ 2011-10-24  1:54 Sage Weil
  2011-10-24 16:22 ` Christian Brunner
  0 siblings, 1 reply; 47+ messages in thread
From: Sage Weil @ 2011-10-24  1:54 UTC (permalink / raw)
  To: ceph-devel

Although running on ext4, xfs, or whatever other non-btrfs you want mostly 
works, there are a few important remaining issues:

1- ext4 limits total xattrs to 4KB.  This can cause problems in some 
cases, as Ceph uses xattrs extensively.  Most of the time we don't hit 
this.  We do hit the limit with radosgw pretty easily, though, and may 
also hit it in exceptional cases where the OSD cluster is very unhealthy.

There is a large xattr patch for ext4 from the Lustre folks that has been 
floating around for (I think) years.  Maybe as interest grows in running 
Ceph on ext4 this can move upstream.

Previously we were being forgiving about large setxattr failures on ext3, 
but we found that was leading to corruption in certain cases (because we 
couldn't set our internal metadata), so the next release will assert/crash 
in that case (fail-stop instead of fail-maybe-eventually-corrupt). 
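
As a rough illustration of that fail-stop behaviour (a user-space sketch only,
not the actual Ceph FileStore code; the helper name and the abort-on-error
policy are just assumptions for the example), the idea is simply to treat a
failed setxattr() as fatal instead of carrying on without the metadata:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>

/* Hypothetical helper, not Ceph code: on ext4 a large value can fail with
 * ENOSPC or E2BIG once the ~4KB per-inode xattr space is exhausted, and the
 * safest reaction is to stop rather than run without internal metadata. */
static void set_xattr_or_die(const char *path, const char *name,
                             const void *val, size_t len)
{
        if (setxattr(path, name, val, len, 0) < 0) {
                fprintf(stderr, "setxattr %s on %s: %s\n",
                        name, path, strerror(errno));
                abort();  /* fail-stop instead of fail-maybe-eventually-corrupt */
        }
}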

XFS does not have an xattr size limit and thus does not have this problem.

2- The other problem is with OSD journal replay of non-idempotent 
transactions.  On non-btrfs backends, the Ceph OSDs use a write-ahead 
journal.  After restart, the OSD does not know exactly which transactions 
in the journal may have already been committed to disk, and may reapply a 
transaction again during replay.  For most operations (write, delete, 
truncate) this is fine.

Some operations, though, are non-idempotent.  The simplest example is 
CLONE, which copies (efficiently, on btrfs) data from one object to 
another.  If the source object is modified, the osd restarts, and then 
the clone is replayed, the target will get incorrect (newer) data.  For 
example,

1- clone A -> B
2- modify A
   <osd crash, replay from 1>

B will get new instead of old contents.  

(This doesn't happen on btrfs because the snapshots allow us to replay 
from a known consistent point in time.)

For things like clone, skipping the operation if the target exists almost 
works, except for cases like

1- clone A -> B
2- modify A
...
3- delete B
   <osd crash, replay from 1>

(Although in that example who cares if B had bad data; it was removed 
anyway.)  The larger problem, though, is that this doesn't always work; 
CLONERANGE copies a range of a file from A to B, where B may already 
exist.  

In practice, the higher level interfaces don't make full use of the 
low-level interface, so it's possible some solution exists that carefully 
avoids the problem with a partial solution in the lower layer.  This makes 
me nervous, though, as it is easy to break.

Another possibility:

 - on non-btrfs, we set an xattr on every modified object with the 
   op_seq, the unique sequence number for the transaction.
 - for any (potentially) non-idempotent operation, we fsync() before 
   continuing to the next transaction, to ensure that xattr hits disk.
 - on replay, we skip a transaction if the xattr indicates we already 
   performed this transaction.

Because every 'transaction' only modifies a single object (file), 
this ought to work.  It'll make things like clone slow, but let's face it: 
they're already slow on non-btrfs file systems because they actually copy 
the data (instead of duplicating the extent refs in btrfs).  And it should 
make the full ObjectStore interface safe, without upper layers having to 
worry about the kinds and orders of transactions they perform.
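
To make that concrete, here is a rough user-space sketch of such a guard (the
xattr name, helper functions, and use of fsync() on the object fd are my
assumptions for illustration, not Ceph's actual implementation): each applied
transaction records its op_seq in an xattr on the object, and replay consults
that xattr before reapplying a non-idempotent operation.

#include <stdint.h>
#include <unistd.h>
#include <sys/xattr.h>

#define GUARD_XATTR "user.osd.op_seq"   /* made-up name for this sketch */

/* On replay: skip the transaction if the object already records an op_seq
 * greater than or equal to the transaction's. */
static int already_applied(int fd, uint64_t op_seq)
{
        uint64_t seen = 0;
        if (fgetxattr(fd, GUARD_XATTR, &seen, sizeof(seen)) < 0)
                return 0;               /* no guard yet: not applied */
        return seen >= op_seq;
}

/* After a (potentially) non-idempotent op: record the op_seq and make sure
 * it reaches disk before the next transaction is processed. */
static int mark_applied(int fd, uint64_t op_seq)
{
        if (fsetxattr(fd, GUARD_XATTR, &op_seq, sizeof(op_seq), 0) < 0)
                return -1;
        return fsync(fd);
}

The ordering is the important part: the guard xattr has to be durable before
the journal moves on, otherwise replay can't trust it.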

Other ideas?

This issue is tracked at http://tracker.newdream.net/issues/213.

sage



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on non-btrfs file systems
  2011-10-24  1:54 ceph on non-btrfs file systems Sage Weil
@ 2011-10-24 16:22 ` Christian Brunner
  2011-10-24 17:06   ` ceph on btrfs [was Re: ceph on non-btrfs file systems] Sage Weil
  0 siblings, 1 reply; 47+ messages in thread
From: Christian Brunner @ 2011-10-24 16:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Thanks for explaining this. I don't have any objections against btrfs
as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
scare me, since I can use the ceph replication to recover a lost
btrfs-filesystem. The only problem I have is that btrfs is not stable
on our side, and I wonder what you are doing to make it work. (Maybe
it's related to the load pattern of using ceph as a backend store for
qemu).

Here is a list of the btrfs problems I'm having:

- When I run ceph with the default configuration (btrfs snaps enabled)
I can see a rapid increase in Disk-I/O after a few hours of uptime.
Btrfs-cleaner is using more and more time in
btrfs_clean_old_snapshots().
- When I run ceph with btrfs snaps disabled, the situation is getting
slightly better. I can run an OSD for about 3 days without problems,
but then again the load increases. This time, I can see that the
ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
than usual.

Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
from time to time. Maybe it's related to the performance issues, but I
haven't been able to verify this.

It's really sad to see that ceph performance and stability are
suffering that much from the underlying filesystems and that this
hasn't changed over the last months.

Kind regards,
Christian

2011/10/24 Sage Weil <sage@newdream.net>:
> Although running on ext4, xfs, or whatever other non-btrfs you want mostly
> works, there are a few important remaining issues:
>
> 1- ext4 limits total xattrs for 4KB.  This can cause problems in some
> cases, as Ceph uses xattrs extensively.  Most of the time we don't hit
> this.  We do hit the limit with radosgw pretty easily, though, and may
> also hit it in exceptional cases where the OSD cluster is very unhealthy.
>
> There is a large xattr patch for ext4 from the Lustre folks that has been
> floating around for (I think) years.  Maybe as interest grows in running
> Ceph on ext4 this can move upstream.
>
> Previously we were being forgiving about large setxattr failures on ext3,
> but we found that was leading to corruption in certain cases (because we
> couldn't set our internal metadata), so the next release will assert/crash
> in that case (fail-stop instead of fail-maybe-eventually-corrupt).
>
> XFS does not have an xattr size limit and thus does not have this problem.
>
> 2- The other problem is with OSD journal replay of non-idempotent
> transactions.  On non-btrfs backends, the Ceph OSDs use a write-ahead
> journal.  After restart, the OSD does not know exactly which transactions
> in the journal may have already been committed to disk, and may reapply a
> transaction again during replay.  For most operations (write, delete,
> truncate) this is fine.
>
> Some operations, though, are non-idempotent.  The simplest example is
> CLONE, which copies (efficiently, on btrfs) data from one object to
> another.  If the source object is modified, the osd restarts, and then
> the clone is replayed, the target will get incorrect (newer) data.  For
> example,
>
> 1- clone A -> B
> 2- modify A
>   <osd crash, replay from 1>
>
> B will get new instead of old contents.
>
> (This doesn't happen on btrfs because the snapshots allow us to replay
> from a known consistent point in time.)
>
> For things like clone, skipping the operation of the target exists almost
> works, except for cases like
>
> 1- clone A -> B
> 2- modify A
> ...
> 3- delete B
>   <osd crash, replay from 1>
>
> (Although in that example who cares if B had bad data; it was removed
> anyway.)  The larger problem, though, is that that doesn't always work;
> CLONERANGE copies a range of a file from A to B, where B may already
> exist.
>
> In practice, the higher level interfaces don't make full use of the
> low-level interface, so it's possible some solution exists that careful
> avoids the problem with a partial solution in the lower layer.  This makes
> me nervous, though, as it is easy to break.
>
> Another possibility:
>
>  - on non-btrfs, we set a xattr on every modified object with the
>   op_seq, the unique sequence number for the transaction.
>  - for any (potentially) non-idempotent operation, we fsync() before
>   continuing to the next transaction, to ensure that xattr hits disk.
>  - on replay, we skip a transaction if the xattr indicates we already
>   performed this transaction.
>
> Because every 'transaction' only modifies on a single object (file),
> this ought to work.  It'll make things like clone slow, but let's face it:
> they're already slow on non-btrfs file systems because they actually copy
> the data (instead of duplicating the extent refs in btrfs).  And it should
> make the full ObjectStore iterface safe, without upper layers having to
> worry about the kinds and orders of transactions they perform.
>
> Other ideas?
>
> This issue is tracked at http://tracker.newdream.net/issues/213.
>
> sage
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-24 16:22 ` Christian Brunner
@ 2011-10-24 17:06   ` Sage Weil
  2011-10-24 19:51     ` Josef Bacik
  2011-10-25 10:23     ` Christoph Hellwig
  0 siblings, 2 replies; 47+ messages in thread
From: Sage Weil @ 2011-10-24 17:06 UTC (permalink / raw)
  To: Christian Brunner; +Cc: ceph-devel, linux-btrfs

[-- Attachment #1: Type: TEXT/PLAIN, Size: 6878 bytes --]

[adding linux-btrfs to cc]

Josef, Chris, any ideas on the below issues?

On Mon, 24 Oct 2011, Christian Brunner wrote:
> Thanks for explaining this. I don't have any objections against btrfs
> as a osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> scare me, since I can use the ceph replication to recover a lost
> btrfs-filesystem. The only problem I have is, that btrfs is not stable
> on our side and I wonder what you are doing to make it work. (Maybe
> it's related to the load pattern of using ceph as a backend store for
> qemu).
> 
> Here is a list of the btrfs problems I'm having:
> 
> - When I run ceph with the default configuration (btrfs snaps enabled)
> I can see a rapid increase in Disk-I/O after a few hours of uptime.
> Btrfs-cleaner is using more and more time in
> btrfs_clean_old_snapshots().

In theory, there shouldn't be any significant difference between taking a 
snapshot and removing it a few commits later, and the prior root refs that 
btrfs holds on to internally until the new commit is complete.  That's 
clearly not quite the case, though.

In any case, we're going to try to reproduce this issue in our 
environment.

> - When I run ceph with btrfs snaps disabled, the situation is getting
> slightly better. I can run an OSD for about 3 days without problems,
> but then again the load increases. This time, I can see that the
> ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> than usual.

FYI in this scenario you're exposed to the same journal replay issues that 
ext4 and XFS are.  The btrfs workload that ceph is generating will also 
not be all that special, though, so this problem shouldn't be unique to 
ceph.

> Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
> from time to time. Maybe it's related to the performance issues, but
> seems to be able to verify this.

I haven't seen this yet with the latest stuff from Josef, but others have.  
Josef, is there any information we can provide to help track it down?

> It's really sad to see, that ceph performance and stability is
> suffering that much from the underlying filesystems and that this
> hasn't changed over the last months.

We don't have anyone internally working on btrfs at the moment, and are 
still struggling to hire experienced kernel/fs people.  Josef has been 
very helpful with tracking these issues down, but he has responsibilities 
beyond just the Ceph related issues.  Progress is slow, but we are 
working on it!

sage


> 
> Kind regards,
> Christian
> 
> 2011/10/24 Sage Weil <sage@newdream.net>:
> > Although running on ext4, xfs, or whatever other non-btrfs you want mostly
> > works, there are a few important remaining issues:
> >
> > 1- ext4 limits total xattrs for 4KB.  This can cause problems in some
> > cases, as Ceph uses xattrs extensively.  Most of the time we don't hit
> > this.  We do hit the limit with radosgw pretty easily, though, and may
> > also hit it in exceptional cases where the OSD cluster is very unhealthy.
> >
> > There is a large xattr patch for ext4 from the Lustre folks that has been
> > floating around for (I think) years.  Maybe as interest grows in running
> > Ceph on ext4 this can move upstream.
> >
> > Previously we were being forgiving about large setxattr failures on ext3,
> > but we found that was leading to corruption in certain cases (because we
> > couldn't set our internal metadata), so the next release will assert/crash
> > in that case (fail-stop instead of fail-maybe-eventually-corrupt).
> >
> > XFS does not have an xattr size limit and thus does not have this problem.
> >
> > 2- The other problem is with OSD journal replay of non-idempotent
> > transactions.  On non-btrfs backends, the Ceph OSDs use a write-ahead
> > journal.  After restart, the OSD does not know exactly which transactions
> > in the journal may have already been committed to disk, and may reapply a
> > transaction again during replay.  For most operations (write, delete,
> > truncate) this is fine.
> >
> > Some operations, though, are non-idempotent.  The simplest example is
> > CLONE, which copies (efficiently, on btrfs) data from one object to
> > another.  If the source object is modified, the osd restarts, and then
> > the clone is replayed, the target will get incorrect (newer) data.  For
> > example,
> >
> > 1- clone A -> B
> > 2- modify A
> >   <osd crash, replay from 1>
> >
> > B will get new instead of old contents.
> >
> > (This doesn't happen on btrfs because the snapshots allow us to replay
> > from a known consistent point in time.)
> >
> > For things like clone, skipping the operation of the target exists almost
> > works, except for cases like
> >
> > 1- clone A -> B
> > 2- modify A
> > ...
> > 3- delete B
> >   <osd crash, replay from 1>
> >
> > (Although in that example who cares if B had bad data; it was removed
> > anyway.)  The larger problem, though, is that that doesn't always work;
> > CLONERANGE copies a range of a file from A to B, where B may already
> > exist.
> >
> > In practice, the higher level interfaces don't make full use of the
> > low-level interface, so it's possible some solution exists that careful
> > avoids the problem with a partial solution in the lower layer.  This makes
> > me nervous, though, as it is easy to break.
> >
> > Another possibility:
> >
> >  - on non-btrfs, we set a xattr on every modified object with the
> >   op_seq, the unique sequence number for the transaction.
> >  - for any (potentially) non-idempotent operation, we fsync() before
> >   continuing to the next transaction, to ensure that xattr hits disk.
> >  - on replay, we skip a transaction if the xattr indicates we already
> >   performed this transaction.
> >
> > Because every 'transaction' only modifies on a single object (file),
> > this ought to work.  It'll make things like clone slow, but let's face it:
> > they're already slow on non-btrfs file systems because they actually copy
> > the data (instead of duplicating the extent refs in btrfs).  And it should
> > make the full ObjectStore iterface safe, without upper layers having to
> > worry about the kinds and orders of transactions they perform.
> >
> > Other ideas?
> >
> > This issue is tracked at http://tracker.newdream.net/issues/213.
> >
> > sage
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-24 17:06   ` ceph on btrfs [was Re: ceph on non-btrfs file systems] Sage Weil
@ 2011-10-24 19:51     ` Josef Bacik
  2011-10-24 20:35       ` Chris Mason
  2011-10-25 11:56       ` Christian Brunner
  2011-10-25 10:23     ` Christoph Hellwig
  1 sibling, 2 replies; 47+ messages in thread
From: Josef Bacik @ 2011-10-24 19:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: Christian Brunner, ceph-devel, linux-btrfs

On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> [adding linux-btrfs to cc]
> 
> Josef, Chris, any ideas on the below issues?
> 
> On Mon, 24 Oct 2011, Christian Brunner wrote:
> > Thanks for explaining this. I don't have any objections against btrfs
> > as a osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> > scare me, since I can use the ceph replication to recover a lost
> > btrfs-filesystem. The only problem I have is, that btrfs is not stable
> > on our side and I wonder what you are doing to make it work. (Maybe
> > it's related to the load pattern of using ceph as a backend store for
> > qemu).
> > 
> > Here is a list of the btrfs problems I'm having:
> > 
> > - When I run ceph with the default configuration (btrfs snaps enabled)
> > I can see a rapid increase in Disk-I/O after a few hours of uptime.
> > Btrfs-cleaner is using more and more time in
> > btrfs_clean_old_snapshots().
> 
> In theory, there shouldn't be any significant difference between taking a 
> snapshot and removing it a few commits later, and the prior root refs that 
> btrfs holds on to internally until the new commit is complete.  That's 
> clearly not quite the case, though.
> 
> In any case, we're going to try to reproduce this issue in our 
> environment.
> 

I've noticed this problem too, clean_old_snapshots is taking quite a while in
cases where it really shouldn't.  I will see if I can come up with a reproducer
that doesn't require setting up ceph ;).

> > - When I run ceph with btrfs snaps disabled, the situation is getting
> > slightly better. I can run an OSD for about 3 days without problems,
> > but then again the load increases. This time, I can see that the
> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > than usual.
> 
> FYI in this scenario you're exposed to the same journal replay issues that 
> ext4 and XFS are.  The btrfs workload that ceph is generating will also 
> not be all that special, though, so this problem shouldn't be unique to 
> ceph.
> 

Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
is up to.

> > Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
> > from time to time. Maybe it's related to the performance issues, but
> > seems to be able to verify this.
> 
> I haven't seen this yet with the latest stuff from Josef, but others have.  
> Josef, is there any information we can provide to help track it down?
>

Actually this would show up in 2 cases: I fixed the one most people hit with my
earlier stuff and then fixed the other one more recently; hopefully it will be
fixed in 3.2.  A full backtrace would be nice so I can figure out which one it
is you are hitting.
 
> > It's really sad to see, that ceph performance and stability is
> > suffering that much from the underlying filesystems and that this
> > hasn't changed over the last months.
> 
> We don't have anyone internally working on btrfs at the moment, and are 
> still struggling to hire experienced kernel/fs people.  Josef has been 
> very helpful with tracking these issues down, but he hass responsibilities 
> beyond just the Ceph related issues.  Progress is slow, but we are 
> working on it!

I'm open to offers ;).  These things are being hit by people all over the place,
but it's hard for me to reproduce, especially since most of the reports are "run
X server for Y days and wait for it to start sucking."  I will try and get a box
set up that I can let stress.sh run on for a few days to see if I can make some
of this stuff come out to play with me, but unfortunately I end up having to
debug these kinds of things over email, which means they get a whole lot of
nowhere.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-24 19:51     ` Josef Bacik
@ 2011-10-24 20:35       ` Chris Mason
  2011-10-24 21:34           ` Christian Brunner
  2011-10-25 11:56       ` Christian Brunner
  1 sibling, 1 reply; 47+ messages in thread
From: Chris Mason @ 2011-10-24 20:35 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Sage Weil, Christian Brunner, ceph-devel, linux-btrfs

On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > [adding linux-btrfs to cc]
> > 
> > Josef, Chris, any ideas on the below issues?
> > 
> > On Mon, 24 Oct 2011, Christian Brunner wrote:
> > > Thanks for explaining this. I don't have any objections against btrfs
> > > as a osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> > > scare me, since I can use the ceph replication to recover a lost
> > > btrfs-filesystem. The only problem I have is, that btrfs is not stable
> > > on our side and I wonder what you are doing to make it work. (Maybe
> > > it's related to the load pattern of using ceph as a backend store for
> > > qemu).
> > > 
> > > Here is a list of the btrfs problems I'm having:
> > > 
> > > - When I run ceph with the default configuration (btrfs snaps enabled)
> > > I can see a rapid increase in Disk-I/O after a few hours of uptime.
> > > Btrfs-cleaner is using more and more time in
> > > btrfs_clean_old_snapshots().
> > 
> > In theory, there shouldn't be any significant difference between taking a 
> > snapshot and removing it a few commits later, and the prior root refs that 
> > btrfs holds on to internally until the new commit is complete.  That's 
> > clearly not quite the case, though.
> > 
> > In any case, we're going to try to reproduce this issue in our 
> > environment.
> > 
> 
> I've noticed this problem too, clean_old_snapshots is taking quite a while in
> cases where it really shouldn't.  I will see if I can come up with a reproducer
> that doesn't require setting up ceph ;).

This sounds familiar though, I thought we had fixed a similar
regression.  Either way, Arne's readahead code should really help.

Which kernel version were you running?

[ ack on the rest of Josef's comments ]

-chris

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-24 20:35       ` Chris Mason
@ 2011-10-24 21:34           ` Christian Brunner
  0 siblings, 0 replies; 47+ messages in thread
From: Christian Brunner @ 2011-10-24 21:34 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, Sage Weil, Christian Brunner,
	ceph-devel, linux-btrfs

2011/10/24 Chris Mason <chris.mason@oracle.com>:
> On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
>> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>> > [adding linux-btrfs to cc]
>> >
>> > Josef, Chris, any ideas on the below issues?
>> >
>> > On Mon, 24 Oct 2011, Christian Brunner wrote:
>> > > Thanks for explaining this. I don't have any objections against btrfs
>> > > as a osd filesystem. Even the fact that there is no btrfs-fsck doesn't
>> > > scare me, since I can use the ceph replication to recover a lost
>> > > btrfs-filesystem. The only problem I have is, that btrfs is not stable
>> > > on our side and I wonder what you are doing to make it work. (Maybe
>> > > it's related to the load pattern of using ceph as a backend store for
>> > > qemu).
>> > >
>> > > Here is a list of the btrfs problems I'm having:
>> > >
>> > > - When I run ceph with the default configuration (btrfs snaps enabled)
>> > > I can see a rapid increase in Disk-I/O after a few hours of uptime.
>> > > Btrfs-cleaner is using more and more time in
>> > > btrfs_clean_old_snapshots().
>> >
>> > In theory, there shouldn't be any significant difference between taking a
>> > snapshot and removing it a few commits later, and the prior root refs that
>> > btrfs holds on to internally until the new commit is complete.  That's
>> > clearly not quite the case, though.
>> >
>> > In any case, we're going to try to reproduce this issue in our
>> > environment.
>> >
>>
>> I've noticed this problem too, clean_old_snapshots is taking quite a while in
>> cases where it really shouldn't.  I will see if I can come up with a reproducer
>> that doesn't require setting up ceph ;).
>
> This sounds familiar though, I thought we had fixed a similar
> regression.  Either way, Arne's readahead code should really help.
>
> Which kernel version were you running?
>
> [ ack on the rest of Josef's comments ]

This was with a 3.0 kernel, including all btrfs-patches from josefs
git repo plus the "use the global reserve when truncating the free
space cache inode" patch.

I'll try the readahead code.

Thanks,
Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-24 21:34           ` Christian Brunner
  0 siblings, 0 replies; 47+ messages in thread
From: Christian Brunner @ 2011-10-24 21:34 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, Sage Weil, Christian Brunner,
	ceph-devel, linux-btrfs

2011/10/24 Chris Mason <chris.mason@oracle.com>:
> On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
>> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>> > [adding linux-btrfs to cc]
>> >
>> > Josef, Chris, any ideas on the below issues?
>> >
>> > On Mon, 24 Oct 2011, Christian Brunner wrote:
>> > > Thanks for explaining this. I don't have any objections against btrfs
>> > > as a osd filesystem. Even the fact that there is no btrfs-fsck doesn't
>> > > scare me, since I can use the ceph replication to recover a lost
>> > > btrfs-filesystem. The only problem I have is, that btrfs is not stable
>> > > on our side and I wonder what you are doing to make it work. (Maybe
>> > > it's related to the load pattern of using ceph as a backend store for
>> > > qemu).
>> > >
>> > > Here is a list of the btrfs problems I'm having:
>> > >
>> > > - When I run ceph with the default configuration (btrfs snaps enabled)
>> > > I can see a rapid increase in Disk-I/O after a few hours of uptime.
>> > > Btrfs-cleaner is using more and more time in
>> > > btrfs_clean_old_snapshots().
>> >
>> > In theory, there shouldn't be any significant difference between taking a
>> > snapshot and removing it a few commits later, and the prior root refs that
>> > btrfs holds on to internally until the new commit is complete.  That's
>> > clearly not quite the case, though.
>> >
>> > In any case, we're going to try to reproduce this issue in our
>> > environment.
>> >
>>
>> I've noticed this problem too, clean_old_snapshots is taking quite a while in
>> cases where it really shouldn't.  I will see if I can come up with a reproducer
>> that doesn't require setting up ceph ;).
>
> This sounds familiar though, I thought we had fixed a similar
> regression.  Either way, Arne's readahead code should really help.
>
> Which kernel version were you running?
>
> [ ack on the rest of Josef's comments ]

This was with a 3.0 kernel, including all btrfs-patches from josefs
git repo plus the "use the global reserve when truncating the free
space cache inode" patch.

I'll try the readahead code.

Thanks,
Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-24 21:34           ` Christian Brunner
  (?)
@ 2011-10-24 21:37           ` Arne Jansen
  -1 siblings, 0 replies; 47+ messages in thread
From: Arne Jansen @ 2011-10-24 21:37 UTC (permalink / raw)
  To: chb; +Cc: Chris Mason, Josef Bacik, Sage Weil, ceph-devel, linux-btrfs

On 24.10.2011 23:34, Christian Brunner wrote:
> 2011/10/24 Chris Mason<chris.mason@oracle.com>:
>> On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
>>> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>>>> [adding linux-btrfs to cc]
>>>>
>>>> Josef, Chris, any ideas on the below issues?
>>>>
>>>> On Mon, 24 Oct 2011, Christian Brunner wrote:
>>>>> Thanks for explaining this. I don't have any objections against btrfs
>>>>> as a osd filesystem. Even the fact that there is no btrfs-fsck doesn't
>>>>> scare me, since I can use the ceph replication to recover a lost
>>>>> btrfs-filesystem. The only problem I have is, that btrfs is not stable
>>>>> on our side and I wonder what you are doing to make it work. (Maybe
>>>>> it's related to the load pattern of using ceph as a backend store for
>>>>> qemu).
>>>>>
>>>>> Here is a list of the btrfs problems I'm having:
>>>>>
>>>>> - When I run ceph with the default configuration (btrfs snaps enabled)
>>>>> I can see a rapid increase in Disk-I/O after a few hours of uptime.
>>>>> Btrfs-cleaner is using more and more time in
>>>>> btrfs_clean_old_snapshots().
>>>>
>>>> In theory, there shouldn't be any significant difference between taking a
>>>> snapshot and removing it a few commits later, and the prior root refs that
>>>> btrfs holds on to internally until the new commit is complete.  That's
>>>> clearly not quite the case, though.
>>>>
>>>> In any case, we're going to try to reproduce this issue in our
>>>> environment.
>>>>
>>>
>>> I've noticed this problem too, clean_old_snapshots is taking quite a while in
>>> cases where it really shouldn't.  I will see if I can come up with a reproducer
>>> that doesn't require setting up ceph ;).
>>
>> This sounds familiar though, I thought we had fixed a similar
>> regression.  Either way, Arne's readahead code should really help.
>>
>> Which kernel version were you running?
>>
>> [ ack on the rest of Josef's comments ]
>
> This was with a 3.0 kernel, including all btrfs-patches from josefs
> git repo plus the "use the global reserve when truncating the free
> space cache inode" patch.
>
> I'll try the readahead code.

The current readahead code is only used for scrub. I plan to extend it
to snapshot deletion as a next step, but currently I'm afraid it can't
help.

-Arne

>
> Thanks,
> Christian
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-24 17:06   ` ceph on btrfs [was Re: ceph on non-btrfs file systems] Sage Weil
  2011-10-24 19:51     ` Josef Bacik
@ 2011-10-25 10:23     ` Christoph Hellwig
  2011-10-25 16:23       ` Sage Weil
  1 sibling, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2011-10-25 10:23 UTC (permalink / raw)
  To: Sage Weil; +Cc: Christian Brunner, ceph-devel, linux-btrfs

On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > - When I run ceph with btrfs snaps disabled, the situation is getting
> > slightly better. I can run an OSD for about 3 days without problems,
> > but then again the load increases. This time, I can see that the
> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > than usual.
> 
> FYI in this scenario you're exposed to the same journal replay issues that 
> ext4 and XFS are.  The btrfs workload that ceph is generating will also 
> not be all that special, though, so this problem shouldn't be unique to 
> ceph.

What journal replay issues would ext4 and XFS be exposed to?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-24 19:51     ` Josef Bacik
  2011-10-24 20:35       ` Chris Mason
@ 2011-10-25 11:56       ` Christian Brunner
  2011-10-25 12:23           ` Josef Bacik
  2011-10-27 19:52           ` Josef Bacik
  1 sibling, 2 replies; 47+ messages in thread
From: Christian Brunner @ 2011-10-25 11:56 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Sage Weil, ceph-devel, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2441 bytes --]

2011/10/24 Josef Bacik <josef@redhat.com>:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>> [adding linux-btrfs to cc]
>>
>> Josef, Chris, any ideas on the below issues?
>>
>> On Mon, 24 Oct 2011, Christian Brunner wrote:
>> >
>> > - When I run ceph with btrfs snaps disabled, the situation is getting
>> > slightly better. I can run an OSD for about 3 days without problems,
>> > but then again the load increases. This time, I can see that the
>> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
>> > than usual.
>>
>> FYI in this scenario you're exposed to the same journal replay issues that
>> ext4 and XFS are.  The btrfs workload that ceph is generating will also
>> not be all that special, though, so this problem shouldn't be unique to
>> ceph.
>>
>
> Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
> is up to.

Capturing this seems to be not easy. I have a few traces (see
attachment), but with sysrq+w I do not get a stacktrace of
btrfs-endio-write. What I have is a "latencytop -c" output which is
interesting:

In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
tries to balance the load over all OSDs, so all filesystems should get
a nearly equal load. At the moment one filesystem seems to have a
problem. When running iostat I see the following:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdd               0.00     0.00    0.00    4.33     0.00    53.33    12.31     0.08   19.38  12.23   5.30
sdc               0.00     1.00    0.00  228.33     0.00  1957.33     8.57    74.33  380.76   2.74  62.57
sdb               0.00     0.00    0.00    1.33     0.00    16.00    12.00     0.03   25.00  19.75   2.63
sda               0.00     0.00    0.00    0.67     0.00     8.00    12.00     0.01   19.50  12.50   0.83

The PID of the ceph-osd that is running on sdc is 2053 and when I look
with top I see this process and a btrfs-endio-writer (PID 5447):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2053 root      20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
 5447 root      20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri

In the latencytop output you can see that those processes have a much
higher latency than the other ceph-osd and btrfs-endio-writers.

Regards,
Christian

[-- Attachment #2: dmesg.txt.bz2 --]
[-- Type: application/x-bzip2, Size: 39096 bytes --]

[-- Attachment #3: latencytop.txt.bz2 --]
[-- Type: application/x-bzip2, Size: 5661 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-25 11:56       ` Christian Brunner
@ 2011-10-25 12:23           ` Josef Bacik
  2011-10-27 19:52           ` Josef Bacik
  1 sibling, 0 replies; 47+ messages in thread
From: Josef Bacik @ 2011-10-25 12:23 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Josef Bacik, Sage Weil, ceph-devel, linux-btrfs

On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik <josef@redhat.com>:
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> [adding linux-btrfs to cc]
> >>
> >> Josef, Chris, any ideas on the below issues?
> >>
> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >
> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> > but then again the load increases. This time, I can see that the
> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> > than usual.
> >>
> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> not be all that special, though, so this problem shouldn't be unique to
> >> ceph.
> >>
> >
> > Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
> > is up to.
> 
> Capturing this seems to be not easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stacktrace of
> btrfs-endio-write. What I have is a "latencytop -c" output which is
> interesting:
> 
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> an nearly equal load. At the moment one filesystem seems to have a
> problem. When running with iostat I see the following
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
> avgrq-sz avgqu-sz   await  svctm  %util
> sdd               0.00     0.00    0.00    4.33     0.00    53.33
> 12.31     0.08   19.38  12.23   5.30
> sdc               0.00     1.00    0.00  228.33     0.00  1957.33
> 8.57    74.33  380.76   2.74  62.57
> sdb               0.00     0.00    0.00    1.33     0.00    16.00
> 12.00     0.03   25.00  19.75   2.63
> sda               0.00     0.00    0.00    0.67     0.00     8.00
> 12.00     0.01   19.50  12.50   0.83
> 
> The PID of the ceph-osd taht is running on sdc is 2053 and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2053 root      20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
>  5447 root      20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri
> 
> In the latencytop output you can see that those processes have a much
> higher latency, than the other ceph-osd and btrfs-endio-writers.
> 

I'm seeing a lot of this

        [schedule]      1654.6 msec         96.4 %
                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
                generic_write_sync blkdev_aio_write do_sync_readv_writev
                do_readv_writev vfs_writev sys_writev system_call_fastpath

where ceph-osd's latency is mostly coming from this fsync of a block device
directly, and not so much being tied up by btrfs directly.  With 22% CPU being
taken up by btrfs-endio-wri we must be doing something wrong.  Can you run perf
record -ag when this is going on and then perf report so we can see what
btrfs-endio-wri is doing with the cpu.  You can drill down in perf report to get
only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
horribly wrong or introducing a lot of latency.  Most of it seems to be when
running the delayed refs and having to read in blocks.  I've been suspecting for
a while that the delayed ref stuff ends up doing way more work than it needs to
be per task, and it's possible that btrfs-endio-wri is simply getting screwed by
other people doing work.

At this point it seems like the biggest problem with latency in ceph-osd is not
related to btrfs, the latency seems to all be from the fact that ceph-osd is
fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems like
it's blowing a lot of CPU time, so perf record -ag is probably going to be your
best bet when it's using lots of cpu so we can figure out what it's spinning on.
Thanks,

Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-25 12:23           ` Josef Bacik
  0 siblings, 0 replies; 47+ messages in thread
From: Josef Bacik @ 2011-10-25 12:23 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Josef Bacik, Sage Weil, ceph-devel, linux-btrfs

On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik <josef@redhat.com>:
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> [adding linux-btrfs to cc]
> >>
> >> Josef, Chris, any ideas on the below issues?
> >>
> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >
> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> > but then again the load increases. This time, I can see that the
> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> > than usual.
> >>
> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> not be all that special, though, so this problem shouldn't be unique to
> >> ceph.
> >>
> >
> > Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
> > is up to.
> 
> Capturing this seems to be not easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stacktrace of
> btrfs-endio-write. What I have is a "latencytop -c" output which is
> interesting:
> 
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> an nearly equal load. At the moment one filesystem seems to have a
> problem. When running with iostat I see the following
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
> avgrq-sz avgqu-sz   await  svctm  %util
> sdd               0.00     0.00    0.00    4.33     0.00    53.33
> 12.31     0.08   19.38  12.23   5.30
> sdc               0.00     1.00    0.00  228.33     0.00  1957.33
> 8.57    74.33  380.76   2.74  62.57
> sdb               0.00     0.00    0.00    1.33     0.00    16.00
> 12.00     0.03   25.00  19.75   2.63
> sda               0.00     0.00    0.00    0.67     0.00     8.00
> 12.00     0.01   19.50  12.50   0.83
> 
> The PID of the ceph-osd taht is running on sdc is 2053 and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2053 root      20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
>  5447 root      20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri
> 
> In the latencytop output you can see that those processes have a much
> higher latency, than the other ceph-osd and btrfs-endio-writers.
> 

I'm seeing a lot of this

        [schedule]      1654.6 msec         96.4 %
                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
                generic_write_sync blkdev_aio_write do_sync_readv_writev
                do_readv_writev vfs_writev sys_writev system_call_fastpath

where ceph-osd's latency is mostly coming from this fsync of a block device
directly, and not so much being tied up by btrfs directly.  With 22% CPU being
taken up by btrfs-endio-wri we must be doing something wrong.  Can you run perf
record -ag when this is going on and then perf report so we can see what
btrfs-endio-wri is doing with the cpu.  You can drill down in perf report to get
only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
horribly wrong or introducing a lot of latency.  Most of it seems to be when
running the delayed refs and having to read in blocks.  I've been suspecting for
a while that the delayed ref stuff ends up doing way more work than it needs to
be per task, and it's possible that btrfs-endio-wri is simply getting screwed by
other people doing work.

At this point it seems like the biggest problem with latency in ceph-osd is not
related to btrfs, the latency seems to all be from the fact that ceph-osd is
fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems like
it's blowing a lot of CPU time, so perf record -ag is probably going to be your
best bet when it's using lots of cpu so we can figure out what it's spinning on.
Thanks,

Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-25 12:23           ` Josef Bacik
  (?)
@ 2011-10-25 14:25           ` Christian Brunner
  2011-10-25 15:00               ` Josef Bacik
  2011-10-25 15:05               ` Josef Bacik
  -1 siblings, 2 replies; 47+ messages in thread
From: Christian Brunner @ 2011-10-25 14:25 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Sage Weil, ceph-devel, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4510 bytes --]

2011/10/25 Josef Bacik <josef@redhat.com>:
> On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
>> 2011/10/24 Josef Bacik <josef@redhat.com>:
>> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>> >> [adding linux-btrfs to cc]
>> >>
>> >> Josef, Chris, any ideas on the below issues?
>> >>
>> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
>> >> >
>> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
>> >> > slightly better. I can run an OSD for about 3 days without problems,
>> >> > but then again the load increases. This time, I can see that the
>> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
>> >> > than usual.
>> >>
>> >> FYI in this scenario you're exposed to the same journal replay issues that
>> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
>> >> not be all that special, though, so this problem shouldn't be unique to
>> >> ceph.
>> >>
>> >
>> > Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
>> > is up to.
>>
>> Capturing this seems to be not easy. I have a few traces (see
>> attachment), but with sysrq+w I do not get a stacktrace of
>> btrfs-endio-write. What I have is a "latencytop -c" output which is
>> interesting:
>>
>> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
>> tries to balance the load over all OSDs, so all filesystems should get
>> an nearly equal load. At the moment one filesystem seems to have a
>> problem. When running with iostat I see the following
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> avgrq-sz avgqu-sz   await  svctm  %util
>> sdd               0.00     0.00    0.00    4.33     0.00    53.33
>> 12.31     0.08   19.38  12.23   5.30
>> sdc               0.00     1.00    0.00  228.33     0.00  1957.33
>> 8.57    74.33  380.76   2.74  62.57
>> sdb               0.00     0.00    0.00    1.33     0.00    16.00
>> 12.00     0.03   25.00 19.75 2.63
>> sda               0.00     0.00    0.00    0.67     0.00     8.00
>> 12.00     0.01   19.50  12.50   0.83
>>
>> The PID of the ceph-osd taht is running on sdc is 2053 and when I look
>> with top I see this process and a btrfs-endio-writer (PID 5447):
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
>>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
>>
>> In the latencytop output you can see that those processes have a much
>> higher latency, than the other ceph-osd and btrfs-endio-writers.
>>
>
> I'm seeing a lot of this
>
>        [schedule]      1654.6 msec         96.4 %
>                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
>                generic_write_sync blkdev_aio_write do_sync_readv_writev
>                do_readv_writev vfs_writev sys_writev system_call_fastpath
>
> where ceph-osd's latency is mostly coming from this fsync of a block device
> directly, and not so much being tied up by btrfs directly.  With 22% CPU being
> taken up by btrfs-endio-wri we must be doing something wrong.  Can you run perf
> record -ag when this is going on and then perf report so we can see what
> btrfs-endio-wri is doing with the cpu.  You can drill down in perf report to get
> only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
> of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
> horribly wrong or introducing a lot of latency.  Most of it seems to be when
> running the dleayed refs and having to read in blocks.  I've been suspecting for
> a while that the delayed ref stuff ends up doing way more work than it needs to
> be per task, and it's possible that btrfs-endio-wri is simply getting screwed by
> other people doing work.
>
> At this point it seems like the biggest problem with latency in ceph-osd is not
> related to btrfs, the latency seems to all be from the fact that ceph-osd is
> fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems like
> its blowing a lot of CPU time, so perf record -ag is probably going to be your
> best bet when it's using lots of cpu so we can figure out what it's spinning on.

Attached is a perf-report. I have included the whole report, so that
you can see the difference between the good and the bad
btrfs-endio-wri.

Thanks,
Christian

[-- Attachment #2: perf.report.bz2 --]
[-- Type: application/x-bzip2, Size: 30840 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-25 14:25           ` Christian Brunner
@ 2011-10-25 15:00               ` Josef Bacik
  2011-10-25 15:05               ` Josef Bacik
  1 sibling, 0 replies; 47+ messages in thread
From: Josef Bacik @ 2011-10-25 15:00 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Josef Bacik, Sage Weil, ceph-devel, linux-btrfs

On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> 2011/10/25 Josef Bacik <josef@redhat.com>:
> > On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> >> 2011/10/24 Josef Bacik <josef@redhat.com>:
> >> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> >> [adding linux-btrfs to cc]
> >> >>
> >> >> Josef, Chris, any ideas on the below issues?
> >> >>
> >> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >> >
> >> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> >> > but then again the load increases. This time, I can see that the
> >> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> >> > than usual.
> >> >>
> >> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> >> not be all that special, though, so this problem shouldn't be unique to
> >> >> ceph.
> >> >>
> >> >
> >> > Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
> >> > is up to.
> >>
> >> Capturing this seems to be not easy. I have a few traces (see
> >> attachment), but with sysrq+w I do not get a stacktrace of
> >> btrfs-endio-write. What I have is a "latencytop -c" output which is
> >> interesting:
> >>
> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> >> tries to balance the load over all OSDs, so all filesystems should get
> >> an nearly equal load. At the moment one filesystem seems to have a
> >> problem. When running with iostat I see the following
> >>
> >> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
> >> avgrq-sz avgqu-sz   await  svctm  %util
> >> sdd               0.00     0.00    0.00    4.33     0.00    53.33
> >> 12.31     0.08   19.38  12.23   5.30
> >> sdc               0.00     1.00    0.00  228.33     0.00  1957.33
> >> 8.57    74.33  380.76   2.74  62.57
> >> sdb               0.00     0.00    0.00    1.33     0.00    16.00
> >> 12.00     0.03   25.00 19.75 2.63
> >> sda               0.00     0.00    0.00    0.67     0.00     8.00
> >> 12.00     0.01   19.50  12.50   0.83
> >>
> >> The PID of the ceph-osd taht is running on sdc is 2053 and when I look
> >> with top I see this process and a btrfs-endio-writer (PID 5447):
> >>
> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
> >>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
> >>
> >> In the latencytop output you can see that those processes have a much
> >> higher latency, than the other ceph-osd and btrfs-endio-writers.
> >>
> >
> > I'm seeing a lot of this
> >
> >        [schedule]      1654.6 msec         96.4 %
> >                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
> >                generic_write_sync blkdev_aio_write do_sync_readv_writev
> >                do_readv_writev vfs_writev sys_writev system_call_fastpath
> >
> > where ceph-osd's latency is mostly coming from this fsync of a block device
> > directly, and not so much being tied up by btrfs directly.  With 22% CPU being
> > taken up by btrfs-endio-wri we must be doing something wrong.  Can you run perf
> > record -ag when this is going on and then perf report so we can see what
> > btrfs-endio-wri is doing with the cpu.  You can drill down in perf report to get
> > only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
> > of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
> > horribly wrong or introducing a lot of latency.  Most of it seems to be when
> > running the dleayed refs and having to read in blocks.  I've been suspecting for
> > a while that the delayed ref stuff ends up doing way more work than it needs to
> > be per task, and it's possible that btrfs-endio-wri is simply getting screwed by
> > other people doing work.
> >
> > At this point it seems like the biggest problem with latency in ceph-osd is not
> > related to btrfs, the latency seems to all be from the fact that ceph-osd is
> > fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems like
> > its blowing a lot of CPU time, so perf record -ag is probably going to be your
> > best bet when it's using lots of cpu so we can figure out what it's spinning on.
> 
> Attached is a perf-report. I have included the whole report, so that
> you can see the difference between the good and the bad
> btrfs-endio-wri.
>

Oh shit we're inserting xattrs in endio, that's not good.  I'll look more into
this when I get back home but this is definitely a problem, we're doing a lot
more work in endio than we should.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-25 15:00               ` Josef Bacik
  0 siblings, 0 replies; 47+ messages in thread
From: Josef Bacik @ 2011-10-25 15:00 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Josef Bacik, Sage Weil, ceph-devel, linux-btrfs

On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> 2011/10/25 Josef Bacik <josef@redhat.com>:
> > On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> >> 2011/10/24 Josef Bacik <josef@redhat.com>:
> >> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> >> [adding linux-btrfs to cc]
> >> >>
> >> >> Josef, Chris, any ideas on the below issues?
> >> >>
> >> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >> >
> >> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> >> > but then again the load increases. This time, I can see that the
> >> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> >> > than usual.
> >> >>
> >> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> >> not be all that special, though, so this problem shouldn't be unique to
> >> >> ceph.
> >> >>
> >> >
> >> > Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
> >> > is up to.
> >>
> >> Capturing this seems to be not easy. I have a few traces (see
> >> attachment), but with sysrq+w I do not get a stacktrace of
> >> btrfs-endio-write. What I have is a "latencytop -c" output which is
> >> interesting:
> >>
> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> >> tries to balance the load over all OSDs, so all filesystems should get
> >> a nearly equal load. At the moment one filesystem seems to have a
> >> problem. When running iostat I see the following:
> >>
> >> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
> >> avgrq-sz avgqu-sz   await  svctm  %util
> >> sdd               0.00     0.00    0.00    4.33     0.00    53.33
> >> 12.31     0.08   19.38  12.23   5.30
> >> sdc               0.00     1.00    0.00  228.33     0.00  1957.33
> >> 8.57    74.33  380.76   2.74  62.57
> >> sdb               0.00     0.00    0.00    1.33     0.00    16.00
> >> 12.00     0.03   25.00 19.75 2.63
> >> sda               0.00     0.00    0.00    0.67     0.00     8.00
> >> 12.00     0.01   19.50  12.50   0.83
> >>
> >> The PID of the ceph-osd that is running on sdc is 2053 and when I look
> >> with top I see this process and a btrfs-endio-writer (PID 5447):
> >>
> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
> >>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
> >>
> >> In the latencytop output you can see that those processes have a much
> >> higher latency than the other ceph-osd and btrfs-endio-writers.
> >>
> >
> > I'm seeing a lot of this
> >
> >        [schedule]      1654.6 msec         96.4 %
> >                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
> >                generic_write_sync blkdev_aio_write do_sync_readv_writev
> >                do_readv_writev vfs_writev sys_writev system_call_fastpath
> >
> > where ceph-osd's latency is mostly coming from this fsync of a block device
> > directly, and not so much from being tied up by btrfs itself.  With 22% CPU being
> > taken up by btrfs-endio-wri we must be doing something wrong.  Can you run perf
> > record -ag when this is going on and then perf report so we can see what
> > btrfs-endio-wri is doing with the cpu?  You can drill down in perf report to get
> > only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
> > of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
> > horribly wrong or introducing a lot of latency.  Most of it seems to be when
> > running the delayed refs and having to read in blocks.  I've been suspecting for
> > a while that the delayed ref stuff ends up doing way more work than it needs to
> > per task, and it's possible that btrfs-endio-wri is simply getting screwed by
> > other people doing work.
> >
> > At this point it seems like the biggest problem with latency in ceph-osd is not
> > related to btrfs, the latency seems to all be from the fact that ceph-osd is
> > fsyncing a block dev for whatever reason.  As for btrfs-endio-wri, it seems like
> > it's blowing a lot of CPU time, so perf record -ag is probably going to be your
> > best bet when it's using lots of cpu so we can figure out what it's spinning on.
> 
> Attached is a perf-report. I have included the whole report, so that
> you can see the difference between the good and the bad
> btrfs-endio-wri.
>

Oh shit, we're inserting xattrs in endio; that's not good.  I'll look more into
this when I get back home, but this is definitely a problem: we're doing a lot
more work in endio than we should.  Thanks,

Josef 

^ permalink raw reply	[flat|nested] 47+ messages in thread
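
A minimal sketch of the diagnostics being requested in the message above (a
sysrq+w blocked-task dump plus a system-wide perf profile with call graphs),
assuming root access on the affected OSD host; the 30-second window and the
output file names are arbitrary choices, not anything specified in the thread:

    # dump blocked-task stack traces (sysrq+w) into the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 300 > sysrq-w.txt

    # system-wide profile with call graphs while the latency spike is happening
    perf record -a -g -o perf.data sleep 30
    perf report -i perf.data      # then drill down into the btrfs-endio-wri entries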

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-25 15:05               ` Josef Bacik
  0 siblings, 0 replies; 47+ messages in thread
From: Josef Bacik @ 2011-10-25 15:05 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Josef Bacik, Sage Weil, ceph-devel, linux-btrfs

On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> 2011/10/25 Josef Bacik <josef@redhat.com>:
> > On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> >> 2011/10/24 Josef Bacik <josef@redhat.com>:
> >> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> >> [adding linux-btrfs to cc]
> >> >>
> >> >> Josef, Chris, any ideas on the below issues?
> >> >>
> >> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >> >
> >> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> >> > but then again the load increases. This time, I can see that the
> >> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> >> > than usual.
> >> >>
> >> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> >> not be all that special, though, so this problem shouldn't be unique to
> >> >> ceph.
> >> >>
> >> >
> >> > Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
> >> > is up to.
> >>
> >> Capturing this seems to be not easy. I have a few traces (see
> >> attachment), but with sysrq+w I do not get a stacktrace of
> >> btrfs-endio-write. What I have is a "latencytop -c" output which is
> >> interesting:
> >>
> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> >> tries to balance the load over all OSDs, so all filesystems should get
> >> a nearly equal load. At the moment one filesystem seems to have a
> >> problem. When running iostat I see the following:
> >>
> >> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
> >> avgrq-sz avgqu-sz   await  svctm  %util
> >> sdd               0.00     0.00    0.00    4.33     0.00    53.33
> >> 12.31     0.08   19.38  12.23   5.30
> >> sdc               0.00     1.00    0.00  228.33     0.00  1957.33
> >> 8.57    74.33  380.76   2.74  62.57
> >> sdb               0.00     0.00    0.00    1.33     0.00    16.00
> >> 12.00     0.03   25.00 19.75 2.63
> >> sda               0.00     0.00    0.00    0.67     0.00     8.00
> >> 12.00     0.01   19.50  12.50   0.83
> >>
> >> The PID of the ceph-osd that is running on sdc is 2053 and when I look
> >> with top I see this process and a btrfs-endio-writer (PID 5447):
> >>
> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
> >>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
> >>
> >> In the latencytop output you can see that those processes have a much
> >> higher latency than the other ceph-osd and btrfs-endio-writers.
> >>
> >
> > I'm seeing a lot of this
> >
> >        [schedule]      1654.6 msec         96.4 %
> >                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
> >                generic_write_sync blkdev_aio_write do_sync_readv_writev
> >                do_readv_writev vfs_writev sys_writev system_call_fastpath
> >
> > where ceph-osd's latency is mostly coming from this fsync of a block device
> > directly, and not so much from being tied up by btrfs itself.  With 22% CPU being
> > taken up by btrfs-endio-wri we must be doing something wrong.  Can you run perf
> > record -ag when this is going on and then perf report so we can see what
> > btrfs-endio-wri is doing with the cpu?  You can drill down in perf report to get
> > only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
> > of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
> > horribly wrong or introducing a lot of latency.  Most of it seems to be when
> > running the delayed refs and having to read in blocks.  I've been suspecting for
> > a while that the delayed ref stuff ends up doing way more work than it needs to
> > per task, and it's possible that btrfs-endio-wri is simply getting screwed by
> > other people doing work.
> >
> > At this point it seems like the biggest problem with latency in ceph-osd is not
> > related to btrfs, the latency seems to all be from the fact that ceph-osd is
> > fsyncing a block dev for whatever reason.  As for btrfs-endio-wri, it seems like
> > it's blowing a lot of CPU time, so perf record -ag is probably going to be your
> > best bet when it's using lots of cpu so we can figure out what it's spinning on.
> 
> Attached is a perf-report. I have included the whole report, so that
> you can see the difference between the good and the bad
> btrfs-endio-wri.
>

We also shouldn't be running run_ordered_operations, man this is screwed up,
thanks so much for this, I should be able to nail this down pretty easily.
Thanks,

Josef 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-25 15:13                 ` Christian Brunner
  0 siblings, 0 replies; 47+ messages in thread
From: Christian Brunner @ 2011-10-25 15:13 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Sage Weil, ceph-devel, linux-btrfs

2011/10/25 Josef Bacik <josef@redhat.com>:
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>> 2011/10/25 Josef Bacik <josef@redhat.com>:
>> > On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
[...]
>> >>
>> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
>> >> tries to balance the load over all OSDs, so all filesystems should get
>> >> a nearly equal load. At the moment one filesystem seems to have a
>> >> problem. When running iostat I see the following:
>> >>
>> >> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> >> avgrq-sz avgqu-sz   await  svctm  %util
>> >> sdd               0.00     0.00    0.00    4.33     0.00    53.33
>> >> 12.31     0.08   19.38  12.23   5.30
>> >> sdc               0.00     1.00    0.00  228.33     0.00  1957.33
>> >> 8.57    74.33  380.76   2.74  62.57
>> >> sdb               0.00     0.00    0.00    1.33     0.00    16.00
>> >> 12.00     0.03   25.00 19.75 2.63
>> >> sda               0.00     0.00    0.00    0.67     0.00     8.00
>> >> 12.00     0.01   19.50  12.50   0.83
>> >>
>> >> The PID of the ceph-osd that is running on sdc is 2053 and when I look
>> >> with top I see this process and a btrfs-endio-writer (PID 5447):
>> >>
>> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> >>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
>> >>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
>> >>
>> >> In the latencytop output you can see that those processes have a much
>> >> higher latency than the other ceph-osd and btrfs-endio-writers.
>> >>
>> >
>> > I'm seeing a lot of this
>> >
>> >        [schedule]      1654.6 msec         96.4 %
>> >                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
>> >                generic_write_sync blkdev_aio_write do_sync_readv_writev
>> >                do_readv_writev vfs_writev sys_writev system_call_fastpath
>> >
>> > where ceph-osd's latency is mostly coming from this fsync of a block device
>> > directly, and not so much from being tied up by btrfs itself.  With 22% CPU being
>> > taken up by btrfs-endio-wri we must be doing something wrong.  Can you run perf
>> > record -ag when this is going on and then perf report so we can see what
>> > btrfs-endio-wri is doing with the cpu?  You can drill down in perf report to get
>> > only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
>> > of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
>> > horribly wrong or introducing a lot of latency.  Most of it seems to be when
>> > running the delayed refs and having to read in blocks.  I've been suspecting for
>> > a while that the delayed ref stuff ends up doing way more work than it needs to
>> > per task, and it's possible that btrfs-endio-wri is simply getting screwed by
>> > other people doing work.
>> >
>> > At this point it seems like the biggest problem with latency in ceph-osd is not
>> > related to btrfs, the latency seems to all be from the fact that ceph-osd is
>> > fsyncing a block dev for whatever reason.  As for btrfs-endio-wri, it seems like
>> > it's blowing a lot of CPU time, so perf record -ag is probably going to be your
>> > best bet when it's using lots of cpu so we can figure out what it's spinning on.
>>
>> Attached is a perf-report. I have included the whole report, so that
>> you can see the difference between the good and the bad
>> btrfs-endio-wri.
>>
>
> We also shouldn't be running run_ordered_operations, man this is screwed up,
> thanks so much for this, I should be able to nail this down pretty easily.

Please note that this is with "btrfs snaps disabled" in the ceph conf.
When I enable snaps our problems get worse (the btrfs-cleaner thing),
but I would be glad if this one thing gets solved. I can run debugging
with snaps enabled, if you want, but I would suggest that we do this
afterwards.

Thanks,
Christian

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-25 10:23     ` Christoph Hellwig
@ 2011-10-25 16:23       ` Sage Weil
  0 siblings, 0 replies; 47+ messages in thread
From: Sage Weil @ 2011-10-25 16:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Christian Brunner, ceph-devel, linux-btrfs

On Tue, 25 Oct 2011, Christoph Hellwig wrote:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > > - When I run ceph with btrfs snaps disabled, the situation is getting
> > > slightly better. I can run an OSD for about 3 days without problems,
> > > but then again the load increases. This time, I can see that the
> > > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > > than usual.
> > 
> > FYI in this scenario you're exposed to the same journal replay issues that 
> > ext4 and XFS are.  The btrfs workload that ceph is generating will also 
> > not be all that special, though, so this problem shouldn't be unique to 
> > ceph.
> 
> What journal replay issues would ext4 and XFS be exposed to?

It's the ceph-osd journal replay, not the ext4/XFS journal... the #2 
item in

	http://marc.info/?l=ceph-devel&m=131942130322957&w=2

sage

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-25 12:23           ` Josef Bacik
  (?)
  (?)
@ 2011-10-25 16:36           ` Sage Weil
  2011-10-25 19:09               ` Christian Brunner
  -1 siblings, 1 reply; 47+ messages in thread
From: Sage Weil @ 2011-10-25 16:36 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Josef Bacik, ceph-devel, linux-btrfs

On Tue, 25 Oct 2011, Josef Bacik wrote:
> At this point it seems like the biggest problem with latency in ceph-osd 
> is not related to btrfs, the latency seems to all be from the fact that 
> ceph-osd is fsyncing a block dev for whatever reason. 

There is one place where we sync_file_range() on the journal block device, 
but that should only happen if directio is disabled (it's on by default).  

Christian, have you tweaked those settings in your ceph.conf?  It would be 
something like 'journal dio = false'.  If not, can you verify that 
directio shows true when the journal is initialized from your osd log?  
E.g.,

 2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1

If directio = 1 for you, something else funky is causing those 
blkdev_fsync's...

sage

^ permalink raw reply	[flat|nested] 47+ messages in thread
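
A sketch of the check Sage describes, under two assumptions: the OSD logs go to
syslog (the log line Christian posts further down looks that way) and ceph.conf
lives in /etc/ceph/ceph.conf. 'journal dio' defaults to true, so it only
matters if an override is present:

    # the journal open line reports "directio = 1" when direct I/O is in effect
    grep "journal _open" /var/log/messages | tail -n 4

    # any "journal dio = false" override in the config would disable it
    grep -ni "journal dio" /etc/ceph/ceph.conf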

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-25 19:09               ` Christian Brunner
  0 siblings, 0 replies; 47+ messages in thread
From: Christian Brunner @ 2011-10-25 19:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josef Bacik, ceph-devel, linux-btrfs

2011/10/25 Sage Weil <sage@newdream.net>:
> On Tue, 25 Oct 2011, Josef Bacik wrote:
>> At this point it seems like the biggest problem with latency in ceph-osd
>> is not related to btrfs, the latency seems to all be from the fact that
>> ceph-osd is fsyncing a block dev for whatever reason.
>
> There is one place where we sync_file_range() on the journal block device,
> but that should only happen if directio is disabled (it's on by default).
>
> Christian, have you tweaked those settings in your ceph.conf?  It would be
> something like 'journal dio = false'.  If not, can you verify that
> directio shows true when the journal is initialized from your osd log?
> E.g.,
>
>  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
>
> If directio = 1 for you, something else funky is causing those
> blkdev_fsync's...

I've looked it up in the logs - directio is 1:

Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
/dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
bytes, directio = 1

Regards,
Christian

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-25 15:05               ` Josef Bacik
  (?)
  (?)
@ 2011-10-25 20:15               ` Chris Mason
  2011-10-25 20:22                 ` Josef Bacik
  -1 siblings, 1 reply; 47+ messages in thread
From: Chris Mason @ 2011-10-25 20:15 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, ceph-devel, linux-btrfs

On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > 
> > Attached is a perf-report. I have included the whole report, so that
> > you can see the difference between the good and the bad
> > btrfs-endio-wri.
> >
> 
> We also shouldn't be running run_ordered_operations, man this is screwed up,
> thanks so much for this, I should be able to nail this down pretty easily.
> Thanks,

Looks like we're getting there from reserve_metadata_bytes when we join
the transaction?

-chris

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-25 20:15               ` Chris Mason
@ 2011-10-25 20:22                 ` Josef Bacik
  2011-10-26  0:16                     ` Christian Brunner
  2011-10-26 13:23                   ` Chris Mason
  0 siblings, 2 replies; 47+ messages in thread
From: Josef Bacik @ 2011-10-25 20:22 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, Christian Brunner, Sage Weil,
	ceph-devel, linux-btrfs

On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > 
> > > Attached is a perf-report. I have included the whole report, so that
> > > you can see the difference between the good and the bad
> > > btrfs-endio-wri.
> > >
> > 
> > We also shouldn't be running run_ordered_operations, man this is screwed up,
> > thanks so much for this, I should be able to nail this down pretty easily.
> > Thanks,
> 
> Looks like we're getting there from reserve_metadata_bytes when we join
> the transaction?
>

We don't do reservations in the endio stuff, we assume you've reserved all the
space you need in delalloc, plus we would have seen reserve_metadata_bytes in
the trace.  Though it does look like perf is lying to us in at least one case
since btrfs_alloc_logged_file_extent is only called from log replay and not
during normal runtime, so it definitely shouldn't be showing up.  Thanks,

Josef 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-25 19:09               ` Christian Brunner
  (?)
@ 2011-10-25 22:27               ` Sage Weil
  -1 siblings, 0 replies; 47+ messages in thread
From: Sage Weil @ 2011-10-25 22:27 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Josef Bacik, ceph-devel, linux-btrfs

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1329 bytes --]

On Tue, 25 Oct 2011, Christian Brunner wrote:
> 2011/10/25 Sage Weil <sage@newdream.net>:
> > On Tue, 25 Oct 2011, Josef Bacik wrote:
> >> At this point it seems like the biggest problem with latency in ceph-osd
> >> is not related to btrfs, the latency seems to all be from the fact that
> >> ceph-osd is fsyncing a block dev for whatever reason.
> >
> > There is one place where we sync_file_range() on the journal block device,
> > but that should only happen if directio is disabled (it's on by default).
> >
> > Christian, have you tweaked those settings in your ceph.conf?  It would be
> > something like 'journal dio = false'.  If not, can you verify that
> > directio shows true when the journal is initialized from your osd log?
> > E.g.,
> >
> >  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
> >
> > If directio = 1 for you, something else funky is causing those
> > blkdev_fsync's...
> 
> I've looked it up in the logs - directio is 1:
> 
> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
> bytes, directio = 1

Do you mind capturing an strace?  I'd like to see where that blkdev_fsync 
is coming from.

thanks!
sage

^ permalink raw reply	[flat|nested] 47+ messages in thread
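
One way to capture what Sage is asking for here: the latencytop stack earlier
in the thread points at synchronous writev on the journal block device, so the
trace below follows the ceph-osd threads and records only the write/sync
syscalls. The PID (2053, from the earlier top output) and the output file name
are placeholders:

    # -f follows all threads; -tt/-T add wall-clock timestamps and per-call durations
    strace -f -tt -T -e trace=writev,fsync,fdatasync,sync_file_range \
           -p 2053 -o ceph-osd-sync.strace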

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-26  0:16                     ` Christian Brunner
  0 siblings, 0 replies; 47+ messages in thread
From: Christian Brunner @ 2011-10-26  0:16 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Chris Mason, Sage Weil, ceph-devel, linux-btrfs

2011/10/25 Josef Bacik <josef@redhat.com>:
> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
>> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
>> > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>> > >
>> > > Attached is a perf-report. I have included the whole report, so that
>> > > you can see the difference between the good and the bad
>> > > btrfs-endio-wri.
>> > >
>> >
>> > We also shouldn't be running run_ordered_operations, man this is screwed up,
>> > thanks so much for this, I should be able to nail this down pretty easily.
>> > Thanks,
>>
>> Looks like we're getting there from reserve_metadata_bytes when we join
>> the transaction?
>>
>
> We don't do reservations in the endio stuff, we assume you've reserved all the
> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> the trace.  Though it does look like perf is lying to us in at least one case
> since btrfs_alloc_logged_file_extent is only called from log replay and not
> during normal runtime, so it definitely shouldn't be showing up.  Thanks,

Strange! - I'll check if symbols got messed up in the report tomorrow.

Christian

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-26  8:21                       ` Christian Brunner
  0 siblings, 0 replies; 47+ messages in thread
From: Christian Brunner @ 2011-10-26  8:21 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Chris Mason, Sage Weil, ceph-devel, linux-btrfs

2011/10/26 Christian Brunner <chb@muc.de>:
> 2011/10/25 Josef Bacik <josef@redhat.com>:
>> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
>>> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
>>> > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>>> > >
>>> > > Attached is a perf-report. I have included the whole report, so that
>>> > > you can see the difference between the good and the bad
>>> > > btrfs-endio-wri.
>>> > >
>>> >
>>> > We also shouldn't be running run_ordered_operations, man this is screwed up,
>>> > thanks so much for this, I should be able to nail this down pretty easily.
>>> > Thanks,
>>>
>>> Looks like we're getting there from reserve_metadata_bytes when we join
>>> the transaction?
>>>
>>
>> We don't do reservations in the endio stuff, we assume you've reserved all the
>> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
>> the trace.  Though it does look like perf is lying to us in at least one case
>> sicne btrfs_alloc_logged_file_extent is only called from log replay and not
>> during normal runtime, so it definitely shouldn't be showing up.  Thanks,
>
> Strange! - I'll check if symbols got messed up in the report tomorrow.

I've checked this now: except for the missing symbols for the iomemory_vsl
module, everything is looking normal.

I've also run the report on another OSD again, but the results look
quite similar.

Regards,
Christian

PS: This is what perf report -v is saying...

build id event received for [kernel.kallsyms]:
805ca93f4057cc0c8f53b061a849b3f847f2de40
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/fs/btrfs/btrfs.ko:
64a723e05af3908fb9593f4a3401d6563cb1a01b
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/lib/libcrc32c.ko:
b1391be8d33b54b6de20e07b7f2ee8d777fc09d2
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/bonding/bonding.ko:
663392df0f407211ab8f9527c482d54fce890c5e
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/scsi/hpsa.ko:
676eecffd476aef1b0f2f8c1bf8c8e6120d369c9
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko:
db7c200894b27e71ae6fe5cf7adaebf787c90da9
build id event received for [iomemory_vsl]:
4ed417c9a815e6bbe77a1656bceda95d9f06cb13
build id event received for /lib64/libc-2.12.so:
2ab28d41242ede641418966ef08f9aacffd9e8c7
build id event received for /lib64/libpthread-2.12.so:
c177389a6f119b3883ea0b3c33cb04df3f8e5cc7
build id event received for /sbin/rsyslogd:
1372ef1e2ec550967fe20d0bdddbc0aab0bb36dc
build id event received for /lib64/libglib-2.0.so.0.2200.5:
d880be15bf992b5fbcc629e6bbf1c747a928ddd5
build id event received for /usr/sbin/irqbalance:
842de64f46ca9fde55efa29a793c08b197d58354
build id event received for /lib64/libm-2.12.so:
46ac89195918407d2937bd1450c0ec99c8d41a2a
build id event received for /usr/bin/ceph-osd:
9fcb36e020c49fc49171b4c88bd784b38eb0675b
build id event received for /usr/lib64/libstdc++.so.6.0.13:
d1b2ca4e1ec8f81ba820e5f1375d960107ac7e50
build id event received for /usr/lib64/libtcmalloc.so.0.2.0:
02766551b2eb5a453f003daee0c5fc9cd176e831
Looking at the vmlinux_path (6 entries long)
dso__load_sym: cannot get elf header.
Using /proc/kallsyms for symbols
Looking at the vmlinux_path (6 entries long)
No kallsyms or vmlinux with build-id
4ed417c9a815e6bbe77a1656bceda95d9f06cb13 was found
[iomemory_vsl] with build id 4ed417c9a815e6bbe77a1656bceda95d9f06cb13
not found, continuing without symbols

^ permalink raw reply	[flat|nested] 47+ messages in thread
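
If the working theory is that perf is misattributing samples, a couple of
checks that may help rule out symbol trouble; the module path is the one from
the build-id list above, and perf.data is assumed to be the recording under
discussion:

    # show which file each build-id in the recording resolved against
    perf buildid-list -i perf.data

    # add the exact btrfs.ko that was running to the build-id cache, then re-check
    perf buildid-cache -a /lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/fs/btrfs/btrfs.ko
    perf report -i perf.data -v

    # annotate the suspicious symbol to see whether samples really land in it
    perf annotate -i perf.data btrfs_alloc_logged_file_extent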

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-25 20:22                 ` Josef Bacik
  2011-10-26  0:16                     ` Christian Brunner
@ 2011-10-26 13:23                   ` Chris Mason
  2011-10-27 15:07                     ` Josef Bacik
  1 sibling, 1 reply; 47+ messages in thread
From: Chris Mason @ 2011-10-26 13:23 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, ceph-devel, linux-btrfs

On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote:
> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > > 
> > > > Attached is a perf-report. I have included the whole report, so that
> > > > you can see the difference between the good and the bad
> > > > btrfs-endio-wri.
> > > >
> > > 
> > > We also shouldn't be running run_ordered_operations, man this is screwed up,
> > > thanks so much for this, I should be able to nail this down pretty easily.
> > > Thanks,
> > 
> > Looks like we're getting there from reserve_metadata_bytes when we join
> > the transaction?
> >
> 
> We don't do reservations in the endio stuff, we assume you've reserved all the
> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> the trace.  Though it does look like perf is lying to us in at least one case
> > since btrfs_alloc_logged_file_extent is only called from log replay and not
> during normal runtime, so it definitely shouldn't be showing up.  Thanks,

Whoops, I should have read that num_items > 0 check harder.

btrfs_end_transaction is doing it by setting ->blocked = 1

        if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
            should_end_transaction(trans, root)) {
                trans->transaction->blocked = 1;
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                smp_wmb();
        }

       if (lock && cur_trans->blocked && !cur_trans->in_commit) {
                   ^^^^^^^^^^^^^^^^^^^
                if (throttle) {
                        /*
                         * We may race with somebody else here so end up having
                         * to call end_transaction on ourselves again, so inc
                         * our use_count.
                         */
                        trans->use_count++;
                        return btrfs_commit_transaction(trans, root);
                } else {
                        wake_up_process(info->transaction_kthread);
                }
        }

perf is definitely lying a little bit about the trace ;)

-chris


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-26 13:23                   ` Chris Mason
@ 2011-10-27 15:07                     ` Josef Bacik
  2011-10-27 18:14                       ` Josef Bacik
  0 siblings, 1 reply; 47+ messages in thread
From: Josef Bacik @ 2011-10-27 15:07 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, Christian Brunner, Sage Weil,
	ceph-devel, linux-btrfs

On Wed, Oct 26, 2011 at 09:23:54AM -0400, Chris Mason wrote:
> On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote:
> > On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> > > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > > > 
> > > > > Attached is a perf-report. I have included the whole report, so that
> > > > > you can see the difference between the good and the bad
> > > > > btrfs-endio-wri.
> > > > >
> > > > 
> > > > We also shouldn't be running run_ordered_operations, man this is screwed up,
> > > > thanks so much for this, I should be able to nail this down pretty easily.
> > > > Thanks,
> > > 
> > > Looks like we're getting there from reserve_metadata_bytes when we join
> > > the transaction?
> > >
> > 
> > We don't do reservations in the endio stuff, we assume you've reserved all the
> > space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> > the trace.  Though it does look like perf is lying to us in at least one case
> > since btrfs_alloc_logged_file_extent is only called from log replay and not
> > during normal runtime, so it definitely shouldn't be showing up.  Thanks,
> 
> Whoops, I should have read that num_items > 0 check harder.
> 
> btrfs_end_transaction is doing it by setting ->blocked = 1
> 
>         if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
>             should_end_transaction(trans, root)) {
>                 trans->transaction->blocked = 1;
> 		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                 smp_wmb();
>         }
> 
>        if (lock && cur_trans->blocked && !cur_trans->in_commit) {
>                    ^^^^^^^^^^^^^^^^^^^
>                 if (throttle) {
>                         /*
>                          * We may race with somebody else here so end up having
>                          * to call end_transaction on ourselves again, so inc
>                          * our use_count.
>                          */
>                         trans->use_count++;
>                         return btrfs_commit_transaction(trans, root);
>                 } else {
>                         wake_up_process(info->transaction_kthread);
>                 }
>         }
> 

Not sure what you are getting at here?  Even if we set blocked, we're not
throttling so it will just wake up the transaction kthread, so we won't do the
commit in the endio case.  Thanks

Josef

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-27 15:07                     ` Josef Bacik
@ 2011-10-27 18:14                       ` Josef Bacik
  0 siblings, 0 replies; 47+ messages in thread
From: Josef Bacik @ 2011-10-27 18:14 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Chris Mason, Christian Brunner, Sage Weil, ceph-devel, linux-btrfs

On Thu, Oct 27, 2011 at 11:07:38AM -0400, Josef Bacik wrote:
> On Wed, Oct 26, 2011 at 09:23:54AM -0400, Chris Mason wrote:
> > On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote:
> > > On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> > > > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > > > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > > > > 
> > > > > > Attached is a perf-report. I have included the whole report, so that
> > > > > > you can see the difference between the good and the bad
> > > > > > btrfs-endio-wri.
> > > > > >
> > > > > 
> > > > > We also shouldn't be running run_ordered_operations, man this is screwed up,
> > > > > thanks so much for this, I should be able to nail this down pretty easily.
> > > > > Thanks,
> > > > 
> > > > Looks like we're getting there from reserve_metadata_bytes when we join
> > > > the transaction?
> > > >
> > > 
> > > We don't do reservations in the endio stuff, we assume you've reserved all the
> > > space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> > > the trace.  Though it does look like perf is lying to us in at least one case
> > > since btrfs_alloc_logged_file_extent is only called from log replay and not
> > > during normal runtime, so it definitely shouldn't be showing up.  Thanks,
> > 
> > Whoops, I should have read that num_items > 0 check harder.
> > 
> > btrfs_end_transaction is doing it by setting ->blocked = 1
> > 
> >         if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
> >             should_end_transaction(trans, root)) {
> >                 trans->transaction->blocked = 1;
> > 		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >                 smp_wmb();
> >         }
> > 
> >        if (lock && cur_trans->blocked && !cur_trans->in_commit) {
> >                    ^^^^^^^^^^^^^^^^^^^
> >                 if (throttle) {
> >                         /*
> >                          * We may race with somebody else here so end up having
> >                          * to call end_transaction on ourselves again, so inc
> >                          * our use_count.
> >                          */
> >                         trans->use_count++;
> >                         return btrfs_commit_transaction(trans, root);
> >                 } else {
> >                         wake_up_process(info->transaction_kthread);
> >                 }
> >         }
> > 
> 
> Not sure what you are getting at here?  Even if we set blocked, we're not
> throttling so it will just wake up the transaction kthread, so we won't do the
> commit in the endio case.  Thanks
> 

Oh I see what you were trying to say: that we'd set blocked and then commit the
transaction from the endio process, which would run ordered operations. But since
throttle isn't set, that won't happen.  I think that the perf symbols are just
lying to us.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 47+ messages in thread
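
For reference, a condensed sketch of the branch discussed above (simplified
from the quoted btrfs_end_transaction() excerpt; argument lists and locking
are elided, so this is an illustration rather than the verbatim upstream code):

	/*
	 * Tail of __btrfs_end_transaction(), condensed.  The endio path ends
	 * its handle without throttling, so even when ->blocked has been set,
	 * the commit (and with it run_ordered_operations) is left to the
	 * transaction kthread instead of running in the endio context.
	 */
	if (should_end_transaction(trans, root))
		trans->transaction->blocked = 1;	/* request a commit soon */

	if (cur_trans->blocked && !cur_trans->in_commit) {
		if (throttle) {
			/* synchronous callers commit here themselves */
			trans->use_count++;
			return btrfs_commit_transaction(trans, root);
		} else {
			/* endio: only poke the kthread and return */
			wake_up_process(info->transaction_kthread);
		}
	}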

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-27 19:52           ` Josef Bacik
  0 siblings, 0 replies; 47+ messages in thread
From: Josef Bacik @ 2011-10-27 19:52 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Josef Bacik, Sage Weil, ceph-devel, linux-btrfs

On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik <josef@redhat.com>:
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> [adding linux-btrfs to cc]
> >>
> >> Josef, Chris, any ideas on the below issues?
> >>
> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >
> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> > but then again the load increases. This time, I can see that the
> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> > than usual.
> >>
> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> not be all that special, though, so this problem shouldn't be unique to
> >> ceph.
> >>
> >
> > Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
> > is up to.
> 
> Capturing this seems to be not easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stacktrace of
> btrfs-endio-write. What I have is a "latencytop -c" output which is
> interesting:
> 
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> a nearly equal load. At the moment one filesystem seems to have a
> problem. When running with iostat I see the following
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sdd               0.00     0.00    0.00    4.33     0.00    53.33    12.31     0.08   19.38  12.23   5.30
> sdc               0.00     1.00    0.00  228.33     0.00  1957.33     8.57    74.33  380.76   2.74  62.57
> sdb               0.00     0.00    0.00    1.33     0.00    16.00    12.00     0.03   25.00  19.75   2.63
> sda               0.00     0.00    0.00    0.67     0.00     8.00    12.00     0.01   19.50  12.50   0.83
> 
> The PID of the ceph-osd that is running on sdc is 2053 and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2053 root      20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
>  5447 root      20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri
> 
> In the latencytop output you can see that those processes have a much
> higher latency, than the other ceph-osd and btrfs-endio-writers.
> 
> Regards,
> Christian

Ok just a shot in the dark, but could you give this a whirl and see if it helps
you?  Thanks

Josef


diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 125cf76..fbc196e 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -210,9 +210,9 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
 }
 
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
-			   struct list_head *cluster, u64 start)
+			   struct list_head *cluster, u64 start, unsigned long max_count)
 {
-	int count = 0;
+	unsigned long count = 0;
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct rb_node *node;
 	struct btrfs_delayed_ref_node *ref;
@@ -242,7 +242,7 @@ int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
 			node = rb_first(&delayed_refs->root);
 	}
 again:
-	while (node && count < 32) {
+	while (node && count < max_count) {
 		ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node);
 		if (btrfs_delayed_ref_is_head(ref)) {
 			head = btrfs_delayed_node_to_head(ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index e287e3b..b15a6ad 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -169,7 +169,8 @@ btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr);
 int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
 			   struct btrfs_delayed_ref_head *head);
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
-			   struct list_head *cluster, u64 search_start);
+			   struct list_head *cluster, u64 search_start,
+			   unsigned long max_count);
 /*
  * a node might live in a head or a regular ref, this lets you
  * test for the proper type to use.
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 31d84e7..c190282 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -81,6 +81,7 @@ int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
 	u32 data_size;
 
 	BUG_ON(name_len + data_len > BTRFS_MAX_XATTR_SIZE(root));
+	WARN_ON(trans->endio);
 
 	key.objectid = objectid;
 	btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4eb7d2b..0977a10 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2295,7 +2295,7 @@ again:
 		 * lock
 		 */
 		ret = btrfs_find_ref_cluster(trans, &cluster,
-					     delayed_refs->run_delayed_start);
+					     delayed_refs->run_delayed_start, count);
 		if (ret)
 			break;
 
@@ -2338,7 +2338,8 @@ again:
 			node = rb_next(node);
 		}
 		spin_unlock(&delayed_refs->lock);
-		schedule_timeout(1);
+		if (need_resched())
+			schedule_timeout(1);
 		goto again;
 	}
 out:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f12747c..73a5e66 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1752,6 +1752,7 @@ static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end)
 	else
 		trans = btrfs_join_transaction(root);
 	BUG_ON(IS_ERR(trans));
+	trans->endio = 1;
 	trans->block_rsv = &root->fs_info->delalloc_block_rsv;
 
 	if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
@@ -2057,8 +2058,11 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root)
 	LIST_HEAD(list);
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct delayed_iput *delayed;
+	struct btrfs_trans_handle *trans;
 	int empty;
 
+	trans = current->journal_info;
+	WARN_ON(trans && trans->endio);
 	spin_lock(&fs_info->delayed_iput_lock);
 	empty = list_empty(&fs_info->delayed_iputs);
 	spin_unlock(&fs_info->delayed_iput_lock);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a1c9404..ab68cfa 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -527,12 +527,15 @@ int btrfs_wait_ordered_extents(struct btrfs_root *root,
  */
 int btrfs_run_ordered_operations(struct btrfs_root *root, int wait)
 {
+	struct btrfs_trans_handle *trans;
 	struct btrfs_inode *btrfs_inode;
 	struct inode *inode;
 	struct list_head splice;
 
+	trans = (struct btrfs_trans_handle *)current->journal_info;
 	INIT_LIST_HEAD(&splice);
 
+	WARN_ON(trans && trans->endio);
 	mutex_lock(&root->fs_info->ordered_operations_mutex);
 	spin_lock(&root->fs_info->ordered_extent_lock);
 again:
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 29bef63..009d2db 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -310,6 +310,7 @@ again:
 	h->use_count = 1;
 	h->block_rsv = NULL;
 	h->orig_rsv = NULL;
+	h->endio = 0;
 
 	smp_mb();
 	if (cur_trans->blocked && may_wait_transaction(root, type)) {
@@ -467,20 +468,17 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	while (count < 4) {
 		unsigned long cur = trans->delayed_ref_updates;
 		trans->delayed_ref_updates = 0;
-		if (cur &&
-		    trans->transaction->delayed_refs.num_heads_ready > 64) {
-			trans->delayed_ref_updates = 0;
-
-			/*
-			 * do a full flush if the transaction is trying
-			 * to close
-			 */
-			if (trans->transaction->delayed_refs.flushing)
-				cur = 0;
-			btrfs_run_delayed_refs(trans, root, cur);
-		} else {
+		if (!cur ||
+		    trans->transaction->delayed_refs.num_heads_ready <= 64)
 			break;
-		}
+
+		/*
+		 * do a full flush if the transaction is trying
+		 * to close
+		 */
+		if (trans->transaction->delayed_refs.flushing && throttle)
+			cur = 0;
+		btrfs_run_delayed_refs(trans, root, cur);
 		count++;
 	}
 
@@ -498,6 +496,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 			 * our use_count.
 			 */
 			trans->use_count++;
+			WARN_ON(trans->endio);
 			return btrfs_commit_transaction(trans, root);
 		} else {
 			wake_up_process(info->transaction_kthread);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 02564e6..7eae404 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -55,6 +55,7 @@ struct btrfs_trans_handle {
 	struct btrfs_transaction *transaction;
 	struct btrfs_block_rsv *block_rsv;
 	struct btrfs_block_rsv *orig_rsv;
+	unsigned endio;
 };
 
 struct btrfs_pending_snapshot {
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 47+ messages in thread
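
The debugging idea in the patch above, condensed (names follow the patch;
this is an illustration of the pattern, not additional upstream code):
btrfs_finish_ordered_io() tags its transaction handle with ->endio = 1, and
any path that must never be reached while finishing ordered IO can check the
tag through the per-task journal_info pointer:

	static inline void warn_if_in_endio(void)
	{
		struct btrfs_trans_handle *trans = current->journal_info;

		/* fires if e.g. btrfs_run_ordered_operations() or a
		 * transaction commit is reached from the endio path */
		WARN_ON(trans && trans->endio);
	}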

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-27 20:39             ` Christian Brunner
  0 siblings, 0 replies; 47+ messages in thread
From: Christian Brunner @ 2011-10-27 20:39 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Sage Weil, ceph-devel, linux-btrfs

2011/10/27 Josef Bacik <josef@redhat.com>:
> On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
>> 2011/10/24 Josef Bacik <josef@redhat.com>:
>> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>> >> [adding linux-btrfs to cc]
>> >>
>> >> Josef, Chris, any ideas on the below issues?
>> >>
>> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
>> >> >
>> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
>> >> > slightly better. I can run an OSD for about 3 days without problems,
>> >> > but then again the load increases. This time, I can see that the
>> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
>> >> > than usual.
>> >>
>> >> FYI in this scenario you're exposed to the same journal replay issues that
>> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
>> >> not be all that special, though, so this problem shouldn't be unique to
>> >> ceph.
>> >>
>> >
>> > Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
>> > is up to.
>>
>> Capturing this seems to be not easy. I have a few traces (see
>> attachment), but with sysrq+w I do not get a stacktrace of
>> btrfs-endio-write. What I have is a "latencytop -c" output which is
>> interesting:
>>
>> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
>> tries to balance the load over all OSDs, so all filesystems should get
>> a nearly equal load. At the moment one filesystem seems to have a
>> problem. When running with iostat I see the following
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> sdd               0.00     0.00    0.00    4.33     0.00    53.33    12.31     0.08   19.38  12.23   5.30
>> sdc               0.00     1.00    0.00  228.33     0.00  1957.33     8.57    74.33  380.76   2.74  62.57
>> sdb               0.00     0.00    0.00    1.33     0.00    16.00    12.00     0.03   25.00  19.75   2.63
>> sda               0.00     0.00    0.00    0.67     0.00     8.00    12.00     0.01   19.50  12.50   0.83
>>
>> The PID of the ceph-osd that is running on sdc is 2053 and when I look
>> with top I see this process and a btrfs-endio-writer (PID 5447):
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
>>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
>>
>> In the latencytop output you can see that those processes have a much
>> higher latency, than the other ceph-osd and btrfs-endio-writers.
>>
>> Regards,
>> Christian
>
> Ok just a shot in the dark, but could you give this a whirl and see if it helps
> you?  Thanks

Thanks for the patch! I'll install it tomorrow and I think that I can
report back on Monday. It always takes a few days until the load goes
up.

Regards,
Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
       [not found]               ` <CAO47_-8YGAxoYOBRKxLP2HULqEtV5bMugzzybq3srCVFZczgGA@mail.gmail.com>
@ 2011-10-31 10:25                 ` Christian Brunner
  2011-10-31 13:29                   ` Christian Brunner
  0 siblings, 1 reply; 47+ messages in thread
From: Christian Brunner @ 2011-10-31 10:25 UTC (permalink / raw)
  To: Josef Bacik, Sage Weil, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3269 bytes --]

2011/10/31 Christian Brunner <chb@muc.de>:
> 2011/10/31 Christian Brunner <chb@muc.de>:
>>
>> The patch didn't hurt, but I've to tell you that I'm still seeing the
>> same old problems. Load is going up again:
>>
>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  5502 root      20   0     0    0    0 S 52.5 0.0 106:29.97 btrfs-endio-wri
>>  1976 root      20   0  601m 211m 1464 S 28.3 0.9 115:10.62 ceph-osd
>>
>> And I have hit our warning again:
>>
>> [223560.970713] ------------[ cut here ]------------
>> [223560.976043] WARNING: at fs/btrfs/inode.c:2118
>> btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
>> [223560.985411] Hardware name: ProLiant DL180 G6
>> [223560.990491] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
>> bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
>> i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
>> [last unloaded: scsi_wait_scan]
>> [223561.014748] Pid: 2079, comm: ceph-osd Tainted: P
>> 3.0.6-1.fits.9.el6.x86_64 #1
>> [223561.023874] Call Trace:
>> [223561.026738]  [<ffffffff8106344f>] warn_slowpath_common+0x7f/0xc0
>> [223561.033564]  [<ffffffff810634aa>] warn_slowpath_null+0x1a/0x20
>> [223561.040272]  [<ffffffffa0282120>] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
>> [223561.048278]  [<ffffffffa027ce55>] commit_fs_roots+0xc5/0x1b0 [btrfs]
>> [223561.055534]  [<ffffffff8154c231>] ? mutex_lock+0x31/0x60
>> [223561.061666]  [<ffffffffa027ddbe>]
>> btrfs_commit_transaction+0x3ce/0x820 [btrfs]
>> [223561.069876]  [<ffffffffa027d1b8>] ? wait_current_trans+0x28/0x110 [btrfs]
>> [223561.077582]  [<ffffffffa027e325>] ? join_transaction+0x25/0x250 [btrfs]
>> [223561.085065]  [<ffffffff81086410>] ? wake_up_bit+0x40/0x40
>> [223561.091251]  [<ffffffffa025a329>] btrfs_sync_fs+0x59/0xd0 [btrfs]
>> [223561.098187]  [<ffffffffa02abc65>] btrfs_ioctl+0x495/0xd50 [btrfs]
>> [223561.105120]  [<ffffffff8125ed20>] ? inode_has_perm+0x30/0x40
>> [223561.111575]  [<ffffffff81261a2c>] ? file_has_perm+0xdc/0xf0
>> [223561.117924]  [<ffffffff8117086a>] do_vfs_ioctl+0x9a/0x5a0
>> [223561.124072]  [<ffffffff81170e11>] sys_ioctl+0xa1/0xb0
>> [223561.129842]  [<ffffffff81555702>] system_call_fastpath+0x16/0x1b
>> [223561.136699] ---[ end trace 176e8be8996f25f6 ]---
>
> [ Not sending this to the lists, as the attachment is large ].
>
> I've spent a little time to do some tracing with ftrace. Its output
> seems to be right (at least as far as I can tell). I hope that its
> output can give you an insight on whats going on.
>
> The interesting PIDs in the trace are:
>
>  5502 root      20   0     0    0    0 S 33.6 0.0 118:28.37 btrfs-endio-wri
>  5518 root      20   0     0    0    0 S 29.3 0.0 41:23.58 btrfs-endio-wri
>  8059 root      20   0  400m  48m 2756 S  8.0  0.2   8:31.56 ceph-osd
>  7993 root      20   0  401m  41m 2808 S 13.6  0.2   7:58.38 ceph-osd
>

[ adding linux-btrfs again ]

I've been digging into this a bit further:

Attached is another ftrace report that I've filtered for "btrfs_*"
calls and limited to CPU0 (this is where PID 5502 was running).

From what I can see there is a lot of time consumed in
btrfs_reserve_extent(). Is this normal?

Thanks,
Christian

[-- Attachment #2: ftrace_btrfs_cpu0.bz2 --]
[-- Type: application/x-bzip2, Size: 19541 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-31 10:25                 ` Christian Brunner
@ 2011-10-31 13:29                   ` Christian Brunner
  2011-10-31 14:04                     ` Josef Bacik
  0 siblings, 1 reply; 47+ messages in thread
From: Christian Brunner @ 2011-10-31 13:29 UTC (permalink / raw)
  To: Josef Bacik, Sage Weil, linux-btrfs

2011/10/31 Christian Brunner <chb@muc.de>:
> 2011/10/31 Christian Brunner <chb@muc.de>:
>> 2011/10/31 Christian Brunner <chb@muc.de>:
>>>
>>> The patch didn't hurt, but I've to tell you that I'm still seeing the
>>> same old problems. Load is going up again:
>>>
>>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>  5502 root      20   0     0    0    0 S 52.5 0.0 106:29.97 btrfs-endio-wri
>>>  1976 root      20   0  601m 211m 1464 S 28.3 0.9 115:10.62 ceph-osd
>>>
>>> And I have hit our warning again:
>>>
>>> [223560.970713] ------------[ cut here ]------------
>>> [223560.976043] WARNING: at fs/btrfs/inode.c:2118
>>> btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
>>> [223560.985411] Hardware name: ProLiant DL180 G6
>>> [223560.990491] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
>>> bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
>>> i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
>>> [last unloaded: scsi_wait_scan]
>>> [223561.014748] Pid: 2079, comm: ceph-osd Tainted: P
>>> 3.0.6-1.fits.9.el6.x86_64 #1
>>> [223561.023874] Call Trace:
>>> [223561.026738]  [<ffffffff8106344f>] warn_slowpath_common+0x7f/0xc0
>>> [223561.033564]  [<ffffffff810634aa>] warn_slowpath_null+0x1a/0x20
>>> [223561.040272]  [<ffffffffa0282120>] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
>>> [223561.048278]  [<ffffffffa027ce55>] commit_fs_roots+0xc5/0x1b0 [btrfs]
>>> [223561.055534]  [<ffffffff8154c231>] ? mutex_lock+0x31/0x60
>>> [223561.061666]  [<ffffffffa027ddbe>]
>>> btrfs_commit_transaction+0x3ce/0x820 [btrfs]
>>> [223561.069876]  [<ffffffffa027d1b8>] ? wait_current_trans+0x28/0x110 [btrfs]
>>> [223561.077582]  [<ffffffffa027e325>] ? join_transaction+0x25/0x250 [btrfs]
>>> [223561.085065]  [<ffffffff81086410>] ? wake_up_bit+0x40/0x40
>>> [223561.091251]  [<ffffffffa025a329>] btrfs_sync_fs+0x59/0xd0 [btrfs]
>>> [223561.098187]  [<ffffffffa02abc65>] btrfs_ioctl+0x495/0xd50 [btrfs]
>>> [223561.105120]  [<ffffffff8125ed20>] ? inode_has_perm+0x30/0x40
>>> [223561.111575]  [<ffffffff81261a2c>] ? file_has_perm+0xdc/0xf0
>>> [223561.117924]  [<ffffffff8117086a>] do_vfs_ioctl+0x9a/0x5a0
>>> [223561.124072]  [<ffffffff81170e11>] sys_ioctl+0xa1/0xb0
>>> [223561.129842]  [<ffffffff81555702>] system_call_fastpath+0x16/0x1b
>>> [223561.136699] ---[ end trace 176e8be8996f25f6 ]---
>>
>> [ Not sending this to the lists, as the attachment is large ].
>>
>> I've spent a little time to do some tracing with ftrace. Its output
>> seems to be right (at least as far as I can tell). I hope that its
>> output can give you an insight on whats going on.
>>
>> The interesting PIDs in the trace are:
>>
>>  5502 root      20   0     0    0    0 S 33.6 0.0 118:28.37 btrfs-endio-wri
>>  5518 root      20   0     0    0    0 S 29.3 0.0 41:23.58 btrfs-endio-wri
>>  8059 root      20   0  400m  48m 2756 S  8.0  0.2   8:31.56 ceph-osd
>>  7993 root      20   0  401m  41m 2808 S 13.6  0.2   7:58.38 ceph-osd
>>
>
> [ adding linux-btrfs again ]
>
> I've been digging into this a bit further:
>
> Attached is another ftrace report that I've filtered for "btrfs_*"
> calls and limited to CPU0 (this is where PID 5502 was running).
>
> From what I can see there is a lot of time consumed in
> btrfs_reserve_extent(). Is this normal?

Sorry for spamming, but in the meantime I'm almost certain that the
problem is inside find_free_extent (called from btrfs_reserve_extent).

When I'm running ftrace for a sample period of 10s my system is
wasting a total of 4.2 seconds inside find_free_extent(). Each call to
find_free_extent() is taking an average of 4 milliseconds to complete.
On a recently rebooted system this is only 1-2 us!

I'm not sure if the problem is occurring suddenly or slowly over time.
(At the moment I suspect that it's occurring suddenly, but I still have
to investigate this).

Thanks,
Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread
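
As a cross-check on numbers like these, independent of ftrace, the call site
can be timed directly; a minimal sketch (the argument list of
find_free_extent() is elided, and this is a throwaway debugging hack, not
part of any posted patch):

	/* wrap the allocator call in btrfs_reserve_extent() */
	ktime_t start = ktime_get();

	ret = find_free_extent(/* ... unchanged arguments ... */);
	trace_printk("find_free_extent took %lld us\n",
		     ktime_us_delta(ktime_get(), start));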

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-31 13:29                   ` Christian Brunner
@ 2011-10-31 14:04                     ` Josef Bacik
  0 siblings, 0 replies; 47+ messages in thread
From: Josef Bacik @ 2011-10-31 14:04 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Josef Bacik, Sage Weil, linux-btrfs

On Mon, Oct 31, 2011 at 02:29:44PM +0100, Christian Brunner wrote:
> 2011/10/31 Christian Brunner <chb@muc.de>:
> > 2011/10/31 Christian Brunner <chb@muc.de>:
> >> 2011/10/31 Christian Brunner <chb@muc.de>:
> >>>
> >>> The patch didn't hurt, but I've to tell you that I'm still seeing the
> >>> same old problems. Load is going up again:
> >>>
> >>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>>  5502 root      20   0     0    0    0 S 52.5 0.0 106:29.97 btrfs-endio-wri
> >>>  1976 root      20   0  601m 211m 1464 S 28.3 0.9 115:10.62 ceph-osd
> >>>
> >>> And I have hit our warning again:
> >>>
> >>> [223560.970713] ------------[ cut here ]------------
> >>> [223560.976043] WARNING: at fs/btrfs/inode.c:2118
> >>> btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
> >>> [223560.985411] Hardware name: ProLiant DL180 G6
> >>> [223560.990491] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
> >>> bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
> >>> i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
> >>> [last unloaded: scsi_wait_scan]
> >>> [223561.014748] Pid: 2079, comm: ceph-osd Tainted: P
> >>> 3.0.6-1.fits.9.el6.x86_64 #1
> >>> [223561.023874] Call Trace:
> >>> [223561.026738]  [<ffffffff8106344f>] warn_slowpath_common+0x7f/0xc0
> >>> [223561.033564]  [<ffffffff810634aa>] warn_slowpath_null+0x1a/0x20
> >>> [223561.040272]  [<ffffffffa0282120>] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
> >>> [223561.048278]  [<ffffffffa027ce55>] commit_fs_roots+0xc5/0x1b0 [btrfs]
> >>> [223561.055534]  [<ffffffff8154c231>] ? mutex_lock+0x31/0x60
> >>> [223561.061666]  [<ffffffffa027ddbe>]
> >>> btrfs_commit_transaction+0x3ce/0x820 [btrfs]
> >>> [223561.069876]  [<ffffffffa027d1b8>] ? wait_current_trans+0x28/0x110 [btrfs]
> >>> [223561.077582]  [<ffffffffa027e325>] ? join_transaction+0x25/0x250 [btrfs]
> >>> [223561.085065]  [<ffffffff81086410>] ? wake_up_bit+0x40/0x40
> >>> [223561.091251]  [<ffffffffa025a329>] btrfs_sync_fs+0x59/0xd0 [btrfs]
> >>> [223561.098187]  [<ffffffffa02abc65>] btrfs_ioctl+0x495/0xd50 [btrfs]
> >>> [223561.105120]  [<ffffffff8125ed20>] ? inode_has_perm+0x30/0x40
> >>> [223561.111575]  [<ffffffff81261a2c>] ? file_has_perm+0xdc/0xf0
> >>> [223561.117924]  [<ffffffff8117086a>] do_vfs_ioctl+0x9a/0x5a0
> >>> [223561.124072]  [<ffffffff81170e11>] sys_ioctl+0xa1/0xb0
> >>> [223561.129842]  [<ffffffff81555702>] system_call_fastpath+0x16/0x1b
> >>> [223561.136699] ---[ end trace 176e8be8996f25f6 ]---
> >>
> >> [ Not sending this to the lists, as the attachment is large ].
> >>
> >> I've spent a little time to do some tracing with ftrace. Its output
> >> seems to be right (at least as far as I can tell). I hope that its
> >> output can give you an insight on whats going on.
> >>
> >> The interesting PIDs in the trace are:
> >>
> >>  5502 root      20   0     0    0    0 S 33.6 0.0 118:28.37 btrfs-endio-wri
> >>  5518 root      20   0     0    0    0 S 29.3 0.0 41:23.58 btrfs-endio-wri
> >>  8059 root      20   0  400m  48m 2756 S  8.0  0.2   8:31.56 ceph-osd
> >>  7993 root      20   0  401m  41m 2808 S 13.6  0.2   7:58.38 ceph-osd
> >>
> >
> > [ adding linux-btrfs again ]
> >
> > I've been digging into this a bit further:
> >
> > Attached is another ftrace report that I've filtered for "btrfs_*"
> > calls and limited to CPU0 (this is where PID 5502 was running).
> >
> > From what I can see there is a lot of time consumed in
> > btrfs_reserve_extent(). Is this normal?
>
> Sorry for spamming, but in the meantime I'm almost certain that the
> problem is inside find_free_extent (called from btrfs_reserve_extent).
>
> When I'm running ftrace for a sample period of 10s my system is
> wasting a total of 4.2 seconds inside find_free_extent(). Each call to
> find_free_extent() is taking an average of 4 milliseconds to complete.
> On a recently rebooted system this is only 1-2 us!
>
> I'm not sure if the problem is occurring suddenly or slowly over time.
> (At the moment I suspect that it's occurring suddenly, but I still have
> to investigate this).
>

Ugh ok then this is lxo's problem with our clustering stuff taking way too much
time.  I guess it's time to actually take a hard look at that code.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-27 10:59     ` Stefan Majer
  (?)
@ 2011-10-27 11:17     ` Martin Mailand
  -1 siblings, 0 replies; 47+ messages in thread
From: Martin Mailand @ 2011-10-27 11:17 UTC (permalink / raw)
  To: Stefan Majer
  Cc: ceph-devel, linux-btrfs, Sage Weil, chb, Josef Bacik, chris.mason

Hi Stefan,
I think the machine has enough ram.

root@s-brick-003:~# free -m
              total       used       free     shared    buffers     cached
Mem:          3924       2401       1522          0         42       2115
-/+ buffers/cache:        243       3680
Swap:         1951          0       1951

There is no swap usage at all.

-martin


On 27.10.2011 12:59, Stefan Majer wrote:
> Hi Martin,
>
> a quick dig into your perf report shows a large amount of swapper work.
> If this is the case, I would suspect latency. So do you not have
> enough physical RAM in your machine?
>
> Greetings
>
> Stefan Majer
>
> On Thu, Oct 27, 2011 at 12:53 PM, Martin Mailand<martin@tuxadero.com>  wrote:
>> Hi
>> resend without the perf attachment, which could be found here:
>> http://tuxadero.com/multistorage/perf.report.txt.bz2
>>
>> Best Regards,
>>   martin
>>
>> -------- Original Message --------
>> Subject: Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
>> Date: Wed, 26 Oct 2011 22:38:47 +0200
>> From: Martin Mailand<martin@tuxadero.com>
>> Reply-To: martin@tuxadero.com
>> To: Sage Weil<sage@newdream.net>
>> CC: Christian Brunner<chb@muc.de>, ceph-devel@vger.kernel.org,
>>   linux-btrfs@vger.kernel.org
>>
>> Hi,
>> I have more or less the same setup as Christian and I suffer the same
>> problems.
>> But as far as I can see the output of latencytop and perf differs from
>> Christian's, both are attached.
>> I was wondering about the high latency from btrfs-submit.
>>
>> Process btrfs-submit-0 (970) Total: 2123.5 msec
>>
>> I have as well the high IO rate and high IO wait.
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>             0.60    0.00    2.20   82.40    0.00   14.80
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sda               0.00     0.00    0.00    8.40     0.00    74.40    17.71     0.03    3.81    0.00    3.81   3.81   3.20
>> sdb               0.00     7.00    0.00  269.80     0.00  1224.80     9.08   107.19  398.69    0.00  398.69   3.15  85.00
>>
>> top - 21:57:41 up  8:41,  1 user,  load average: 0.65, 0.79, 0.76
>> Tasks: 179 total,   1 running, 178 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  0.6%us,  2.4%sy,  0.0%ni, 70.8%id, 25.8%wa,  0.0%hi,  0.3%si,
>> 0.0%st
>> Mem:   4018276k total,  1577728k used,  2440548k free,    10496k buffers
>> Swap:  1998844k total,        0k used,  1998844k free,  1316696k cached
>>
>>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>
>>   1399 root      20   0  548m 103m 3428 S  0.0  2.6   2:01.85 ceph-osd
>>
>>   1401 root      20   0  548m 103m 3428 S  0.0  2.6   1:51.71 ceph-osd
>>
>>   1400 root      20   0  548m 103m 3428 S  0.0  2.6   1:50.30 ceph-osd
>>
>>   1391 root      20   0     0    0    0 S  0.0  0.0   1:18.39
>> btrfs-endio-wri
>>
>>    976 root      20   0     0    0    0 S  0.0  0.0   1:18.11
>> btrfs-endio-wri
>>
>>   1367 root      20   0     0    0    0 S  0.0  0.0   1:05.60
>> btrfs-worker-1
>>
>>    968 root      20   0     0    0    0 S  0.0  0.0   1:05.45
>> btrfs-worker-0
>>
>>   1163 root      20   0  141m 1636 1100 S  0.0  0.0   1:00.56 collectd
>>
>>    970 root      20   0     0    0    0 S  0.0  0.0   0:47.73
>> btrfs-submit-0
>>
>>   1402 root      20   0  548m 103m 3428 S  0.0  2.6   0:34.86 ceph-osd
>>
>>   1392 root      20   0     0    0    0 S  0.0  0.0   0:33.70
>> btrfs-endio-met
>>
>>    975 root      20   0     0    0    0 S  0.0  0.0   0:32.70
>> btrfs-endio-met
>>
>>   1415 root      20   0  548m 103m 3428 S  0.0  2.6   0:28.29 ceph-osd
>>
>>   1414 root      20   0  548m 103m 3428 S  0.0  2.6   0:28.24 ceph-osd
>>
>>   1397 root      20   0  548m 103m 3428 S  0.0  2.6   0:24.60 ceph-osd
>>
>>   1436 root      20   0  548m 103m 3428 S  0.0  2.6   0:13.31 ceph-osd
>>
>>
>> Here is my setup.
>> Kernel v3.1 + Josef
>>
>> The config for this osd (ceph version 0.37
>> (commit:a6f3bbb744a6faea95ae48317f0b838edb16a896)) is:
>> [osd.1]
>>          host = s-brick-003
>>          osd journal = /dev/sda7
>>          btrfs devs = /dev/sdb
>>         btrfs options = noatime
>>         filestore_btrfs_snap = false
>>
>> I hope this helps to pin point the problem.
>>
>> Best Regards,
>> martin
>>
>>
>> Sage Weil wrote:
>>>
>>> On Wed, 26 Oct 2011, Christian Brunner wrote:
>>>>
>>>> 2011/10/26 Sage Weil<sage@newdream.net>:
>>>>>
>>>>> On Wed, 26 Oct 2011, Christian Brunner wrote:
>>>>>>>>>
>>>>>>>>> Christian, have you tweaked those settings in your ceph.conf?  It
>>>>>>>>> would be
>>>>>>>>> something like 'journal dio = false'.  If not, can you verify that
>>>>>>>>> directio shows true when the journal is initialized from your osd
>>>>>>>>> log?
>>>>>>>>> E.g.,
>>>>>>>>>
>>>>>>>>>   2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open
>>>>>>>>> dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
>>>>>>>>>
>>>>>>>>> If directio = 1 for you, something else funky is causing those
>>>>>>>>> blkdev_fsync's...
>>>>>>>>
>>>>>>>> I've looked it up in the logs - directio is 1:
>>>>>>>>
>>>>>>>> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
>>>>>>>> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
>>>>>>>> bytes, directio = 1
>>>>>>>
>>>>>>> Do you mind capturing an strace?  I'd like to see where that
>>>>>>> blkdev_fsync
>>>>>>> is coming from.
>>>>>>
>>>>>> Here is an strace. I can see a lot of sync_file_range operations.
>>>>>
>>>>> Yeah, these all look like the flusher thread, and shouldn't be hitting
>>>>> blkdev_fsync.  Can you confirm that with
>>>>>
>>>>>        filestore flusher = false
>>>>>        filestore sync flush = false
>>>>>
>>>>> you get no sync_file_range at all?  I wonder if this is also perf lying
>>>>> about the call chain.
>>>>
>>>> Yes, setting this makes the sync_file_range calls go away.
>>>
>>> Okay.  That means either sync_file_range on a regular btrfs file is
>>> triggering blkdev_fsync somewhere in btrfs, there is an extremely sneaky
>>> bug that is mixing up file descriptors, or latencytop is lying.  I'm
>>> guessing the latter, given the other weirdness Josef and Chris were
>>> seeing.  :)
>>>
>>>> Is it safe to use these settings with "filestore btrfs snap = 0"?
>>>
>>> Yeah.  They're purely a performance thing to push as much dirty data to
>>> disk as quickly as possible to minimize the snapshot create latency.
>>> You'll notice the write throughput tends to tank when you turn them off.
>>>
>>> sage
>>
>>
>
>
>


^ permalink raw reply	[flat|nested] 47+ messages in thread
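
What the "filestore flusher" discussed above amounts to, in a generic
userspace sketch (this is not Ceph's actual FileStore code, just the
sync_file_range() pattern being described):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	static void write_and_start_writeback(int fd, const void *buf,
					      size_t len, off_t off)
	{
		if (pwrite(fd, buf, len, off) != (ssize_t)len)
			return;
		/* kick off asynchronous writeback of just this range, so a
		 * later snapshot/sync has less dirty data left to flush */
		sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
	}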

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-27 10:59     ` Stefan Majer
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Majer @ 2011-10-27 10:59 UTC (permalink / raw)
  To: Martin Mailand
  Cc: ceph-devel, linux-btrfs, Sage Weil, chb, Josef Bacik, chris.mason

Hi Martin,

a quick dig into your perf report shows a large amount of swapper work.
If this is the case, I would suspect latency. So do you not have
enough physical RAM in your machine?

Greetings

Stefan Majer

On Thu, Oct 27, 2011 at 12:53 PM, Martin Mailand <martin@tuxadero.com> wrote:
> Hi
> resend without the perf attachment, which could be found here:
> http://tuxadero.com/multistorage/perf.report.txt.bz2
>
> Best Regards,
>  martin
>
> -------- Original Message --------
> Subject: Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
> Date: Wed, 26 Oct 2011 22:38:47 +0200
> From: Martin Mailand <martin@tuxadero.com>
> Reply-To: martin@tuxadero.com
> To: Sage Weil <sage@newdream.net>
> CC: Christian Brunner <chb@muc.de>, ceph-devel@vger.kernel.org,
>  linux-btrfs@vger.kernel.org
>
> Hi,
> I have more or less the same setup as Christian and I suffer the same
> problems.
> But as far as I can see the output of latencytop and perf differs from
> Christian's, both are attached.
> I was wondering about the high latency from btrfs-submit.
>
> Process btrfs-submit-0 (970) Total: 2123.5 msec
>
> I have as well the high IO rate and high IO wait.
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.60    0.00    2.20   82.40    0.00   14.80
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    0.00    8.40     0.00    74.40    17.71     0.03    3.81    0.00    3.81   3.81   3.20
> sdb               0.00     7.00    0.00  269.80     0.00  1224.80     9.08   107.19  398.69    0.00  398.69   3.15  85.00
>
> top - 21:57:41 up  8:41,  1 user,  load average: 0.65, 0.79, 0.76
> Tasks: 179 total,   1 running, 178 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.6%us,  2.4%sy,  0.0%ni, 70.8%id, 25.8%wa,  0.0%hi,  0.3%si,
> 0.0%st
> Mem:   4018276k total,  1577728k used,  2440548k free,    10496k buffers
> Swap:  1998844k total,        0k used,  1998844k free,  1316696k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  1399 root      20   0  548m 103m 3428 S  0.0  2.6   2:01.85 ceph-osd
>  1401 root      20   0  548m 103m 3428 S  0.0  2.6   1:51.71 ceph-osd
>  1400 root      20   0  548m 103m 3428 S  0.0  2.6   1:50.30 ceph-osd
>  1391 root      20   0     0    0    0 S  0.0  0.0   1:18.39 btrfs-endio-wri
>   976 root      20   0     0    0    0 S  0.0  0.0   1:18.11 btrfs-endio-wri
>  1367 root      20   0     0    0    0 S  0.0  0.0   1:05.60 btrfs-worker-1
>   968 root      20   0     0    0    0 S  0.0  0.0   1:05.45 btrfs-worker-0
>  1163 root      20   0  141m 1636 1100 S  0.0  0.0   1:00.56 collectd
>   970 root      20   0     0    0    0 S  0.0  0.0   0:47.73 btrfs-submit-0
>  1402 root      20   0  548m 103m 3428 S  0.0  2.6   0:34.86 ceph-osd
>  1392 root      20   0     0    0    0 S  0.0  0.0   0:33.70 btrfs-endio-met
>   975 root      20   0     0    0    0 S  0.0  0.0   0:32.70 btrfs-endio-met
>  1415 root      20   0  548m 103m 3428 S  0.0  2.6   0:28.29 ceph-osd
>  1414 root      20   0  548m 103m 3428 S  0.0  2.6   0:28.24 ceph-osd
>  1397 root      20   0  548m 103m 3428 S  0.0  2.6   0:24.60 ceph-osd
>  1436 root      20   0  548m 103m 3428 S  0.0  2.6   0:13.31 ceph-osd
>
> Here is my setup.
> Kernel v3.1 + Josef
>
> The config for this osd (ceph version 0.37
> (commit:a6f3bbb744a6faea95ae48317f0b838edb16a896)) is:
> [osd.1]
>         host = s-brick-003
>         osd journal = /dev/sda7
>         btrfs devs = /dev/sdb
>        btrfs options = noatime
>        filestore_btrfs_snap = false
>
> I hope this helps to pinpoint the problem.
>
> Best Regards,
> martin
>
>
> Sage Weil wrote:
>>
>> On Wed, 26 Oct 2011, Christian Brunner wrote:
>>>
>>> 2011/10/26 Sage Weil <sage@newdream.net>:
>>>>
>>>> On Wed, 26 Oct 2011, Christian Brunner wrote:
>>>>>>>>
>>>>>>>> Christian, have you tweaked those settings in your ceph.conf?  It
>>>>>>>> would be
>>>>>>>> something like 'journal dio = false'.  If not, can you verify that
>>>>>>>> directio shows true when the journal is initialized from your osd
>>>>>>>> log?
>>>>>>>> E.g.,
>>>>>>>>
>>>>>>>>  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open
>>>>>>>> dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
>>>>>>>>
>>>>>>>> If directio = 1 for you, something else funky is causing those
>>>>>>>> blkdev_fsync's...
>>>>>>>
>>>>>>> I've looked it up in the logs - directio is 1:
>>>>>>>
>>>>>>> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
>>>>>>> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
>>>>>>> bytes, directio = 1
>>>>>>
>>>>>> Do you mind capturing an strace?  I'd like to see where that
>>>>>> blkdev_fsync
>>>>>> is coming from.
>>>>>
>>>>> Here is an strace. I can see a lot of sync_file_range operations.
>>>>
>>>> Yeah, these all look like the flusher thread, and shouldn't be hitting
>>>> blkdev_fsync.  Can you confirm that with
>>>>
>>>>       filestore flusher = false
>>>>       filestore sync flush = false
>>>>
>>>> you get no sync_file_range at all?  I wonder if this is also perf lying
>>>> about the call chain.
>>>
>>> Yes, setting this makes the sync_file_range calls go away.
>>
>> Okay.  That means either sync_file_range on a regular btrfs file is
>> triggering blkdev_fsync somewhere in btrfs, there is an extremely sneaky
>> bug that is mixing up file descriptors, or latencytop is lying.  I'm
>> guessing the latter, given the other weirdness Josef and Chris were
>> seeing.  :)
>>
>>> Is it safe to use these settings with "filestore btrfs snap = 0"?
>>
>> Yeah.  They're purely a performance thing to push as much dirty data to
>> disk as quickly as possible to minimize the snapshot create latency.
>> You'll notice the write throughput tends to tank with them off.
>>
>> sage
>
>



-- 
Stefan Majer
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
       [not found] <4EA86FD7.4030407@tuxadero.com>
@ 2011-10-27 10:53 ` Martin Mailand
  2011-10-27 10:59     ` Stefan Majer
  0 siblings, 1 reply; 47+ messages in thread
From: Martin Mailand @ 2011-10-27 10:53 UTC (permalink / raw)
  To: ceph-devel; +Cc: linux-btrfs, Sage Weil, chb, Josef Bacik, chris.mason

[-- Attachment #1: Type: text/plain, Size: 5372 bytes --]

Hi
resend without the perf attachment, which can be found here:
http://tuxadero.com/multistorage/perf.report.txt.bz2

Best Regards,
  martin

-------- Original Message --------
Subject: Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
Date: Wed, 26 Oct 2011 22:38:47 +0200
From: Martin Mailand <martin@tuxadero.com>
Reply-To: martin@tuxadero.com
To: Sage Weil <sage@newdream.net>
CC: Christian Brunner <chb@muc.de>, ceph-devel@vger.kernel.org,
  linux-btrfs@vger.kernel.org

Hi,
I have more or less the same setup as Christian and I suffer from the same
problems.
But as far as I can see, the output of latencytop and perf differs from
Christian's; both are attached.
I was wondering about the high latency from btrfs-submit.

Process btrfs-submit-0 (970) Total: 2123.5 msec

I also see the high IO rate and high IO wait.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.60    0.00    2.20   82.40    0.00   14.80

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    8.40     0.00    74.40    17.71     0.03    3.81    0.00    3.81   3.81   3.20
sdb               0.00     7.00    0.00  269.80     0.00  1224.80     9.08   107.19  398.69    0.00  398.69   3.15  85.00

top - 21:57:41 up  8:41,  1 user,  load average: 0.65, 0.79, 0.76
Tasks: 179 total,   1 running, 178 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.6%us,  2.4%sy,  0.0%ni, 70.8%id, 25.8%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   4018276k total,  1577728k used,  2440548k free,    10496k buffers
Swap:  1998844k total,        0k used,  1998844k free,  1316696k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1399 root      20   0  548m 103m 3428 S  0.0  2.6   2:01.85 ceph-osd
 1401 root      20   0  548m 103m 3428 S  0.0  2.6   1:51.71 ceph-osd
 1400 root      20   0  548m 103m 3428 S  0.0  2.6   1:50.30 ceph-osd
 1391 root      20   0     0    0    0 S  0.0  0.0   1:18.39 btrfs-endio-wri
  976 root      20   0     0    0    0 S  0.0  0.0   1:18.11 btrfs-endio-wri
 1367 root      20   0     0    0    0 S  0.0  0.0   1:05.60 btrfs-worker-1
  968 root      20   0     0    0    0 S  0.0  0.0   1:05.45 btrfs-worker-0
 1163 root      20   0  141m 1636 1100 S  0.0  0.0   1:00.56 collectd
  970 root      20   0     0    0    0 S  0.0  0.0   0:47.73 btrfs-submit-0
 1402 root      20   0  548m 103m 3428 S  0.0  2.6   0:34.86 ceph-osd
 1392 root      20   0     0    0    0 S  0.0  0.0   0:33.70 btrfs-endio-met
  975 root      20   0     0    0    0 S  0.0  0.0   0:32.70 btrfs-endio-met
 1415 root      20   0  548m 103m 3428 S  0.0  2.6   0:28.29 ceph-osd
 1414 root      20   0  548m 103m 3428 S  0.0  2.6   0:28.24 ceph-osd
 1397 root      20   0  548m 103m 3428 S  0.0  2.6   0:24.60 ceph-osd
 1436 root      20   0  548m 103m 3428 S  0.0  2.6   0:13.31 ceph-osd

Here is my setup.
Kernel v3.1 + Josef

The config for this osd (ceph version 0.37
(commit:a6f3bbb744a6faea95ae48317f0b838edb16a896)) is:
[osd.1]
          host = s-brick-003
          osd journal = /dev/sda7
          btrfs devs = /dev/sdb
          btrfs options = noatime
          filestore_btrfs_snap = false

I hope this helps to pinpoint the problem.

Best Regards,
martin


Sage Weil wrote:
> On Wed, 26 Oct 2011, Christian Brunner wrote:
>> 2011/10/26 Sage Weil <sage@newdream.net>:
>>> On Wed, 26 Oct 2011, Christian Brunner wrote:
>>>>>>> Christian, have you tweaked those settings in your ceph.conf?  It would be
>>>>>>> something like 'journal dio = false'.  If not, can you verify that
>>>>>>> directio shows true when the journal is initialized from your osd log?
>>>>>>> E.g.,
>>>>>>>
>>>>>>>  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
>>>>>>>
>>>>>>> If directio = 1 for you, something else funky is causing those
>>>>>>> blkdev_fsync's...
>>>>>> I've looked it up in the logs - directio is 1:
>>>>>>
>>>>>> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
>>>>>> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
>>>>>> bytes, directio = 1
>>>>> Do you mind capturing an strace?  I'd like to see where that blkdev_fsync
>>>>> is coming from.
>>>> Here is an strace. I can see a lot of sync_file_range operations.
>>> Yeah, these all look like the flusher thread, and shouldn't be hitting
>>> blkdev_fsync.  Can you confirm that with
>>>
>>>        filestore flusher = false
>>>        filestore sync flush = false
>>>
>>> you get no sync_file_range at all?  I wonder if this is also perf lying
>>> about the call chain.
>> Yes, setting this makes the sync_file_range calls go away.
>
> Okay.  That means either sync_file_range on a regular btrfs file is
> triggering blkdev_fsync somewhere in btrfs, there is an extremely sneaky
> bug that is mixing up file descriptors, or latencytop is lying.  I'm
> guessing the latter, given the other weirdness Josef and Chris were
> seeing.  :)
>
>> Is it safe to use these settings with "filestore btrfs snap = 0"?
>
> Yeah.  They're purely a performance thing to push as much dirty data to
> disk as quickly as possible to minimize the snapshot create latency.
> You'll notice the write throughput tends to tank with them off.
>
> sage


[-- Attachment #2: latencytop.txt.bz2 --]
[-- Type: application/x-bzip, Size: 5203 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
  2011-10-26  8:12     ` Christian Brunner
  (?)
@ 2011-10-26 16:32     ` Sage Weil
  -1 siblings, 0 replies; 47+ messages in thread
From: Sage Weil @ 2011-10-26 16:32 UTC (permalink / raw)
  To: Christian Brunner; +Cc: ceph-devel, linux-btrfs

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2096 bytes --]

On Wed, 26 Oct 2011, Christian Brunner wrote:
> 2011/10/26 Sage Weil <sage@newdream.net>:
> > On Wed, 26 Oct 2011, Christian Brunner wrote:
> >> >> > Christian, have you tweaked those settings in your ceph.conf?  It would be
> >> >> > something like 'journal dio = false'.  If not, can you verify that
> >> >> > directio shows true when the journal is initialized from your osd log?
> >> >> > E.g.,
> >> >> >
> >> >> >  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
> >> >> >
> >> >> > If directio = 1 for you, something else funky is causing those
> >> >> > blkdev_fsync's...
> >> >>
> >> >> I've looked it up in the logs - directio is 1:
> >> >>
> >> >> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
> >> >> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
> >> >> bytes, directio = 1
> >> >
> >> > Do you mind capturing an strace?  I'd like to see where that blkdev_fsync
> >> > is coming from.
> >>
> >> Here is an strace. I can see a lot of sync_file_range operations.
> >
> > Yeah, these all look like the flusher thread, and shouldn't be hitting
> > blkdev_fsync.  Can you confirm that with
> >
> >        filestore flusher = false
> >        filestore sync flush = false
> >
> > you get no sync_file_range at all?  I wonder if this is also perf lying
> > about the call chain.
> 
> Yes, setting this makes the sync_file_range calls go away.

Okay.  That means either sync_file_range on a regular btrfs file is 
triggering blkdev_fsync somewhere in btrfs, there is an extremely sneaky 
bug that is mixing up file descriptors, or latencytop is lying.  I'm 
guessing the latter, given the other weirdness Josef and Chris were 
seeing.  :)
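
If you want one more data point to tell those apart, an strace of the 
osd filtered to the interesting calls, something like 

  strace -f -T -e trace=sync_file_range,fsync,fdatasync -p <osd pid>

would show whether any fsync/fdatasync ever lands on the journal's 
block device fd (the fd number is visible in the 'journal _open' line 
of the osd log).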

> Is it safe to use these settings with "filestore btrfs snap = 0"?

Yeah.  They're purely a performance thing to push as much dirty data to 
disk as quickly as possible to minimize the snapshot create latency.  
You'll notice the write throughput tends to tank with them off.
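
Concretely, if you want to test that combination, the osd section of 
your ceph.conf would just look something like

       filestore btrfs snap = false
       filestore flusher = false
       filestore sync flush = false

(purely illustrative; the flusher settings only affect performance, not 
correctness).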

sage

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
       [not found] ` <Pine.LNX.4.64.1110252221510.6574@cobra.newdream.net>
@ 2011-10-26  8:12     ` Christian Brunner
  0 siblings, 0 replies; 47+ messages in thread
From: Christian Brunner @ 2011-10-26  8:12 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, linux-btrfs

2011/10/26 Sage Weil <sage@newdream.net>:
> On Wed, 26 Oct 2011, Christian Brunner wrote:
>> >> > Christian, have you tweaked those settings in your ceph.conf?  It would be
>> >> > something like 'journal dio = false'.  If not, can you verify that
>> >> > directio shows true when the journal is initialized from your osd log?
>> >> > E.g.,
>> >> >
>> >> >  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
>> >> >
>> >> > If directio = 1 for you, something else funky is causing those
>> >> > blkdev_fsync's...
>> >>
>> >> I've looked it up in the logs - directio is 1:
>> >>
>> >> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
>> >> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
>> >> bytes, directio = 1
>> >
>> > Do you mind capturing an strace?  I'd like to see where that blkdev_fsync
>> > is coming from.
>>
>> Here is an strace. I can see a lot of sync_file_range operations.
>
> Yeah, these all look like the flusher thread, and shouldn't be hitting
> blkdev_fsync.  Can you confirm that with
>
>        filestore flusher = false
>        filestore sync flush = false
>
> you get no sync_file_range at all?  I wonder if this is also perf lying
> about the call chain.

Yes, setting this makes the sync_file_range calls go away.

Is it safe to use these settings with "filestore btrfs snap = 0"?

Thanks,
Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
@ 2011-10-26  8:12     ` Christian Brunner
  0 siblings, 0 replies; 47+ messages in thread
From: Christian Brunner @ 2011-10-26  8:12 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, linux-btrfs

2011/10/26 Sage Weil <sage@newdream.net>:
> On Wed, 26 Oct 2011, Christian Brunner wrote:
>> >> > Christian, have you tweaked those settings in your ceph.conf?  It would be
>> >> > something like 'journal dio = false'.  If not, can you verify that
>> >> > directio shows true when the journal is initialized from your osd log?
>> >> > E.g.,
>> >> >
>> >> >  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
>> >> >
>> >> > If directio = 1 for you, something else funky is causing those
>> >> > blkdev_fsync's...
>> >>
>> >> I've looked it up in the logs - directio is 1:
>> >>
>> >> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
>> >> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
>> >> bytes, directio = 1
>> >
>> > Do you mind capturing an strace?  I'd like to see where that blkdev_fsync
>> > is coming from.
>>
>> Here is an strace. I can see a lot of sync_file_range operations.
>
> Yeah, these all look like the flusher thread, and shouldn't be hitting
> blkdev_fsync.  Can you confirm that with
>
>        filestore flusher = false
>        filestore sync flush = false
>
> you get no sync_file_range at all?  I wonder if this is also perf lying
> about the call chain.

Yes, setting this makes the sync_file_range calls go away.

Is it safe to use these settings with "filestore btrfs snap = 0"?

Thanks,
Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2011-10-31 14:04 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-24  1:54 ceph on non-btrfs file systems Sage Weil
2011-10-24 16:22 ` Christian Brunner
2011-10-24 17:06   ` ceph on btrfs [was Re: ceph on non-btrfs file systems] Sage Weil
2011-10-24 19:51     ` Josef Bacik
2011-10-24 20:35       ` Chris Mason
2011-10-24 21:34         ` Christian Brunner
2011-10-24 21:34           ` Christian Brunner
2011-10-24 21:37           ` Arne Jansen
2011-10-25 11:56       ` Christian Brunner
2011-10-25 12:23         ` Josef Bacik
2011-10-25 12:23           ` Josef Bacik
2011-10-25 14:25           ` Christian Brunner
2011-10-25 15:00             ` Josef Bacik
2011-10-25 15:00               ` Josef Bacik
2011-10-25 15:05             ` Josef Bacik
2011-10-25 15:05               ` Josef Bacik
2011-10-25 15:13               ` Christian Brunner
2011-10-25 15:13                 ` Christian Brunner
2011-10-25 20:15               ` Chris Mason
2011-10-25 20:22                 ` Josef Bacik
2011-10-26  0:16                   ` Christian Brunner
2011-10-26  0:16                     ` Christian Brunner
2011-10-26  8:21                     ` Christian Brunner
2011-10-26  8:21                       ` Christian Brunner
2011-10-26 13:23                   ` Chris Mason
2011-10-27 15:07                     ` Josef Bacik
2011-10-27 18:14                       ` Josef Bacik
2011-10-25 16:36           ` Sage Weil
2011-10-25 19:09             ` Christian Brunner
2011-10-25 19:09               ` Christian Brunner
2011-10-25 22:27               ` Sage Weil
2011-10-27 19:52         ` Josef Bacik
2011-10-27 19:52           ` Josef Bacik
2011-10-27 20:39           ` Christian Brunner
2011-10-27 20:39             ` Christian Brunner
     [not found]             ` <CAO47_-_+Oqs1sHeYEBfxgwugSUYKftQLQ9jEyDgFPFu8fXe34w@mail.gmail.com>
     [not found]               ` <CAO47_-8YGAxoYOBRKxLP2HULqEtV5bMugzzybq3srCVFZczgGA@mail.gmail.com>
2011-10-31 10:25                 ` Christian Brunner
2011-10-31 13:29                   ` Christian Brunner
2011-10-31 14:04                     ` Josef Bacik
2011-10-25 10:23     ` Christoph Hellwig
2011-10-25 16:23       ` Sage Weil
     [not found] <CAO47_-9L7SdQwhJ27B6yzrqG8xvj+CeZHeSutgeCixcv7kUidg@mail.gmail.com>
     [not found] ` <Pine.LNX.4.64.1110252221510.6574@cobra.newdream.net>
2011-10-26  8:12   ` Christian Brunner
2011-10-26  8:12     ` Christian Brunner
2011-10-26 16:32     ` Sage Weil
     [not found] <4EA86FD7.4030407@tuxadero.com>
2011-10-27 10:53 ` Martin Mailand
2011-10-27 10:59   ` Stefan Majer
2011-10-27 10:59     ` Stefan Majer
2011-10-27 11:17     ` Martin Mailand
