* [Lustre-devel] releasing BKL in lustre_fill_super
@ 2010-10-29  3:07 Jeremy Filizetti
  2010-11-02  7:40 ` Andreas Dilger
  0 siblings, 1 reply; 11+ messages in thread
From: Jeremy Filizetti @ 2010-10-29  3:07 UTC (permalink / raw)
  To: lustre-devel

I've seen a lot of issues with mounting all of our OSTs on an OSS taking an
excessive amount of time.  Most of the individual OST mount time was related
to bug 18456, but we still see mount times take minutes per OST with the
relevant patches.  At mount time the llog does a small write which ends up
scanning nearly our entire 7+ TB OSTs to find the desired block and complete
the write. To reduce startup time mounting multiple OSTs simultaneously
would help, but during that process it looks like the code path is still
holding the big kernel lock from the mount system call.  During that time
all other mount commands are in an uninterruptible sleep (D state).  Based
on the discussions from bug 23790 it doesn't appear that Lustre relies on
the BKL so would it be reasonable to call unlock_kernel in lustre_fill_super
or at least before lustre_start_mgc and lock it again before the return so
multiple OSTs could be mounting at the same time?  I think the same thing
would apply to unmounting but I haven't looked at the code path there.
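
To make this concrete, something like the following is what I have in mind
(sketch only, untested; the exact call site and the error handling are
glossed over):

static int lustre_fill_super(struct super_block *sb, void *data, int silent)
{
        int rc;

        /* ... existing lustre_sb_info setup and option parsing ... */

        unlock_kernel();            /* BKL taken by the mount syscall; drop
                                     * it so other OSTs can mount concurrently */
        rc = lustre_start_mgc(sb);  /* slow: MGS connect and llog processing */
        /* ... rest of the server/client mount path runs here ... */
        lock_kernel();              /* retake before returning to the VFS */

        return rc;
}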

Jeremy


* [Lustre-devel] releasing BKL in lustre_fill_super
  2010-10-29  3:07 [Lustre-devel] releasing BKL in lustre_fill_super Jeremy Filizetti
@ 2010-11-02  7:40 ` Andreas Dilger
  2010-11-03 15:57   ` Ashley Pittman
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Dilger @ 2010-11-02  7:40 UTC (permalink / raw)
  To: lustre-devel

On 2010-10-28, at 21:07, Jeremy Filizetti wrote:
> I've seen a lot of issues with mounting all of our OSTs on an OSS taking an excessive amount of time.  Most of the individual OST mount time was related to bug 18456, but we still see mount times take minutes per OST with the relevant patches.  At mount time the llog does a small write which ends up scanning nearly our entire 7+ TB OSTs to find the desired block and complete the write.
> 
> To reduce startup time mounting multiple OSTs simultaneously would help, but during that process it looks like the code path is still holding the big kernel lock from the mount system call.  During that time all other mount commands are in an uninterruptible sleep (D state).  Based on the discussions from bug 23790 it doesn't appear that Lustre relies on the BKL so would it be reasonable to call unlock_kernel in lustre_fill_super or at least before lustre_start_mgc and lock it again before the return so multiple OSTs could be mounting at the same time?  I think the same thing would apply to unmounting but I haven't looked at the code path there.

IIRC, the BKL is held at mount time to avoid potential races with mounting the same device multiple times.  However, the risk of this is pretty small, and can be controlled on an OSS, which has limited access.  Also, this code is being removed in newer kernels, as I don't think it is needed by most filesystems.

I _think_ it should be OK, but YMMV.

Cheers, Andreas


* [Lustre-devel] releasing BKL in lustre_fill_super
  2010-11-02  7:40 ` Andreas Dilger
@ 2010-11-03 15:57   ` Ashley Pittman
  2010-11-03 20:51     ` Jeremy Filizetti
  2010-11-08 22:16     ` Jeremy Filizetti
  0 siblings, 2 replies; 11+ messages in thread
From: Ashley Pittman @ 2010-11-03 15:57 UTC (permalink / raw)
  To: lustre-devel


On 2 Nov 2010, at 07:40, Andreas Dilger wrote:

> On 2010-10-28, at 21:07, Jeremy Filizetti wrote:
>> I've seen a lot of issues with mounting all of our OSTs on an OSS taking an excessive amount of time.  Most of the individual OST mount time was related to bug 18456, but we still see mount times take minutes per OST with the relevant patches.  At mount time the llog does a small write which ends up scanning nearly our entire 7+ TB OSTs to find the desired block and complete the write.
>> 
>> To reduce startup time mounting multiple OSTs simultaneously would help, but during that process it looks like the code path is still holding the big kernel lock from the mount system call.  During that time all other mount commands are in an uninterruptible sleep (D state).  Based on the discussions from bug 23790 it doesn't appear that Lustre relies on the BKL so would it be reasonable to call unlock_kernel in lustre_fill_super or at least before lustre_start_mgc and lock it again before the return so multiple OSTs could be mounting at the same time?  I think the same thing would apply to unmounting but I haven't looked at the code path there.
> 
> IIRC, the BKL is held at mount time to avoid potential races with mounting the same device multiple times.  However, the risk of this is pretty small, and can be controlled on an OSS, which has limited access.  Also, this code is being removed in newer kernels, as I don't think it is needed by most filesystems.
> 
> I _think_ it should be OK, but YMMV.

I've been thinking about this and can't make up my mind on whether it's a
good idea or not.  We often see mount times in the ten-minute region, so
anything we can do to speed them up is a good thing.  I find it hard to
believe the core kernel mount code would accept you doing this behind its
back though, and I'd be surprised if it worked.

Then again, when we were discussing this yesterday: is the mount command
*really* holding the BKL for the entire duration?  Surely if this lock were
being held for minutes we'd notice it in other ways, because other kernel
paths that require this lock would block?

Ashley.


* [Lustre-devel] releasing BKL in lustre_fill_super
  2010-11-03 15:57   ` Ashley Pittman
@ 2010-11-03 20:51     ` Jeremy Filizetti
  2010-11-04 11:35       ` Ashley Pittman
  2010-11-08 22:16     ` Jeremy Filizetti
  1 sibling, 1 reply; 11+ messages in thread
From: Jeremy Filizetti @ 2010-11-03 20:51 UTC (permalink / raw)
  To: lustre-devel

Are you seeing individual OST mount times in the 10 minute region or for all
OSTs?

Releasing the lock is more common than you think.  Both ext3 and ext4
release the lock in their ext[34]_fill_super and reacquire it before
exiting.  So when Lustre does the pre-mount it gets released, and again
when it does the real mount for ldiskfs; but after that, during the first
llog write when the buddy allocator is initializing, I don't see where it
could be getting released.  I was going to try to confirm things 100% with
systemtap, but the version I have doesn't seem to pick up the lustre
modules (or any additionally added ones, for that matter).  I don't have
any easy way to upgrade it either.
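
For reference, the shape of that pattern in ext3 (condensed from my reading
of fs/ext3/super.c, so treat this as a sketch rather than an exact excerpt):

static int ext3_fill_super(struct super_block *sb, void *data, int silent)
{
        /* the VFS mount path enters here holding the BKL */
        unlock_kernel();

        /* long-running work: read the superblock, load and recover the
         * journal, set up allocator state, ... */

        lock_kernel();  /* reacquired before returning to the VFS; the
                         * error paths retake it as well */
        return 0;
}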

Jeremy



> I've been thinking about this and can't make up my mind on whether it's a
> good idea or not.  We often see mount times in the ten-minute region, so
> anything we can do to speed them up is a good thing.  I find it hard to
> believe the core kernel mount code would accept you doing this behind its
> back though, and I'd be surprised if it worked.
>
> Then again, when we were discussing this yesterday: is the mount command
> *really* holding the BKL for the entire duration?  Surely if this lock were
> being held for minutes we'd notice it in other ways, because other kernel
> paths that require this lock would block?
>
> Ashley.


* [Lustre-devel] releasing BKL in lustre_fill_super
  2010-11-03 20:51     ` Jeremy Filizetti
@ 2010-11-04 11:35       ` Ashley Pittman
  0 siblings, 0 replies; 11+ messages in thread
From: Ashley Pittman @ 2010-11-04 11:35 UTC (permalink / raw)
  To: lustre-devel


That is per OST.  It's at the outside of the times we see, but it's not uncommon.  We still suffer from bug 18456, and this mainly happens when OSTs get uncomfortably full.

Ashley.

On 3 Nov 2010, at 20:51, Jeremy Filizetti wrote:

> Are you seeing individual OST mount times in the 10 minute region or for all OSTs?  
> 
> Releasing the lock is more common than you think.  Both ext3 and ext4 release the lock in their ext[34]_fill_super and reacquire it before exiting.  So when Lustre does the pre-mount it gets released, and again when it does the real mount for ldiskfs; but after that, during the first llog write when the buddy allocator is initializing, I don't see where it could be getting released.  I was going to try to confirm things 100% with systemtap, but the version I have doesn't seem to pick up the lustre modules (or any additionally added ones, for that matter).  I don't have any easy way to upgrade it either.
> 
> Jeremy


* [Lustre-devel] releasing BKL in lustre_fill_super
  2010-11-03 15:57   ` Ashley Pittman
  2010-11-03 20:51     ` Jeremy Filizetti
@ 2010-11-08 22:16     ` Jeremy Filizetti
  2010-11-08 23:26       ` Andreas Dilger
  1 sibling, 1 reply; 11+ messages in thread
From: Jeremy Filizetti @ 2010-11-08 22:16 UTC (permalink / raw)
  To: lustre-devel

I've had a chance to take a longer look at this and I think I was wrong
about the BKL.  I still don't see where it would be getting released, but
the problem appears to be that all OBDs on a server are using the same MGC.

In server_start_targets, server_mgc_set_fs acquires the cl_mgc_sem, holds
it through lustre_process_log, and releases it with server_mgc_clear_fs
after that.  As a result, all of the mounts started at the same time end up
waiting on the cl_mgc_sem semaphore, and each OBD has to process its llog
one at a time.  When you have OSTs near capacity, as in bug 18456, the
first write when processing the llog can take minutes to complete.
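
The serialization, as I read it, looks roughly like this (paraphrased with
the bodies elided; not an exact excerpt from the source):

static int server_start_targets(struct super_block *sb, struct vfsmount *mnt)
{
        struct lustre_sb_info *lsi = s2lsi(sb);
        struct obd_device *mgc = lsi->lsi_mgc;  /* shared by every target */
        int rc;

        rc = server_mgc_set_fs(mgc, sb);        /* takes cl_mgc_sem */
        /* ... */
        rc = lustre_process_log(sb, ...);       /* copies the config llog from
                                                 * the MGS; the first write can
                                                 * scan a nearly full OST for
                                                 * minutes (bug 18456) */
        /* ... */
        server_mgc_clear_fs(mgc);               /* releases cl_mgc_sem */
        return rc;
}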

I don't see any easy way to fix this because they are all using the same
sb->lsi->lsi_mgc.  I was thinking maybe some of these structures could just
modify a copy of that data instead of the actual structure itself, but
there are so many functions called that it's hard to see whether anything
would be using it.

Any ideas for a way to work around this?

Jeremy


* [Lustre-devel] releasing BKL in lustre_fill_super
  2010-11-08 22:16     ` Jeremy Filizetti
@ 2010-11-08 23:26       ` Andreas Dilger
  2010-11-09 15:55         ` Jeremy Filizetti
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Dilger @ 2010-11-08 23:26 UTC (permalink / raw)
  To: lustre-devel

On 2010-11-08, at 15:16, Jeremy Filizetti wrote:
> I've had a chance to take a longer look at this and I think I was wrong about the BKL.  I still don't see where it would be getting released, but the problem appears to be that all OBDs on a server are using the same MGC.
>  
> In server_start_targets, server_mgc_set_fs acquires the cl_mgc_sem, holds it through lustre_process_log, and releases it with server_mgc_clear_fs after that.  As a result, all of the mounts started at the same time end up waiting on the cl_mgc_sem semaphore, and each OBD has to process its llog one at a time.  When you have OSTs near capacity, as in bug 18456, the first write when processing the llog can take minutes to complete.
>  
> I don't see any easy way to fix this because they are all using the same sb->lsi->lsi_mgc.  I was thinking maybe some of these structures could just modify a copy of that data instead of the actual structure itself, but there are so many functions called that it's hard to see whether anything would be using it.
>  
> Any ideas for a way to work around this?

The first thing I always think about when seeing a problem like this is not "how to reduce this contention" but "do we need to be doing this at all"?

Without having looked at that code in a long time, I'm having a hard time thinking why the OST needs to allocate a new block for the config during mount.  It is probably worthwhile to investigate why this is happening in the first place, and possibly just eliminating useless work, rather than making it slightly less slow.

Unfortunately, I don't have the bandwidth to look at this, but maybe Nathan or someone with more familiarity with the config code can chime in.


Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.


* [Lustre-devel] releasing BKL in lustre_fill_super
  2010-11-08 23:26       ` Andreas Dilger
@ 2010-11-09 15:55         ` Jeremy Filizetti
  2010-11-10 10:40           ` Andreas Dilger
  0 siblings, 1 reply; 11+ messages in thread
From: Jeremy Filizetti @ 2010-11-09 15:55 UTC (permalink / raw)
  To: lustre-devel

> The first thing I always think about when seeing a problem like this is not
> "how to reduce this contention" but "do we need to be doing this at all"?
>
> Without having looked at that code in a long time, I'm having a hard time
> thinking why the OST needs to allocate a new block for the config during
> mount.  It is probably worthwhile to investigate why this is happening in
> the first place, and possibly just eliminating useless work, rather than
> making it slightly less slow.
>

I can't really answer whether "we need to do it", but I can elaborate on
what is happening.  The actual write being done during lustre_process_log
happens when the llog is copied from the remote server to the local server.
I assume this is at least a necessary step, and there's no getting rid of
the llog without some sort of overhaul.  I don't really have an idea of how
large the llog can get, but at the sizes I've seen it does seem reasonable
that it could be copied from the remote MGS into memory, the lock released,
and the log then written out to disk.
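
Something like the following is the shape of what I mean (purely
illustrative; mgc_fetch_llog and llog_write_local are made-up names here,
not real Lustre functions):

static int process_config_llog(struct obd_device *mgc, struct super_block *sb,
                               const char *logname)
{
        void *buf = NULL;
        int len = 0;
        int rc;

        down(&mgc->u.cli.cl_mgc_sem);
        rc = mgc_fetch_llog(mgc, logname, &buf, &len);  /* network copy only */
        up(&mgc->u.cli.cl_mgc_sem);     /* the next OST's mount can proceed */
        if (rc)
                return rc;

        /* the slow part, now outside the semaphore: the first write may
         * scan a nearly full OST for free blocks (bug 18456) */
        rc = llog_write_local(sb, logname, buf, len);

        OBD_FREE(buf, len);
        return rc;
}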




* [Lustre-devel] releasing BKL in lustre_fill_super
  2010-11-09 15:55         ` Jeremy Filizetti
@ 2010-11-10 10:40           ` Andreas Dilger
  2010-12-16 13:47             ` Jeremy Filizetti
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Dilger @ 2010-11-10 10:40 UTC (permalink / raw)
  To: lustre-devel

On 2010-11-09, at 08:55, Jeremy Filizetti wrote:
>> The first thing I always think about when seeing a problem like this is not "how to reduce this contention" but "do we need to be doing this at all"?
>  
> I can't really answer whether "we need to do it", but I can elaborate on what is happening.  The actual write being done during lustre_process_log happens when the llog is copied from the remote server to the local server.  I assume this is at least a necessary step, and there's no getting rid of the llog without some sort of overhaul.

In fact, this config llog copying should only be done on the first mount of the OST, or if the configuration has changed.

We've actually removed this entirely for the 2.4 release, though that doesn't help you now.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.


* [Lustre-devel] releasing BKL in lustre_fill_super
  2010-11-10 10:40           ` Andreas Dilger
@ 2010-12-16 13:47             ` Jeremy Filizetti
  2010-12-16 14:39               ` Andreas Dilger
  0 siblings, 1 reply; 11+ messages in thread
From: Jeremy Filizetti @ 2010-12-16 13:47 UTC (permalink / raw)
  To: lustre-devel

On Wed, Nov 10, 2010 at 5:40 AM, Andreas Dilger
<andreas.dilger@oracle.com> wrote:

> On 2010-11-09, at 08:55, Jeremy Filizetti wrote:
> >> The first thing I always think about when seeing a problem like this is
> not "how to reduce this contention" but "do we need to be doing this at
> all"?
> >
> > I can't really answer whether "we need to do it", but I can elaborate
> on what is happening.  The actual write being done during
> lustre_process_log happens when the llog is copied from the remote server
> to the local server.  I assume this is at least a necessary step, and
> there's no getting rid of the llog without some sort of overhaul.
>
> In fact, this config llog copying should only be done on the first mount of
> the OST, or if the configuration has changed.
>
> We've actually removed this entirely for the 2.4 release, though that
> doesn't help you now.
>

Does that mean the llog component of Lustre is completely removed?  Is 2.4
an Oracle-only release?




* [Lustre-devel] releasing BKL in lustre_fill_super
  2010-12-16 13:47             ` Jeremy Filizetti
@ 2010-12-16 14:39               ` Andreas Dilger
  0 siblings, 0 replies; 11+ messages in thread
From: Andreas Dilger @ 2010-12-16 14:39 UTC (permalink / raw)
  To: lustre-devel

On 2010-12-16, at 6:47, Jeremy Filizetti <jeremy.filizetti@gmail.com> wrote:
> On Wed, Nov 10, 2010 at 5:40 AM, Andreas Dilger <andreas.dilger@oracle.com> wrote:
> In fact, this config llog copying should only be done on the first mount of the OST, or if the configuration has changed.
> 
> We've actually removed this entirely for the 2.4 release, though that doesn't help you now.
> 
> Does that mean the llog component of Lustre is completely removed?  Is 2.4 an Oracle-only release?

No, only the copying of the config llog from the MGS to the OST has been removed.  The llog subsystem is still used to maintain distributed operation consistency.

"Lustre 2.4" is the anticipated release number when that change might become available. As yet, it is still pre-alpha code, and while this change is available in bugzilla as a series of patches, it would need some effort to port it for 1.8.

Cheers, Andreas

