* Re: System Lockup while testing nilfs
       [not found] ` <mailman.1.1245380401.15040.users-JrjvKiOkagjYtjvyW6yDsg@public.gmane.org>
@ 2009-06-19  7:37   ` Dipl.-Ing. Michael Niederle
  2009-06-19 11:18     ` Ryusuke Konishi
  0 siblings, 1 reply; 4+ messages in thread
From: Dipl.-Ing. Michael Niederle @ 2009-06-19  7:37 UTC (permalink / raw)
  To: users-JrjvKiOkagjYtjvyW6yDsg

Hi!

> I have turned on debug in the nilfs_cleanerd config file and I
> am capturing the log to a separate location. The lockup seems to occur
> right after a "2 segments selected to be cleaned" message was sent.

> gentoo distribution
> 2.6.30 kernel x86
> nilfs-utils 2.0.12

This looks very much like the same problem I encountered! It seems that under
some circumstances the GC daemon issues too many read/write requests.
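
A quick way to confirm that it is the cleaner generating the I/O burst
(standard tools only; "sdb" is just a placeholder for whatever disk holds
the nilfs partition):

 # iostat -x sdb 5
 # ps -o pid,stat,wchan,cmd -C nilfs_cleanerd

The first prints per-device read/write throughput every 5 seconds; the second
shows whether nilfs_cleanerd is running and what it is currently blocked on.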

@Ryusuke Konishi: The last crash occurred with default GC settings.

Greetings, Michael


* Re: System Lockup while testing nilfs
  2009-06-19  7:37   ` System Lockup while testing nilfs Dipl.-Ing. Michael Niederle
@ 2009-06-19 11:18     ` Ryusuke Konishi
  0 siblings, 0 replies; 4+ messages in thread
From: Ryusuke Konishi @ 2009-06-19 11:18 UTC (permalink / raw)
  To: users-JrjvKiOkagjYtjvyW6yDsg, mniederle-RbZlAiThDcE; +Cc: James M Long

Hi,
On Fri, 19 Jun 2009 09:37:06 +0200, "Dipl.-Ing. Michael Niederle" wrote:
> Hi!
> 
> > I have turned on debug in the nilfs_cleanerd config file and I
> > am capturing the log to a separate location. The lockup seems to occur
> > right after a "2 segments selected to be cleaned" message was sent.
> 
> > gentoo distribution
> > 2.6.30 kernel x86
> > nilfs-utils 2.0.12
> 
> This looks very much like the same problem I encountered! It seems that under
> some circumstances the GC daemon issues too many read/write requests.
> 
> @Ryusuke Konishi: The last crash occurred with default GC settings.
> 
> Greetings, Michael

By default, the GC reads and writes 2 segments every 5 seconds.

  1 segment = 8MB (by default)

So, at maximum, it causes roughly 3.2MB read + 3.2MB write per second.

If all blocks in the segments are live, the GC will copy all of them into
new logs; that is 6.4MB/s of I/O at worst.

If dead blocks occupy 50% of the segments, the I/O load is estimated to
be 3.2MB/s since the number of copied blocks drops by half.
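
For reference, the arithmetic above can be reproduced with a one-liner (only
the numbers from this mail, not an actual cleanerd interface; vary "live" for
the live-block fraction):

 # awk -v seg_mb=8 -v nseg=2 -v interval=5 -v live=1.0 \
     'BEGIN { r = seg_mb * nseg / interval * live;
              printf "read %.1f + write %.1f = %.1f MB/s\n", r, r, 2 * r }'
 read 3.2 + write 3.2 = 6.4 MB/s

With live=0.5 the same line prints 3.2 MB/s total, matching the 50%-dead-block
estimate above.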

In circumstances where many snapshots exist or the protection period is set
long, the ratio of must-copy blocks can be very high.
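
These rates are governed by the cleaner's configuration.  The following is a
rough sketch only (parameter names as found in typical nilfs-utils 2.0.x
/etc/nilfs_cleanerd.conf files; please check the file and man page shipped
with your version, and treat the values as illustrations, not tuning advice):

 protection_period     3600    # seconds a checkpoint stays protected from GC
 nsegments_per_clean   2       # segments processed per cleaning pass
 cleaning_interval     5       # seconds between cleaning passes

Lowering nsegments_per_clean or raising cleaning_interval throttles the
cleaner, while a long protection_period keeps more blocks live and therefore
raises the copy ratio discussed above.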

Does this match what you are seeing?

Essentially, the GC should be designed to minimize load by adapting to the
situation.  Unfortunately, the current GC is not intelligent at all.
It cannot

 - stop intelligently
 - resume operation intelligently
 - balance reclamation speed, nor
 - select target segments so that the I/O load gets minimized.

Much work to do.

Regards,
Ryusuke Konishi


* Re: System Lockup while testing nilfs
       [not found] ` <ddfcb4780906181809w4db9aa15u3a28676083ca3405-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-06-19  2:18   ` Ryusuke Konishi
  0 siblings, 0 replies; 4+ messages in thread
From: Ryusuke Konishi @ 2009-06-19  2:18 UTC (permalink / raw)
  To: james.m.long-Re5JQEeQqe8AvxtiuMwx3w, users-JrjvKiOkagjYtjvyW6yDsg

Hi!
On Thu, 18 Jun 2009 21:09:44 -0400, James M Long wrote:
> I am a systems engineer who was attracted to nilfs due to some of the
> features that the file system already has. I talked one of our
> developers into doing a performance test against our in-house software
> application that is running in a nilfs partition.
> My issue is that the host server continues to lock up during the
> performance test. I thought it might have something to do with the
> server itself, but the performance test runs successfully on the same
> server, on the same hard drive, in a different partition.
> My question is: what can I do to monitor and log the issue, since the
> only way to get back into the machine is to do a hard power cycle? I
> have turned on debug in the nilfs_cleanerd config file and I am
> capturing the log to a separate location. The lockup seems to occur
> right after a "2 segments selected to be cleaned" message was sent.

Thank you for your interest in nilfs.

First, when you hit a hang problem, the kernel's magic sysrq feature is
helpful for tracking down the cause.  The following operation will output a
stack dump of every task if the sysrq feature is enabled.

 # echo t > /proc/sysrq-trigger
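
If the box locks up hard, two generic kernel facilities (not nilfs-specific)
help make sure the dump actually survives; the addresses and interface below
are placeholders:

 # echo 1 > /proc/sys/kernel/sysrq
 # dmesg | tail -n 200

The first makes sure sysrq is enabled; the second shows the task dump, which
lands in the kernel log.  If the machine dies before you can read it, the
netconsole module can stream kernel messages to another host:

 # modprobe netconsole netconsole=@/eth0,6666@192.168.1.10/

On the receiving host, something like "nc -u -l 6666 > kernel.log" (exact nc
options vary by netcat flavor) will capture the output across the lockup.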
 
In the standalone module version of nilfs, you can get some debug
information by enabling CONFIG_NILFS_DEBUG=y in fs/Makefile.

The standalone package is available from nilfs.org or the git tree
shown in:

[1]  http://www.nilfs.org/git/

In the debug build module, you can adjust verbosity levels of debug
messages:

 level3:
 # echo "-vvv segment -vvv seginfo" > /proc/fs/nilfs2/debug_option

 level2:
 # echo "-vv segment -vv seginfo" > /proc/fs/nilfs2/debug_option

 level1 (default):
 # echo "-v segment -v seginfo" > /proc/fs/nilfs2/debug_option

One problem is that the standalone package does not support the 2.6.30
kernel, but it is still useful for debugging purposes.

Yesterday, I posted a patch that fixes a hang problem in the log writer.
I'm planning to send it upstream, but it's still under testing.

Please try the patch. It's available from the archive:

[2] https://www.nilfs.org/pipermail/users/2009-June/000713.html
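
Applying it to a 2.6.30 source tree would look roughly like this (the paths
and the saved patch filename are placeholders; whether it applies cleanly
depends on your exact tree):

 # cd /usr/src/linux-2.6.30
 # patch -p1 --dry-run < /tmp/nilfs2-log-writer-fix.patch
 # patch -p1 < /tmp/nilfs2-log-writer-fix.patch

Then rebuild the nilfs2 module (or the kernel) and reboot into it before
rerunning the test.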


Cheers,
Ryusuke Konishi

> gentoo distribution
> 2.6.30 kernel x86
> nilfs-utils 2.0.12
> partition 466GB
> application uses postgresql, sqlite, lucene, and python
> all db's and indexes are being written into the nilfs partition, so
> there are tons of file changes happening at all times during the test.
> During the test, the file system is growing at a rate of 3 MB/sec, and
> the test runs for more than 4 hours straight. The total data the test
> writes is about 1.4 GB when it completes, but as I stated before, most
> of the data is in lucene indexes and postgresql and sqlite db's.
> 
> I am under the impression that our application is hammering the file
> system with too many read and write requests at the same time.
> 
> Thanks for any suggestions,
> 
> James...


* System Lockup while testing nilfs
@ 2009-06-19  1:09 James M Long
       [not found] ` <ddfcb4780906181809w4db9aa15u3a28676083ca3405-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: James M Long @ 2009-06-19  1:09 UTC (permalink / raw)
  To: users-JrjvKiOkagjYtjvyW6yDsg

I am a systems engineer who was attracted to nilfs due to some of the
features that the file system already has. I talked one of our
developers into doing a performance test against our in-house software
application that is running in a nilfs partition.
My issue is that the host server continues to lock up during the
performance test. I thought it might have something to do with the
server itself, but the performance test runs successfully on the same
server, on the same hard drive, in a different partition.
My question is: what can I do to monitor and log the issue, since the
only way to get back into the machine is to do a hard power cycle? I
have turned on debug in the nilfs_cleanerd config file and I am
capturing the log to a separate location. The lockup seems to occur
right after a "2 segments selected to be cleaned" message was sent.

gentoo distribution
2.6.30 kernel x86
nilfs-utils 2.0.12
partition 466GB
application uses postgresql, sqlite, lucene, and python
all db's and indexes are being written into the nilfs partition, so
there are tons of file changes happening at all times during the test.
During the test, the file system is growing at a rate of 3 MB/sec, and
the test runs for more than 4 hours straight. The total data the test
writes is about 1.4 GB when it completes, but as I stated before, most
of the data is in lucene indexes and postgresql and sqlite db's.

I am under the impression that our application is hammering the file
system with too many read and write requests at the same time.

Thanks for any suggestions,

James...

