* large fs testing
@ 2009-05-23 13:53 Ric Wheeler
  2009-05-26 12:21 ` Joshua Giles
  2009-05-26 17:39 ` Nick Dokos
  0 siblings, 2 replies; 10+ messages in thread
From: Ric Wheeler @ 2009-05-23 13:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christoph Hellwig, Douglas Shakshober, Joshua Giles,
	Valerie Aurora, Eric Sandeen, Steven Whitehouse, Edward Shishkin,
	Josef Bacik, Jeff Moyer, Chris Mason, Whitney, Eric,
	Theodore Tso

Jeff Moyer & I have been working with the EMC elab over the last week or so,
testing ext4, xfs and gfs2 at roughly 80TB striped across a set of 12TB LUNs
(single server, 6GB of DRAM, two quad-core, HT-enabled CPUs).

The goals of the testing are (in decreasing priority) to validate Val's 64-bit
e2fsprogs patches for ext4, to do a very quick sanity check that XFS does
indeed scale as well as I hear (and it has so far :-)), and to test the gfs2
tools at that high capacity. There was not enough time to get it all done, and
significant fumbling on my part made it go even slower.

Nevertheless, I have come to a rough idea of what a useful benchmark would be.
If this sounds sane to all, I would like to try to put something together that
we could provide to places like EMC that occasionally have large storage
available, are not kernel hackers, but would be willing to test for us. It
will need to be fairly bulletproof and, I assume, should avoid producing
performance numbers on the storage for normal workloads (to avoid leaking
competitive benchmarks).

Motivation - all things being equal, users benefit from having all of their
storage consumed by one massive file system, since that single file system
manages space allocation, avoids seekiness, etc. (something that applications
have to do manually when using sets of file systems, which is the current
state of the art for ext3, for example).

The challenges are:

(1) object count - how many files can you pack into that file system with 
reasonable performance? (The test to date filled the single ext4 fs with 207 
million 20KB files)

(2) files per directory - how many files per directory?

(3) FS creation time - can you create a file system in reasonable time? 
(mkfs.xfs took seconds, mkfs.ext4 took 90 minutes). I think that 90 minutes is 
definitely on the painful side, but usable for most.

(4) FS check time at a given fill rate for a healthy device (no IO errors).
Testing at empty, 25%, 50%, 75%, 95% and full would all be interesting (a rough
driver for this is sketched just after this list). Can you run these checks
with a reasonable amount of DRAM - if not, what guidance do we need to give to
customers on how big the servers need to be?

It would seem to be a nice goal to be able to fsck a file system in one working
day - say 8 hours - so that you could get a customer back on their feet, but
maybe 24 hours would be an acceptable outside limit?

(5) Write rate as the fs fills (picking the same set of fill rates?)
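
To make (4) and (5) a bit more concrete, here is a very rough sketch of the
kind of driver script I have in mind - the device, mount point and the fill
step are placeholders, and parsing df is just one way to read the fill level:

#!/bin/sh
# Sketch only - fill the fs in stages and time a full fsck at each stage.
DEV=/dev/mapper/bigvol      # placeholder device/LV name
MNT=/mnt/bigfs              # placeholder mount point

for TARGET in 25 50 75 95 100; do
    mount "$DEV" "$MNT"
    # Keep filling until df reports the target fill level; fill_some_more
    # stands in for whatever tool we settle on (fs_mark, etc.) and is where
    # the write-rate numbers for (5) would come from.
    while :; do
        used=$(df -P "$MNT" | awk 'NR==2 { gsub("%","",$5); print $5 }')
        [ "$used" -ge "$TARGET" ] && break
        fill_some_more "$MNT" || break    # break on ENOSPC at the 100% step
    done
    umount "$MNT"
    echo "=== fsck at ${TARGET}% full ==="
    time fsck.ext4 -f -n "$DEV"           # -n: read-only check, fix nothing
done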

To make this a somewhat tractable problem, I wanted to define small (20KB),
medium (MP3-sized, say 4MB) and large (video-sized, 4GB?) files to do the test
with. I used fs_mark (no fsyncs and 256 directories) to fill the file system
(at least until my patience/time ran out!). With these options, it still hits
very high file/directory counts (I am thinking about tweaking fs_mark to
dynamically create a time-based directory scheme, something like day/hour/min,
and giving it an option to stop at a specified fill rate).
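
For reference, the small-file fills used an invocation along these lines
(paths and counts below are illustrative, and I am quoting the option letters
from memory, so check them against fs_mark -h before reusing this):

# Small files - 20KB each, spread over 256 subdirectories, no fsyncs (-S 0),
# keeping the files between iterations (-k) so the fs actually fills up:
fs_mark -d /mnt/bigfs/data -D 256 -n 100000 -s 20480 -S 0 -t 8 -L 1000 -k

# The medium and large passes would just change the file size:
#   -s $((4 * 1024 * 1024))          # "MP3 sized", 4MB
#   -s $((4 * 1024 * 1024 * 1024))   # "video sized", 4GB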

Sorry for the long ramble - I was curious to see whether this makes sense to
the broader set of you all & whether you have had any similar experiences to
share.

Thanks!

Ric

* Re: large fs testing
  2009-05-23 13:53 large fs testing Ric Wheeler
@ 2009-05-26 12:21 ` Joshua Giles
  2009-05-26 12:28   ` Ric Wheeler
  2009-05-26 17:39 ` Nick Dokos
  1 sibling, 1 reply; 10+ messages in thread
From: Joshua Giles @ 2009-05-26 12:21 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, Douglas Shakshober, Valerie Aurora,
	Eric Sandeen, Steven Whitehouse, Edward Shishkin, Josef Bacik,
	Jeff Moyer, Chris Mason, Eric Whitney, Theodore Tso,
	linux-fsdevel, jneedham

Hi Ric,

I'm wondering if we should include a "regression" performance test as part of
the tools you'll give out for large fs testing.  Given some simple tools to run
that output some numbers, we could ask the testers (or make it part of the test
cycle) to measure the difference between Fedora major releases and send us that
info.  Would we need some agreement that they not share this info with others,
or is this dirty laundry OK to air?

-Josh Giles

* Re: large fs testing
  2009-05-26 12:21 ` Joshua Giles
@ 2009-05-26 12:28   ` Ric Wheeler
  0 siblings, 0 replies; 10+ messages in thread
From: Ric Wheeler @ 2009-05-26 12:28 UTC (permalink / raw)
  To: Joshua Giles
  Cc: Christoph Hellwig, Douglas Shakshober, Valerie Aurora,
	Eric Sandeen, Steven Whitehouse, Edward Shishkin, Josef Bacik,
	Jeff Moyer, Chris Mason, Eric Whitney, Theodore Tso,
	linux-fsdevel, jneedham

I think that this kind of regression test should be fine; the key to avoiding
the no-benchmarking issue is not to compare results on one array against a
second....

ric



* Re: large fs testing
  2009-05-23 13:53 large fs testing Ric Wheeler
  2009-05-26 12:21 ` Joshua Giles
@ 2009-05-26 17:39 ` Nick Dokos
  2009-05-26 17:47   ` Ric Wheeler
  1 sibling, 1 reply; 10+ messages in thread
From: Nick Dokos @ 2009-05-26 17:39 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: linux-fsdevel, Christoph Hellwig, Douglas Shakshober,
	Joshua Giles, Valerie Aurora, Eric Sandeen, Steven Whitehouse,
	Edward Shishkin, Josef Bacik, Jeff Moyer, Chris Mason, Whitney,
	Eric, Theodore Tso, nicholas.dokos

> 
> (3) FS creation time - can you create a file system in reasonable
> time? (mkfs.xfs took seconds, mkfs.ext4 took 90 minutes). I think that
> 90 minutes is definitely on the painful side, but usable for most.
> 

I get better numbers for some reason: on a 32 TiB filesystem (16 LUNs,
2 TiB each, 128 KiB stripes at both the RAID controller and in LVM), with
the following options I get:

# time mke2fs -q -t ext4 -O ^resize_inode -E stride=32,stripe-width=512,lazy_itable_init=1 /dev/mapper/bigvg-bigvol

real	1m2.137s
user	0m58.934s
sys	0m1.981s


Without lazy_itable_init, I get

# time mke2fs -q -t ext4 -O ^resize_inode -E stride=32,stripe-width=512 /dev/mapper/bigvg-bigvol

real	12m54.510s
user	1m4.786s
sys	11m44.762s

Thanks,
Nick


* Re: large fs testing
  2009-05-26 17:39 ` Nick Dokos
@ 2009-05-26 17:47   ` Ric Wheeler
  2009-05-26 21:21     ` Andreas Dilger
  0 siblings, 1 reply; 10+ messages in thread
From: Ric Wheeler @ 2009-05-26 17:47 UTC (permalink / raw)
  To: nicholas.dokos
  Cc: linux-fsdevel, Christoph Hellwig, Douglas Shakshober,
	Joshua Giles, Valerie Aurora, Eric Sandeen, Steven Whitehouse,
	Edward Shishkin, Josef Bacik, Jeff Moyer, Chris Mason, Whitney,
	Eric, Theodore Tso

On 05/26/2009 01:39 PM, Nick Dokos wrote:

Hi Nick,

These runs were without lazy init, so, assuming that it scales linearly, I
would expect to be a little more than twice as slow as your second run (not
the three times I saw). This run was with limited DRAM on the box (6GB) and
only a single HBA, but I am afraid that I did not get any good insight into
what the bottleneck was during my runs. Also, I am pretty certain that most
arrays do better with more, smaller LUNs (like you had) than with fewer,
larger ones.

Do you have any access to even larger storage, say the mythical 100TB :-) ? Any 
insight on interesting workloads?

Thanks!

Ric



* Re: large fs testing
  2009-05-26 17:47   ` Ric Wheeler
@ 2009-05-26 21:21     ` Andreas Dilger
  2009-05-26 21:39       ` Theodore Tso
  2009-05-26 22:17       ` Ric Wheeler
  0 siblings, 2 replies; 10+ messages in thread
From: Andreas Dilger @ 2009-05-26 21:21 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: nicholas.dokos, linux-fsdevel, Christoph Hellwig,
	Douglas Shakshober, Joshua Giles, Valerie Aurora, Eric Sandeen,
	Steven Whitehouse, Edward Shishkin, Josef Bacik, Jeff Moyer,
	Chris Mason, Whitney, Eric, Theodore Tso

On May 26, 2009  13:47 -0400, Ric Wheeler wrote:
> These runs were without lazy init, so I would expect to be a little more 
> than twice as slow as your second run (not the three times I saw) 
> assuming that it scales linearly.

Making lazy_itable_init the default formatting option for ext4 is/was
dependent upon the kernel doing the zeroing of the inode table blocks
at first mount time.  I'm not sure if that was implemented yet.

> This run was with limited DRAM on the 
> box (6GB) and only a single HBA, but I am afraid that I did not get any 
> good insight into what was the bottleneck during my runs.

For a very large array (80TB) this could be 1TB or more of inode tables
that are being zeroed out at format time.  After 64TB the default mke2fs
options will cap out at 4B inodes in the filesystem.  1TB/90min ~= 200MB/s
so this is probably your bottleneck.
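
As a quick sanity check of that arithmetic (assuming the 256-byte inode size
that mke2fs uses by default for ext4):

$ echo $(( (2**32 * 256) / 2**40 )) TiB         # 4B inodes x 256 bytes each
1 TiB
$ echo $(( 2**40 / (90 * 60) / 1000000 )) MB/s  # 1 TiB zeroed in 90 minutes
203 MB/s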

> Do you have any access to even larger storage, say the mythical 100TB :-) 
> ? Any insight on interesting workloads?

I would definitely be most interested in e2fsck performance at this scale
(RAM usage and elapsed time), because this will in the end be the defining
limit on how large a usable filesystem can actually be in practice.
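
One simple way to capture both numbers as part of the test kit would be
something like the following (a sketch only - the device path is a
placeholder, and it leans on GNU time's -v output for the peak RSS):

# Wall-clock/CPU time plus peak resident set size of a read-only check:
$ /usr/bin/time -v fsck.ext4 -f -n -tt /dev/mapper/bigvol 2>&1 | tee fsck-run.log
$ grep -E 'Elapsed|Maximum resident' fsck-run.log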

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: large fs testing
  2009-05-26 21:21     ` Andreas Dilger
@ 2009-05-26 21:39       ` Theodore Tso
  2009-05-26 22:17       ` Ric Wheeler
  1 sibling, 0 replies; 10+ messages in thread
From: Theodore Tso @ 2009-05-26 21:39 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Ric Wheeler, nicholas.dokos, linux-fsdevel, Christoph Hellwig,
	Douglas Shakshober, Joshua Giles, Valerie Aurora, Eric Sandeen,
	Steven Whitehouse, Edward Shishkin, Josef Bacik, Jeff Moyer,
	Chris Mason, Whitney, Eric

On Tue, May 26, 2009 at 03:21:32PM -0600, Andreas Dilger wrote:
> On May 26, 2009  13:47 -0400, Ric Wheeler wrote:
> > These runs were without lazy init, so I would expect to be a little more 
> > than twice as slow as your second run (not the three times I saw) 
> > assuming that it scales linearly.
> 
> Making lazy_itable_init the default formatting option for ext4 is/was
> dependent upon the kernel doing the zeroing of the inode table blocks
> at first mount time.  I'm not sure if that was implemented yet.

No, it hasn't been implemented yet.  If someone would like to step
forward, it's not a hard patch to write.  The good news is that there's
no need to actually use the journal for most of the zeroing;
basically, the design would be, for each block group whose inode
table hasn't been initialized, to call down_write() on the per-block
group alloc_sem semaphore in ext4_group_info, initialize the inode
table in the unused portion of the block group (which can be calculated
from ext4_itable_unused_count()), then release the alloc_sem
semaphore, and then (under journal protection) set the
EXT4_BG_INODE_ZEROED flag in the block group descriptor.

If someone hurries, we could get this done before the next merge
window opens (probably in a week or two).

						- Ted


* Re: large fs testing
  2009-05-26 21:21     ` Andreas Dilger
  2009-05-26 21:39       ` Theodore Tso
@ 2009-05-26 22:17       ` Ric Wheeler
  2009-05-28  6:30         ` Andreas Dilger
  1 sibling, 1 reply; 10+ messages in thread
From: Ric Wheeler @ 2009-05-26 22:17 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: nicholas.dokos, linux-fsdevel, Christoph Hellwig,
	Douglas Shakshober, Joshua Giles, Valerie Aurora, Eric Sandeen,
	Steven Whitehouse, Edward Shishkin, Josef Bacik, Jeff Moyer,
	Chris Mason, Whitney, Eric, Theodore Tso

On 05/26/2009 05:21 PM, Andreas Dilger wrote:
> I would definitely be most interested in e2fsck performance at this scale
> (RAM usage and elapsed time) because this will in the end be the defining
> limit on how large a usable filesystem can actually be in practise.
>
> Cheers, Andreas


Not sure why, but the box rebooted (crashed?) a couple of hours into the run (no 
hints in the logs pointed at anything suspicious).

What I did get was the following from the fsck run:

[root@l82bi250 redhat]# time /sbin/fsck.ext4 -tt -y /dev/mapper/Big_boy-Big_boy
e2fsck 1.41.4 (27-Jan-2009)
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 1596k/1177752k (1447k/150k), time: 1184.73/514.16/344.38
Pass 1: I/O read: 50655MB, write: 0MB, rate: 42.76MB/s
Pass 2: Checking directory structure
Entry '4a1590dc~~~~~~~~O4A0SMJ1VC34YQ1PD3B5DL9Q' in /da (188378) references 
inode 196988 in group 30 where _INODE_UNINIT is set.
Fix? yes

Restarting e2fsck from the beginning...
Group descriptor 15 checksum is invalid.  Fix? yes

Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 120396k/-1389015k (120134k/263k), time: 1134.71/522.48/323.65
Pass 1: I/O read: 50656MB, write: 0MB, rate: 44.64MB/s
Pass 2: Checking directory structure
Entry '4a15910c~~~~~~~~H8099TRM701Q29CSTCWBVIHJ' in /0b (404925) references 
inode 413100 in group 62 where _INODE_UNINIT is set.
Fix? yes

Restarting e2fsck from the beginning...
Group descriptor 31 checksum is invalid.  Fix? yes

Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 231360k/246272k (231083k/278k), time: 1140.48/521.00/334.74
Pass 1: I/O read: 50658MB, write: 0MB, rate: 44.42MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 231360k/1290436k (231083k/278k), time: 538.22/264.56/83.49
Pass 2: I/O read: 13749MB, write: 0MB, rate: 25.55MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 231360k/1789000k (231083k/278k), time: 
4221.57/1947.37/1116.21
Pass 3A: Memory used: 231360k/1789000k (231083k/278k), time:  0.00/ 0.00/ 0.00
Pass 3A: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 3: Memory used: 231360k/1290436k (231083k/278k), time:  9.99/ 0.26/ 1.37
Pass 3: I/O read: 1MB, write: 0MB, rate: 0.10MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 231360k/-1481575k (231082k/279k), time: 147.16/139.87/ 1.94
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Inode bitmap differences:  -(98404--98405)

Note that it got truncated in Pass 5 - just after writing out some values that 
look like they sign wrapped?

-(103650--103655) -(103659--103660) -103663 -103665 -103667 -(103669--103670) 
-(103673--103676) -103679 -103684 -103687 -10

ric



* Re: large fs testing
  2009-05-26 22:17       ` Ric Wheeler
@ 2009-05-28  6:30         ` Andreas Dilger
  2009-05-28 10:52           ` Ric Wheeler
  0 siblings, 1 reply; 10+ messages in thread
From: Andreas Dilger @ 2009-05-28  6:30 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: nicholas.dokos, linux-fsdevel, Christoph Hellwig,
	Douglas Shakshober, Joshua Giles, Valerie Aurora, Eric Sandeen,
	Steven Whitehouse, Edward Shishkin, Josef Bacik, Jeff Moyer,
	Chris Mason, Whitney, Eric, Theodore Tso

On May 26, 2009  18:17 -0400, Ric Wheeler wrote:
> What I did get was the following from the fsck run:
>
> root@l82bi250:/home/redhat\aYou have new mail in /var/spool/mail/root
> [root@l82bi250 redhat]# time /sbin/fsck.ext4 -tt -y /dev/mapper/Big_boy-Big_boy
> e2fsck 1.41.4 (27-Jan-2009)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 1: Memory used: 1596k/1177752k (1447k/150k), time: 1184.73/514.16/344.38
> Pass 1: I/O read: 50655MB, write: 0MB, rate: 42.76MB/s
> Pass 2: Checking directory structure
> Entry '4a1590dc~~~~~~~~O4A0SMJ1VC34YQ1PD3B5DL9Q' in /da (188378) 
> references inode 196988 in group 30 where _INODE_UNINIT is set.
> Fix? yes
>
> Restarting e2fsck from the beginning...
> Group descriptor 15 checksum is invalid.  Fix? yes
>
> Pass 1: Checking inodes, blocks, and sizes
> Pass 1: Memory used: 120396k/-1389015k (120134k/263k), time: 1134.71/522.48/323.65
> Pass 1: I/O read: 50656MB, write: 0MB, rate: 44.64MB/s
> Pass 2: Checking directory structure
> Entry '4a15910c~~~~~~~~H8099TRM701Q29CSTCWBVIHJ' in /0b (404925) 
> references inode 413100 in group 62 where _INODE_UNINIT is set.
> Fix? yes
>
> Restarting e2fsck from the beginning...
> Group descriptor 31 checksum is invalid.  Fix? yes

This looks like there is a patch of ours missing from the upstream e2fsprogs.
We have a patch that will restart e2fsck only a single time for inodes
beyond the high watermark.  On a large filesystem like yours this would
have cut 30 minutes off the e2fsck time.  I'll submit that separately.

> Pass 1: Checking inodes, blocks, and sizes
> Pass 1: Memory used: 231360k/246272k (231083k/278k), time: 1140.48/521.00/334.74
> Pass 1: I/O read: 50658MB, write: 0MB, rate: 44.42MB/s
> Pass 2: Checking directory structure
> Pass 2: Memory used: 231360k/1290436k (231083k/278k), time: 538.22/264.56/83.49
> Pass 2: I/O read: 13749MB, write: 0MB, rate: 25.55MB/s
> Pass 3: Checking directory connectivity
> Peak memory: Memory used: 231360k/1789000k (231083k/278k), time:  
> 4221.57/1947.37/1116.21
> Pass 3A: Memory used: 231360k/1789000k (231083k/278k), time:  0.00/ 0.00/ 0.00
> Pass 3A: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
> Pass 3: Memory used: 231360k/1290436k (231083k/278k), time:  9.99/ 0.26/ 1.37
> Pass 3: I/O read: 1MB, write: 0MB, rate: 0.10MB/s
> Pass 4: Checking reference counts
> Pass 4: Memory used: 231360k/-1481575k (231082k/279k), time: 147.16/139.87/ 1.94

Sign overflow here...  Looks like we exceeded 2.5GB of memory.  Still,
not too bad considering this is an 80TB filesystem.

> Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
> Pass 5: Checking group summary information
> Inode bitmap differences:  -(98404--98405)
>
> Note that it got truncated in Pass 5 - just after writing out some values 
> that look like they sign wrapped?
>
> -(103650--103655) -(103659--103660) -103663 -103665 -103667 
> -(103669--103670) -(103673--103676) -103679 -103684 -103687 -10

No, this is what gets printed when there are inodes (or blocks) marked
in the bitmap that are not in use.  It shouldn't be truncated however.
You said the node crashed at this point?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: large fs testing
  2009-05-28  6:30         ` Andreas Dilger
@ 2009-05-28 10:52           ` Ric Wheeler
  0 siblings, 0 replies; 10+ messages in thread
From: Ric Wheeler @ 2009-05-28 10:52 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: nicholas.dokos, linux-fsdevel, Christoph Hellwig,
	Douglas Shakshober, Joshua Giles, Valerie Aurora, Eric Sandeen,
	Steven Whitehouse, Edward Shishkin, Josef Bacik, Jeff Moyer,
	Chris Mason, Whitney, Eric, Theodore Tso

On 05/28/2009 02:30 AM, Andreas Dilger wrote:
> On May 26, 2009  18:17 -0400, Ric Wheeler wrote:
>> What I did get was the following from the fsck run:
>>
>> root@l82bi250:/home/redhat\aYou have new mail in /var/spool/mail/root
>> [root@l82bi250 redhat]# time /sbin/fsck.ext4 -tt -y /dev/mapper/Big_boy-Big_boy
>> e2fsck 1.41.4 (27-Jan-2009)
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 1: Memory used: 1596k/1177752k (1447k/150k), time: 1184.73/514.16/344.38
>> Pass 1: I/O read: 50655MB, write: 0MB, rate: 42.76MB/s
>> Pass 2: Checking directory structure
>> Entry '4a1590dc~~~~~~~~O4A0SMJ1VC34YQ1PD3B5DL9Q' in /da (188378)
>> references inode 196988 in group 30 where _INODE_UNINIT is set.
>> Fix? yes
>>
>> Restarting e2fsck from the beginning...
>> Group descriptor 15 checksum is invalid.  Fix? yes
>>
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 1: Memory used: 120396k/-1389015k (120134k/263k), time: 1134.71/522.48/323.65
>> Pass 1: I/O read: 50656MB, write: 0MB, rate: 44.64MB/s
>> Pass 2: Checking directory structure
>> Entry '4a15910c~~~~~~~~H8099TRM701Q29CSTCWBVIHJ' in /0b (404925)
>> references inode 413100 in group 62 where _INODE_UNINIT is set.
>> Fix? yes
>>
>> Restarting e2fsck from the beginning...
>> Group descriptor 31 checksum is invalid.  Fix? yes
>
> This looks like there is a patch of ours missing from the upstream e2fsprogs.
> We have a patch that will restart e2fsck only a single time for inodes
> beyond the high waterwark.  On a large filesystem like yours this would
> have cut 30 minutes off the e2fsck time.  I'll submit that separately.
>
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 1: Memory used: 231360k/246272k (231083k/278k), time: 1140.48/521.00/334.74
>> Pass 1: I/O read: 50658MB, write: 0MB, rate: 44.42MB/s
>> Pass 2: Checking directory structure
>> Pass 2: Memory used: 231360k/1290436k (231083k/278k), time: 538.22/264.56/83.49
>> Pass 2: I/O read: 13749MB, write: 0MB, rate: 25.55MB/s
>> Pass 3: Checking directory connectivity
>> Peak memory: Memory used: 231360k/1789000k (231083k/278k), time:
>> 4221.57/1947.37/1116.21
>> Pass 3A: Memory used: 231360k/1789000k (231083k/278k), time:  0.00/ 0.00/ 0.00
>> Pass 3A: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
>> Pass 3: Memory used: 231360k/1290436k (231083k/278k), time:  9.99/ 0.26/ 1.37
>> Pass 3: I/O read: 1MB, write: 0MB, rate: 0.10MB/s
>> Pass 4: Checking reference counts
>> Pass 4: Memory used: 231360k/-1481575k (231082k/279k), time: 147.16/139.87/ 1.94
>
> Sign overflow here...  Looks like we exceed 2.5GB of memory here.   Still,
> not too bad considering this is a 80TB filesystem.

The fsck had a virtual size of around 10GB (5.4GB resident in the 6GB of DRAM)
when I checked...  I wonder if it would have been significantly faster without
the excessive swap use (i.e., on a box with more memory)?

>
>> Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
>> Pass 5: Checking group summary information
>> Inode bitmap differences:  -(98404--98405)
>>
>> Note that it got truncated in Pass 5 - just after writing out some values
>> that look like they sign wrapped?
>>
>> -(103650--103655) -(103659--103660) -103663 -103665 -103667
>> -(103669--103670) -(103673--103676) -103679 -103684 -103687 -10
>
> No, this is what gets printed when there are inodes (or blocks) marked
> in the bitmap that are not in use.  It shouldn't be truncated however.
> You said the node crashed at this point?
>
> Cheers, Andreas

Yes - unfortunately, there were no logs or other indications of why it crashed,
and we did not have a serial console set up either, so we don't have anything
to go on.

I am going to push harder to get some large storage configurations that we can
use for testing internally, so hopefully we will have something to test on in
a couple of months....

Ric

