* How to pre-allocate files for sequential access?
@ 2012-04-04 23:57 troby
  2012-04-05 15:40 ` Eric Sandeen
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: troby @ 2012-04-04 23:57 UTC (permalink / raw)
  To: xfs


I am trying to set up a 20 TB filesystem which will contain a single
directory with 10000 pre-allocated 2GB files. There will be only a small
number of other directories with very little activity. Once the files are
preallocated there will be almost no new file creation. The files will be
written sequentially, typically with writes of about 120KB, and will not be
updated until the filesystem fills, at which point the earliest files will
start to be overwritten (not deleted). There will be relatively little read
activity. There will be a single writer process using a single thread. The
filesystem application is MongoDB. I am trying to minimize seek activity
during the write process, and would also like to have contiguous file
allocation since the database queries will be retrieving records from a
sequentially-related set of files.
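
For illustration (MongoDB does its own preallocation, so this is not
necessarily what it does under the hood, and the path below is made up),
a 2GB file can be preallocated without writing any data with either of:

$ xfs_io -f -c "falloc 0 2g" /data/db/test.0
$ fallocate -l 2G /data/db/test.0
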
The filesystem as currently created looks like this:

meta-data=/dev/sdb1              isize=256    agcount=20, agsize=268435448 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=5127012091, imaxpct=1
         =                       sunit=8      swidth=56 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

However what I see is that the earliest created files start about 5TB into
the filesystem. The files are not being created in contiguous block ranges.
Here is an xfs_bmap example of three files created in sequence:
0: [0..4192255]: 24075083136..24079275391
0: [0..4192255]: 26222566720..26226758975
0: [0..4192255]: 28370050304..28374242559
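
(For reference, xfs_bmap -v prints an AG and AG-OFFSET column per extent,
which shows directly which allocation group each file landed in:

$ xfs_bmap -v /path/to/datafile
)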

Currently a process is doing continuous data inserts into the database, and
is writing sequential segments within the files, filling a file in about 6
minutes, and moving on to the next. There is also a small amount of write
activity to a single file containing database metadata which is located
about 5TB into the filesystem. The database index files are located on a
separate disk. 

Using seekwatcher I've determined that the actual I/O pattern, even when a
small number of files is being written to, is spread over a fairly wide
range of filesystem offsets, resulting in about 250 seeks per second. I
don't know how to determine how long the seeks are. (I tried to upload the
seekwatcher image but apparently that's not allowed). Seekwatcher shows the
I/O activity is in a range between 15 and 17 TB into the filesystem. During
this time there was a set of about 4 files being actively written as far as
I know.

I'm guessing that the use of multiple allocation groups may explain the
non-contiguous block allocation, although I read at one point that even with
multiple allocation groups, files within a single directory would use the
same group. I don't believe I need multiple allocation groups for this
application due to the single writer and the fact that all files will be
preallocated before use. Would it be reasonable to force mkfs to use a
single 20TB allocation group, and would this be likely to produce contiguous
block allocation?

This is kernel 3.0.25 using xfsprogs 3.1.1.


-- 
View this message in context: http://old.nabble.com/How-to-pre-allocate-files-for-sequential-access--tp33564834p33564834.html
Sent from the Xfs - General mailing list archive at Nabble.com.


* Re: How to pre-allocate files for sequential access?
  2012-04-04 23:57 How to pre-allocate files for sequential access? troby
@ 2012-04-05 15:40 ` Eric Sandeen
  2012-04-05 21:57 ` Matthias Schniedermeyer
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Eric Sandeen @ 2012-04-05 15:40 UTC (permalink / raw)
  To: troby; +Cc: xfs

On 4/4/12 4:57 PM, troby wrote:
> 
> I am trying to set up a 20 TB filesystem which will contain a single
> directory with 10000 pre-allocated 2GB files. There will be only a small
> number of other directories with very little activity. Once the files are
> preallocated there will be almost no new file creation. The files will be
> written sequentially, typically with writes of about 120KB, and will not be
> updated until the filesystem fills, at which point the earliest files will
> start to be overwritten (not deleted). There will be relatively little read
> activity. There will be a single writer process using a single thread. The
> filesystem application is MongoDB. I am trying to minimize seek activity
> during the write process, and would also like to have contiguous file
> allocation since the database queries will be retrieving records from a
> sequentially-related set of files.
> The filesystem as currently created looks like this:
> 
> meta-data=/dev/sdb1              isize=256    agcount=20, agsize=268435448 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=5127012091, imaxpct=1
>          =                       sunit=8      swidth=56 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> However what I see is that the earliest created files start about 5TB into
> the filesystem. The files are not being created in contiguous block ranges.
> Here is an xfs_bmap example of three files created in sequence:
> 0: [0..4192255]: 24075083136..24079275391
> 0: [0..4192255]: 26222566720..26226758975
> 0: [0..4192255]: 28370050304..28374242559

Please try again from scratch after mounting with the -o inode64 mount option,
which we will make default Real Soon Now(tm).

That option will more evenly spread inodes & file data throughout your whole
20T.
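
Concretely (the mount point is just an example), applied from the first
mount of the freshly made filesystem:

# mount -o inode64 /dev/sdb1 /data

or in /etc/fstab:

/dev/sdb1  /data  xfs  inode64  0 0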

> Currently a process is doing continuous data inserts into the database, and
> is writing sequential segments within the files, filling a file in about 6
> minutes, and moving on to the next. There is also a small amount of write
> activity to a single file containing database metadata which is located
> about 5TB into the filesystem. The database index files are located on a
> separate disk. 
> 
> Using seekwatcher I've determined that the actual I/O pattern, even when a
> small number of files is being written to, is spread over a fairly wide
> range of filesystem offsets, resulting in about 250 seeks per second. I

Some of that seeking will be log writes, in the middle of the fs.

> don't know how to determine how long the seeks are. (I tried to upload the
> seekwatcher image but apparently that's not allowed). Seekwatcher shows the
> I/O activity is in a range between 15 and 17 TB into the filesystem. During
> this time there was a set of about 4 files being actively written as far as
> I know.
> 
> I'm guessing that the use of multiple allocation groups may explain the
> non-contiguous block allocation, although I read at one point that even with
> multiple allocation groups, files within a single directory would use the
> same group. I don't believe I need multiple allocation groups for this

That's generally true, but only until that group fills.  If you fill the
whole fs with files in the same dir, of course it will have to spill to other
AGs...

> application due to the single writer and the fact that all files will be
> preallocated before use. Would it be reasonable to force mkfs to use a
> single 20TB allocation group, and would this be likely to produce contiguous
> block allocation?

The AG size maxes out at 1T, so you can't make a single AG.

I'd give it another shot with inode64 and see if things look a little better,
or at least a bit more predictable.
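
FWIW your current mkfs output already shows agcount=20 with ~1T AGs,
i.e. the smallest AG count possible on a 20T device.  The knob does
exist, but it can only push the count up, not down:

$ mkfs.xfs -d agcount=20 /dev/sdb1    # effectively what the default chose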

-Eric

> This is kernel 3.0.25 using xfsprogs 3.1.1.
> 
> 


* Re: How to pre-allocate files for sequential access?
  2012-04-04 23:57 How to pre-allocate files for sequential access? troby
  2012-04-05 15:40 ` Eric Sandeen
@ 2012-04-05 21:57 ` Matthias Schniedermeyer
  2012-04-05 22:40   ` Eric Sandeen
  2012-04-06  7:06 ` Stan Hoeppner
  2012-04-10 19:12 ` troby
  3 siblings, 1 reply; 6+ messages in thread
From: Matthias Schniedermeyer @ 2012-04-05 21:57 UTC (permalink / raw)
  To: troby; +Cc: xfs

On 04.04.2012 16:57, troby wrote:

...

I think the easiest solution would be to create the required number of
files, but with dummy filenames.

Then write a script that runs xfs_bmap on each file, sorts them by
starting block, and then renames the dummy files to the correct names
in the order they appear on disk.
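
Something along these lines, perhaps (an untested sketch: it assumes one
extent per file, no whitespace in the names, and made-up naming patterns
dummy.* -> mydb.N -- adjust to taste):

#!/bin/sh
i=0
for d in dummy.*; do
    # starting block of the first extent, from the 2nd line of xfs_bmap output
    start=$(xfs_bmap "$d" | awk 'NR == 2 { sub(/\.\..*/, "", $3); print $3 }')
    echo "$start $d"
done | sort -n | while read -r start d; do
    mv "$d" "mydb.$i"
    i=$((i + 1))
done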





Until then

-- 
Real Programmers consider "what you see is what you get" to be just as 
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated, 
cryptic, powerful, unforgiving, dangerous.


* Re: How to pre-allocate files for sequential access?
  2012-04-05 21:57 ` Matthias Schniedermeyer
@ 2012-04-05 22:40   ` Eric Sandeen
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Sandeen @ 2012-04-05 22:40 UTC (permalink / raw)
  To: Matthias Schniedermeyer; +Cc: troby, xfs

On Apr 5, 2012, at 2:57 PM, Matthias Schniedermeyer <ms@citd.de> wrote:

> On 04.04.2012 16:57, troby wrote:
> 
> ...
> 
> I think the easiest solution would be to create the required number of 
> files, but with dummy filenames.
> 
> Then write a script that runs xfs_bmap on each file, sorts them by 
> starting block, and then renames the dummy files to the correct names 
> in the order they appear on disk.
> 

Inode64 will likely keep them in order without needing tricks like that.

Eric


* Re: How to pre-allocate files for sequential access?
  2012-04-04 23:57 How to pre-allocate files for sequential access? troby
  2012-04-05 15:40 ` Eric Sandeen
  2012-04-05 21:57 ` Matthias Schniedermeyer
@ 2012-04-06  7:06 ` Stan Hoeppner
  2012-04-10 19:12 ` troby
  3 siblings, 0 replies; 6+ messages in thread
From: Stan Hoeppner @ 2012-04-06  7:06 UTC (permalink / raw)
  To: troby; +Cc: xfs

On 4/4/2012 6:57 PM, troby wrote:

The high points:

> 20 TB filesystem
> directory with 10000 pre-allocated 2GB files. 
> written sequentially, writes of about 120KB
> overwritten (not deleted)
> single writer process using a single thread. 
> MongoDB. 
> minimize seek activity
> contiguous file allocation

Current filesystem configuration:

> The filesystem as currently created looks like this:
> 
> meta-data=/dev/sdb1              isize=256    agcount=20, agsize=268435448 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=5127012091, imaxpct=1
>          =                       sunit=8      swidth=56 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

RAID configuration based on xfs_info analysis:

Hardware RAID controller, unknown type, unknown cache configuration
8 x 3TB disks, possibly Advanced Format and/or 'green', 5600-7200 RPM, 32-64MB cache
RAID5 with a 32KB chunk across a 7-spindle stripe

> However what I see is that the earliest created files start about 5TB into
> the filesystem. The files are not being created in contiguous block ranges.
> Here is an xfs_bmap example of three files created in sequence:
> 0: [0..4192255]: 24075083136..24079275391
> 0: [0..4192255]: 26222566720..26226758975
> 0: [0..4192255]: 28370050304..28374242559

This is a result of the inode32 allocator behavior.  Try the inode64
allocator as Eric recommended.

> Using seekwatcher I've determined that the actual I/O pattern, even when a
> small number of files is being written to, is spread over a fairly wide
> range of filesystem offsets, resulting in about 250 seeks per second. I

I don't care for the use of the term "seek" here, as there is not a 1:1
correlation between these "seeks" and actual disk head seeks.  The
latter are what really matter, because that's where all the latency is.
Folks most often become head-seek bound due to the extra read seeks of
read-modify-write (RMW) operations when using parity RAID.  These extra
RMW seeks are completely hidden from Seekwatcher when using hardware
RAID, though they should be somewhat visible with md RAID.
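
(Rough arithmetic from the xfs_info above: sunit=8 and swidth=56 4KB
blocks give a 32KB chunk and a 224KB full stripe across 7 data spindles,
so a typical 120KB write covers only part of a stripe and, unless the
controller cache can coalesce full stripes, implies a read-modify-write
of the parity.)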

> don't know how to determine how long the seeks are. (I tried to upload the
> seekwatcher image but apparently that's not allowed). Seekwatcher shows the
> I/O activity is in a range between 15 and 17 TB into the filesystem. During
> this time there was a set of about 4 files being actively written as far as
> I know.

It's coarse, but you might start with iostat interactively to get a
rough idea of overall latency.  Look at the await column (see man
iostat).  The command below reports at 1-second intervals:

$ iostat -d 1 -x /dev/sdb1
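
If you want a rough idea of the seek distances themselves (in 512-byte
sectors), one untested sketch is to diff the start sectors of consecutive
dispatched requests over a short blktrace sample; this assumes blkparse's
default output, where a 'D' dispatch event's start sector is the 8th field:

$ blktrace -d /dev/sdb -w 10 -o - | blkparse -i - | \
    awk '$6 == "D" && $8 ~ /^[0-9]+$/ {
           if (prev != "") { d = $8 - prev; if (d < 0) d = -d; print d }
           prev = $8
         }' | sort -n | tail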

> I'm guessing that the use of multiple allocation groups may explain the
> non-contiguous block allocation, although I read at one point that even with
> multiple allocation groups, files within a single directory would use the
> same group. 

What you describe is the behavior of the inode64 allocator, which is
optional.  IIRC the [default] inode32 allocator puts all the directories
in the first AG and spreads the files around the remaining AGs, which is
what you are seeing.

> I don't believe I need multiple allocation groups for this
> application due to the single writer and the fact that all files will be
> preallocated before use. Would it be reasonable to force mkfs to use a
> single 20TB allocation group, and would this be likely to produce contiguous
> block allocation?

As Eric mentioned, the max AG size is 1TB, though this may increase in
the future as areal density increases.

One final thought I'd like to share is that parity RAID, level 5 or 6,
is rarely, if ever, good for write-mostly applications, especially for
overwriting applications.  You're preallocating all of your files on a
fresh XFS, so you should be able to avoid the extra seeks of RMW until
you start overwriting them.  Once that happens your request latency will
double, if not more, and your throughput will take a 2x or more nose
dive.  If you can get by with 11-12TB, I'd rebuild those 8 drives as a
RAID10 array.  Here are your advantages:

1.  No RMW penalty, near zero RAID computation, high throughput
2.  RAID5 write throughput drops 5-20x when degraded/rebuilding
3.  RAID10 loses zero performance when degraded
4.  RAID10 rebuild time is 10-20x faster than RAID5
5.  RAID10 suffers only a small performance drop during rebuild

Need all 20TB, or something in between 10-20?  What is your current
hardware setup and allowed upgrade budget, if any?  I'd be willing to
offer you some optimal hardware choices/advice based on your current
setup, to achieve the optimal RAID10, if you would like.

-- 
Stan


* Re: How to pre-allocate files for sequential access?
  2012-04-04 23:57 How to pre-allocate files for sequential access? troby
                   ` (2 preceding siblings ...)
  2012-04-06  7:06 ` Stan Hoeppner
@ 2012-04-10 19:12 ` troby
  3 siblings, 0 replies; 6+ messages in thread
From: troby @ 2012-04-10 19:12 UTC (permalink / raw)
  To: xfs


Thanks all for your help. Due to unrelated collateral damage within the
database software, I had the "opportunity" to rebuild the filesystems using
some of your suggestions. I set up external log devices for the two busiest
filesystems (RAID5 for the row data, a RAID10 on 4 drives for indexes) and
configured the stripe widths correctly this time. The inode64 mount option
did result in sequential allocation, with a single 2GB extent per file, for
all files which were pre-allocated by the MongoDB software. A small number
of similar files I created manually using dd from /dev/zero with a single
2GB block showed 3 or 4 extents coming from different AGs, with
correspondingly disparate block ranges, but there aren't enough of those to
cause problems.

One thing that puzzles me is that despite my configuration of the underlying
RAID stripe geometry both at filesystem creation and mount time, all the
filesystems show average request sizes (mostly writes at this time) of
around 240 sectors. This is correct for some of them, but the RAID1 stripe
is twice that wide and the RAID5 almost 4 times as wide. The files being
written are all memory-mapped, so I'm wondering whether the kernel uses some
other settings besides the fstab mount options to determine the request
size. The flush activity only happens a few times a minute and only lasts a
second or two, so I don't think there's really a significant performance
impact under the current load. And since the writes are actually going to a
1GB controller cache, I suspect there is enough time for the controller to
assemble a full RAID5 stripe before writing to disk.
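
For reference, an external log plus explicit stripe geometry is specified
roughly like this at mkfs/mount time (the device names and su/sw values
below are placeholders, not the actual ones used here):

# mkfs.xfs -d su=32k,sw=7 -l logdev=/dev/sdc1,size=128m /dev/sdb1
# mount -o logdev=/dev/sdc1,inode64 /dev/sdb1 /data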
-- 
View this message in context: http://old.nabble.com/How-to-pre-allocate-files-for-sequential-access--tp33564834p33663734.html
Sent from the Xfs - General mailing list archive at Nabble.com.

