* Fragmentation Issue We Are Having
@ 2012-04-12  1:04 David Fuller
  2012-04-12  2:16 ` Dave Chinner
  2012-04-12  7:57 ` Brian Candler
  0 siblings, 2 replies; 15+ messages in thread
From: David Fuller @ 2012-04-12  1:04 UTC (permalink / raw)
  To: xfs



We seem to be having an issue whereby our database server
gets to 90% or higher fragmentation.  When it gets to this point
we would need to remove from production and defrag using the
xfs_fsr tool.  The server does get a lot of writes and reads.  Is
there something we can do to reduce the fragmentation or could
this be a result of hard disk tweaks we use or mount options?

here are some of the tweaks we do:

/bin/echo "512" > /sys/block/sda/queue/read_ahead_kb
/bin/echo "10000" > /sys/block/sda/queue/nr_requests
/bin/echo "512" > /sys/block/sdb/queue/read_ahead_kb
/bin/echo "10000" > /sys/block/sdb/queue/nr_requests
/bin/echo "noop" > /sys/block/sda/queue/scheduler
/bin/echo "noop" > /sys/block/sdb/queue/scheduler


And here are the mount options on one of our servers:

 xfs     rw,noikeep,allocsize=256M,logbufs=8,sunit=128,swidth=2304

the sunit and swidth vary on each server based on disk drives.

We do use LVM on the volume where the mysql data is stored
as we need this for snapshotting.  Here is an example of a current state:

xfs_db -c frag -r /dev/mapper/vgmysql-lvmysql
actual 42586, ideal 3134, fragmentation factor 92.64%
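
For reference, the snapshotting on that LV looks roughly like the following
(names, sizes and mount point here are illustrative rather than our exact
commands):

lvcreate -s -n lvmysql_snap -L 50G /dev/vgmysql/lvmysql
mount -o nouuid,ro /dev/vgmysql/lvmysql_snap /mnt/mysql-snap   # nouuid: the snapshot has the same XFS UUID as the origin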



Regards,
David Fuller


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-12  1:04 Fragmentation Issue We Are Having David Fuller
@ 2012-04-12  2:16 ` Dave Chinner
  2012-04-12  2:55   ` David Fuller
  2012-04-12  7:57 ` Brian Candler
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2012-04-12  2:16 UTC (permalink / raw)
  To: David Fuller; +Cc: xfs

On Wed, Apr 11, 2012 at 06:04:25PM -0700, David Fuller wrote:
> We seem to be having an issue whereby our database server
> gets to 90% or higher fragmentation.  When it gets to this point
> we would need to remove from production and defrag using the
> xfs_fsr tool.

Bad assumption.

> The server does get a lot of writes and reads.  Is
> there something we can do to reduce the fragmentation or could
> this be a result of hard disk tweaks we use or mount options?
> 
> here are some of the tweaks we do:
> 
> /bin/echo "512" > /sys/block/sda/queue/read_ahead_kb
> /bin/echo "10000" > /sys/block/sda/queue/nr_requests
> /bin/echo "512" > /sys/block/sdb/queue/read_ahead_kb
> /bin/echo "10000" > /sys/block/sdb/queue/nr_requests
> /bin/echo "noop" > /sys/block/sda/queue/scheduler
> /bin/echo "noop" > /sys/block/sdb/queue/scheduler

They have no effect on filesystem fragmentation.

> And here are the mount options on one of our servers:
> 
>  xfs     rw,noikeep,allocsize=256M,logbufs=8,sunit=128,swidth=2304
> 
> the sunit and swidth vary on each server based on disk drives.
> 
> We do use LVM on the volume where the mysql data is stored
> as we need this for snapshotting.  Here is an example of a current state:
>
> xfs_db -c frag -r /dev/mapper/vgmysql-lvmysql
> actual 42586, ideal 3134, fragmentation factor 92.64%

Read this first:

http://xfs.org/index.php/XFS_FAQ#Q:_The_xfs_db_.22frag.22_command_says_I.27m_over_50.25.__Is_that_bad.3F

Then decide whether 10 extents per file is really a problem or not.
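
To put your numbers in perspective (rough arithmetic only; the file path
below is just an example):

# fragmentation factor = (actual - ideal) / actual
#   (42586 - 3134) / 42586 = 92.64%
# but actual/ideal = 42586 / 3134 is only ~13.6 extents per ideal
# extent, i.e. very roughly a dozen extents per file on average
xfs_bmap -v /path/to/one/big/table.ibd    # see how many extents a single file really has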

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-12  2:16 ` Dave Chinner
@ 2012-04-12  2:55   ` David Fuller
  2012-04-12  4:24     ` Eric Sandeen
  0 siblings, 1 reply; 15+ messages in thread
From: David Fuller @ 2012-04-12  2:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs



Dave,

Thanks so much for that informative read.  This helps me make the case that
systematic defrags are not needed and are bad for the system in general.

After reading this I did some checks against some of our larger tables and
found that on average we are storing about 2.5GB per extent.  That seems
pretty reasonable to me and does not require defragging at this time.

--David Fuller




On Wed, Apr 11, 2012 at 7:16 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Apr 11, 2012 at 06:04:25PM -0700, David Fuller wrote:
> > We seem to be having an issue whereby our database server
> > gets to 90% or higher fragmentation.  When it gets to this point
> > we would need to remove from production and defrag using the
> > xfs_fsr tool.
>
> Bad assumption.
>
> > The server does get a lot of writes and reads.  Is
> > there something we can do to reduce the fragmentation or could
> > this be a result of hard disk tweaks we use or mount options?
> >
> > here are some of the tweaks we do:
> >
> > /bin/echo "512" > /sys/block/sda/queue/read_ahead_kb
> > /bin/echo "10000" > /sys/block/sda/queue/nr_requests
> > /bin/echo "512" > /sys/block/sdb/queue/read_ahead_kb
> > /bin/echo "10000" > /sys/block/sdb/queue/nr_requests
> > /bin/echo "noop" > /sys/block/sda/queue/scheduler
> > /bin/echo "noop" > /sys/block/sdb/queue/scheduler
>
> They have no effect on filesystem fragmentation.
>
> > And here are the mount options on one of our servers:
> >
> >  xfs     rw,noikeep,allocsize=256M,logbufs=8,sunit=128,swidth=2304
> >
> > the sunit and swidth vary on each server based on disk drives.
> >
> > We do use LVM on the volume where the mysql data is stored
> > as we need this for snapshotting.  Here is an example of a current state:
> >
> > xfs_db -c frag -r /dev/mapper/vgmysql-lvmysql
> > actual 42586, ideal 3134, fragmentation factor 92.64%
>
> Read this first:
>
>
> http://xfs.org/index.php/XFS_FAQ#Q:_The_xfs_db_.22frag.22_command_says_I.27m_over_50.25.__Is_that_bad.3F
>
> Then decide whether 10 extents per file is really a problem or not.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-12  2:55   ` David Fuller
@ 2012-04-12  4:24     ` Eric Sandeen
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Sandeen @ 2012-04-12  4:24 UTC (permalink / raw)
  To: David Fuller; +Cc: xfs

On 4/11/12 9:55 PM, David Fuller wrote:
> Dave,
> 
> Thanks so much for that informative read.  This helps me make the case that systematic
> defrags are not needed and are bad for the system in general.
> 
> After reading this I did some checks against some of our larger tables and found that
> on average we are storing about 2.5GB per extent.  That seems pretty reasonable to me
> and does not require defragging at this time.

I've also added a visual aid to that faq entry to show how quickly the frag factor
approaches 100% :)
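
Roughly, the curve is just the formula evaluated at a few points (assuming
each file would ideally be a single extent; the actual graph is on the wiki):

frag factor = (actual - ideal) / actual ~= 1 - 1/(average extents per file)

  extents per file:    1      2      4     10     20    100
  frag factor:         0%    50%    75%    90%    95%    99%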

-Eric

> --David Fuller
> 
> 
> 
> 
> On Wed, Apr 11, 2012 at 7:16 PM, Dave Chinner <david@fromorbit.com <mailto:david@fromorbit.com>> wrote:
> 
>     On Wed, Apr 11, 2012 at 06:04:25PM -0700, David Fuller wrote:
>     > We seem to be having an issue whereby our database server
>     > gets to 90% or higher fragmentation.  When it gets to this point
>     > we would need to remove from production and defrag using the
>     > xfs_fsr tool.
> 
>     Bad assumption.
> 
>     > The server does get a lot of writes and reads.  Is
>     > there something we can do to reduce the fragmentation or could
>     > this be a result of hard disk tweaks we use or mount options?
>     >
>     > here are some of the tweaks we do:
>     >
>     > /bin/echo "512" > /sys/block/sda/queue/read_ahead_kb
>     > /bin/echo "10000" > /sys/block/sda/queue/nr_requests
>     > /bin/echo "512" > /sys/block/sdb/queue/read_ahead_kb
>     > /bin/echo "10000" > /sys/block/sdb/queue/nr_requests
>     > /bin/echo "noop" > /sys/block/sda/queue/scheduler
>     > /bin/echo "noop" > /sys/block/sdb/queue/scheduler
> 
>     They have no effect on filesystem fragmentation.
> 
>     > And here are the mount options on one of our servers:
>     >
>     >  xfs     rw,noikeep,allocsize=256M,logbufs=8,sunit=128,swidth=2304
>     >
>     > the sunit and swidth vary on each server based on disk drives.
>     >
>     > We do use LVM on the volume where the mysql data is stored
>     > as we need this for snapshotting.  Here is an example of a current state:
>     >
>     > xfs_db -c frag -r /dev/mapper/vgmysql-lvmysql
>     > actual 42586, ideal 3134, fragmentation factor 92.64%
> 
>     Read this first:
> 
>     http://xfs.org/index.php/XFS_FAQ#Q:_The_xfs_db_.22frag.22_command_says_I.27m_over_50.25.__Is_that_bad.3F
> 
>     Then decide whether 10 extents per file is really a problem or not.
> 
>     Cheers,
> 
>     Dave.
>     --
>     Dave Chinner
>     david@fromorbit.com <mailto:david@fromorbit.com>
> 
> 
> 
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-12  1:04 Fragmentation Issue We Are Having David Fuller
  2012-04-12  2:16 ` Dave Chinner
@ 2012-04-12  7:57 ` Brian Candler
  2012-04-13  0:09   ` David Fuller
  1 sibling, 1 reply; 15+ messages in thread
From: Brian Candler @ 2012-04-12  7:57 UTC (permalink / raw)
  To: David Fuller; +Cc: xfs

On Wed, Apr 11, 2012 at 06:04:25PM -0700, David Fuller wrote:
>    And here are the mount options on one of our servers:
> 
>     xfs     rw,noikeep,allocsize=256M,logbufs=8,sunit=128,swidth=2304

What's the total file system size? If it's over 1TB then you almost
certainly should have 'inode64' as well.
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F
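
A rough sketch of how you might check and apply it (device and mount point
are illustrative; note that only newly created inodes are affected, and some
kernels won't take inode64 on a remount, so a clean umount/mount plus an
/etc/fstab update may be needed):

df -h /var/lib/mysql           # is the filesystem over 1TB?
umount /var/lib/mysql
mount -o inode64,<your existing options> /dev/mapper/vgmysql-lvmysql /var/lib/mysql
grep lvmysql /proc/mounts      # confirm inode64 shows up in the options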

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-12  7:57 ` Brian Candler
@ 2012-04-13  0:09   ` David Fuller
  2012-04-13  7:19     ` Brian Candler
  0 siblings, 1 reply; 15+ messages in thread
From: David Fuller @ 2012-04-13  0:09 UTC (permalink / raw)
  To: Brian Candler; +Cc: xfs



Brian,

The total LVM volume group is 4.5 TB.  The logical volume is around 2.3TB
where the mysql data is stored.


--David F.




On Thu, Apr 12, 2012 at 12:57 AM, Brian Candler <B.Candler@pobox.com> wrote:

> On Wed, Apr 11, 2012 at 06:04:25PM -0700, David Fuller wrote:
> >    And here are the mount options on one of our servers:
> >
> >     xfs     rw,noikeep,allocsize=256M,logbufs=8,sunit=128,swidth=2304
>
> What's the total file system size? If it's over 1TB then you almost
> certainly should have 'inode64' as well.
> http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F
>


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-13  0:09   ` David Fuller
@ 2012-04-13  7:19     ` Brian Candler
  2012-04-13  7:56       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Candler @ 2012-04-13  7:19 UTC (permalink / raw)
  To: David Fuller; +Cc: xfs

On Thu, Apr 12, 2012 at 05:09:40PM -0700, David Fuller wrote:
>    The total LVM volume group is 4.5 TB.  The logical volume is around
>    2.3TB where the mysql data

Hence you have a 2.3TB XFS filesystem? You need inode64. The side warning
"performance sucks" is very true.  In particular, if you create a bunch of
files in the same directory, without inode64 XFS will scatter the extents
all over the disk rather than trying to allocate them next to each other
(possibly not a problem if you're only storing mysql data chunks though)
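
If you want to see how much scope there is for scattering, the AG geometry
is visible with xfs_info (the mount point is just an example):

xfs_info /var/lib/mysql     # agcount= and agsize= show how many AGs the allocator can spread files over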

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-13  7:19     ` Brian Candler
@ 2012-04-13  7:56       ` Dave Chinner
  2012-04-13  8:17         ` Brian Candler
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2012-04-13  7:56 UTC (permalink / raw)
  To: Brian Candler; +Cc: David Fuller, xfs

On Fri, Apr 13, 2012 at 08:19:05AM +0100, Brian Candler wrote:
> On Thu, Apr 12, 2012 at 05:09:40PM -0700, David Fuller wrote:
> >    The total LVM volume group is 4.5 TB.  The logical volume is around
> >    2.3TB where the mysql data
> 
> Hence you have a 2.3TB XFS filesystem? You need inode64.  The side warning
> "performance sucks" is very true. 

In some cases.

You can't just blindly assert that something is needed purely on
the size of the filesystem. Much more information is needed such as
block maps, which files the database regularly uses, how large those
files are, how they are laid out in the directory structure, etc.

For a workload with lots of files and directories, inode64 will be
better, but for a database with relatively few large files the
locality that inode64 gives you may not be any advantage at all.

And sometimes inode32 is the best option available because it
effectively separates data from metadata until the filesystem is
nearly full.

> In particular, if you create a bunch of
> files in the same directory, without inode64 XFS will scatter the extents
> all over the disk

It doesn't scatter them randomly like you are implying - it places
each subsequent new file in a different AG to balance the data load
across the entire filesystem address space. If you are writing lots
of large files in parallel, that's *exactly* the behaviour you want
to minimise fragmentation and maximise back end drive utilisation.
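
If you want to see this for yourself, a rough experiment on a scratch
filesystem (paths and sizes purely illustrative) is:

for i in 1 2 3 4; do dd if=/dev/zero of=/scratch/f$i bs=1M count=64; done
sync
for i in 1 2 3 4; do xfs_bmap -v /scratch/f$i; done
# without inode64 the AG column rotates from file to file; with
# inode64, files created in one directory tend to land in the same AG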

> rather than trying to allocate them next to each other

Which has always caused much more file fragmentation than the
inode32 style of allocation. That's why we have much more aggressive
speculative delayed allocation now - to make concurrent file write
behaviour and fragmentation much more like the inode32 allocator
without destroying locality too much.

> (possibly not a problem if you're only storing mysql data chunks though)

Almost definitely not a problem, which is exactly why I'm responding
here. inode64 is not the right solution for every problem, and
there's much more to selecting the right allocation policy for your
workloads than just looking at filesystem size.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-13  7:56       ` Dave Chinner
@ 2012-04-13  8:17         ` Brian Candler
  2012-04-17  0:26           ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Candler @ 2012-04-13  8:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: David Fuller, xfs

On Fri, Apr 13, 2012 at 05:56:34PM +1000, Dave Chinner wrote:
> In some cases.
> 
> You can't just blindly assert that something is needed purely on
> the size of the filesystem.

Thanks, but then perhaps the XFS FAQ needs updating. It warns that you might
have compatibility problems with old clients (NFS) and inode64, but it
doesn't say "for some workloads inode32 may perform better than inode64 on
large filesystems".

Also, aren't these orthogonal features?

(1) "I want all my inode metadata stored at the front of the disk"

(2) "I want files in the same directory to be distributed between AGs, not
    stored in the same AG"

If there are no explicit knobs for these behaviours, then it seems almost
accidental that limiting yourself to 32-bit inode numbers causes them to
happen (an implementation artefact).

Finally, what happens if you have a filesystem smaller than 1TB? I imagine
that XFS will scale the agsize down so that you have multiple AGs, but will
still have 32-bit inode numbers - so you will get the same behaviour as
inode64 on a large filesystem.  What happens then if your workload requires
behaviour (1) and/or (2) above for optimal performance?

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-13  8:17         ` Brian Candler
@ 2012-04-17  0:26           ` Dave Chinner
  2012-04-17  8:58             ` Brian Candler
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2012-04-17  0:26 UTC (permalink / raw)
  To: Brian Candler; +Cc: David Fuller, xfs

On Fri, Apr 13, 2012 at 09:17:25AM +0100, Brian Candler wrote:
> On Fri, Apr 13, 2012 at 05:56:34PM +1000, Dave Chinner wrote:
> > In some cases.
> > 
> > You can't just blindly assert that something is needed purely on
> > the size of the filesystem.
> 
> Thanks, but then perhaps the XFS FAQ needs updating. It warns that you might
> have compatibility problems with old clients (NFS) and inode64, but it
> doesn't say "for some workloads inode32 may perform better than inode64 on
> large filesystems".

The FAQ doesn't say anything about whether inode32 performs better
than inode64 or vice versa. All it talks about is inode allocation
locality and possible errors (like ENOSPC with lots of free space)
that can occur with inode32.

> Also, aren't these orthogonal features?
> 
> (1) "I want all my inode metadata stored at the front of the disk"
> 
> (2) "I want files in the same directory to be distributed between AGs, not
>     stored in the same AG"
> 
> If there are not explicit knobs for these behaviours, then it seems almost
> accidental that limiting yourself to 32-bit inode numbers causes them to
> happen (an implementation artefact).

The behaviour of inode32 was defined long before anyone who
currently works on XFS had any say in the matter. Most people really
consider it a nasty hack that was done to avoid needing to make NFS
clients 64 bit inode number clean back in 1998. At the time it was
unpopular, but considered the least worst solution to the problem.
The biggest issue was that it was made the default mount option...

It's now a historical artifact, and all we are doing is preserving
the behaviour of the allocation policies because there are plenty of
applications out there that rely on the way inode32 or inode64
behaves to achieve their performance.

> Finally, what happens if you have a filesystem smaller than 1TB? I imagine
> that XFS will scale the agsize down so that you have multiple AGs, but will
> still have 32-bit inode numbers - so you will get the same behaviour as
> inode64 on a large filesystem.  What happens then if your workload requires
> behaviour (1) and/or (2) above for optimal performance?

Then you get to choose the least worse option.

Making allocation policy more flexible is something that I've been
wanting to do for years - it was something I was working on when I
left SGI almost 4 years ago (along with metadata checksums). Here's
the patchset of what I'd written from that time:

http://oss.sgi.com/archives/xfs/2009-02/msg00250.html

You're more than welcome to pick it up and start working on it again
so that we can have a much more flexible allocation subsystem if you
want....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-17  0:26           ` Dave Chinner
@ 2012-04-17  8:58             ` Brian Candler
  2012-04-18  1:36               ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Candler @ 2012-04-17  8:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: David Fuller, xfs

On Tue, Apr 17, 2012 at 10:26:10AM +1000, Dave Chinner wrote:
> > > You can't just blindly assert that something is needed purely on
> > > the size of the filesystem.
> > 
> > Thanks, but then perhaps the XFS FAQ needs updating. It warns that you might
> > have compatibility problems with old clients (NFS) and inode64, but it
> > doesn't say "for some workloads inode32 may perform better than inode64 on
> > large filesystems".
> 
> The FAQ doesn't say anything about whether inode32 performs better
> than inode64 or vice versa.

With respect it does, although in only three words:
"Also, performance sucks".

Maybe it would be useful to expand this. How about:

"Also, performance sucks for many common workloads and benchmarks, such as
sequentially extracting or reading a large hierarchy of files.  This is
because in filesystems >1TB without inode64, files created within the same
parent directory are not created in the same allocation group with adjacent
extents."

If as you say inode32 was just a hack for broken NFS clients, then it seems
to me that the *intended* default performance characteristics are those of
inode64?  That is, the designers considered this to be the most appropriate
performance compromise for typical users?

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-17  8:58             ` Brian Candler
@ 2012-04-18  1:36               ` Dave Chinner
  2012-04-18  9:00                 ` Brian Candler
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2012-04-18  1:36 UTC (permalink / raw)
  To: Brian Candler; +Cc: David Fuller, xfs

On Tue, Apr 17, 2012 at 09:58:28AM +0100, Brian Candler wrote:
> On Tue, Apr 17, 2012 at 10:26:10AM +1000, Dave Chinner wrote:
> > > > You can't just blindly assert that something is needed purely on
> > > > the size of the filesystem.
> > > 
> > > Thanks, but then perhaps the XFS FAQ needs updating. It warns that you might
> > > have compatibility problems with old clients (NFS) and inode64, but it
> > > doesn't say "for some workloads inode32 may perform better than inode64 on
> > > large filesystems".
> > 
> > The FAQ doesn't say anything about whether inode32 performs better
> > than inode64 or vice versa.
> 
> With respect it does, although in only three words:
> "Also, performance sucks".

I missed that. It's a pretty useless comment.

> Maybe it would be useful to expand this. How about:
> 
> "Also, performance sucks for many common workloads and benchmarks, such as
> sequentially extracting or reading a large hierarchy of files.  This is
> because in filesystems >1TB without inode64, files created within the same
> parent directory are not created in the same allocation group with adjacent
> extents."

Even that generalisation is often wrong. It assumes that
separation of metadata and data causes performance degradation,
which is not a valid assumption for many common storage
configurations. And it assumes that inode32 cannot do locality of
files at all, when in fact it has tunable locality through a sysctl.

Indeed, here are some of the performance-enhancing games SGI play that
can only be achieved by using the inode32 allocator:

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=bks&fname=/SGI_Admin/LX_XFS_AG/ch07.html
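
For example (the value is illustrative; inode32 here simply means not
mounting with inode64, which is still the default):

sysctl fs.xfs.rotorstep=255                         # write up to 255 new files into an AG before rotoring to the next
echo "fs.xfs.rotorstep = 255" >> /etc/sysctl.conf   # make it persistent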

> If as you say inode32 was just a hack for broken NFS clients, then it seems
> to me that the *intended* default performance characteristics are those of
> inode64?

inode64 was the *only* allocation policy XFS was designed with.
inode32 was grafted on 5 years later. inode32 has found many other
uses since then, though, so it's not just an "NFS hack" anymore.

Indeed, if you have concatenation-based volumes and you don't have
lots of active directories at once, then inode32 is going to smash
inode64 when it comes to raw performance simply because it will keep
all legs of the concat busy instead of just the one that the
directory is located in. IOWs, in such configurations keeping tight
locality of allocation is actively harmful to performance....

> That is, the designers considered this to be the most appropriate
> performance compromise for typical users?

My experience with XFS is there is no such thing as a typical XFS
user.

Sure, inode64 was the way XFS was originally designed to work, but
history has shown that inode32 is actually significantly more
flexible and tunable than inode64 for different workloads. IOWs,
despite what was considered the best design 20 years ago, the
inode32 hack has proved a great advantage to XFS over the last 10 or
so years.  You can't tune inode64 at all - you get what you get -
while inode32 can be tweaked and the storage subsystem designed
around it to provide much better resource utilisation and
performance than you can get with inode64....

Simply put: you can't make sweeping generalisations about whether
inode64 is better than inode32 regardless of their original
purpose....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-18  1:36               ` Dave Chinner
@ 2012-04-18  9:00                 ` Brian Candler
  2012-04-19 23:12                   ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Candler @ 2012-04-18  9:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: David Fuller, xfs

On Wed, Apr 18, 2012 at 11:36:07AM +1000, Dave Chinner wrote:
> And it assumes that inode32 cannot do locality of
> files at all, when in fact it has tunable locality through a sysctl.
> 
> Indeed, here are some of the performance-enhancing games SGI play that
> can only be achieved by using the inode32 allocator:
> 
> http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=bks&fname=/SGI_Admin/LX_XFS_AG/ch07.html

Ah, that's new to me. So with inode32 and
  sysctl fs.xfs.rotorstep=255
you can get roughly the same locality benefit for sequentially-written files
as inode64?  (Aside: if you have two processes writing files to two different
directories, will they end up mixing their files in the same AG? That
could hurt performance at readback time if reading them sequentially)

I'm not really complaining about anything here except the dearth of
readily-accessible information. If I download the whole admin book:
http://techpubs.sgi.com/library/manuals/4000/007-4273-004/pdf/007-4273-004.pdf
I see the option inode64 only mentioned once in passing (as being
incompatible with ibound). So if there's detailled information on
what exactly inode64 does and when to use it, it must be somewhere else.

Here's my user story. As a newbie, my first test was to make a 3TB
filesystem on a single drive, and I was doing a simple workload consisting
of writing 1000 files per directory sequentially.  I could achieve a
sequential write speed of 75MB/s but only a sequential read speed of 25MB/s. 
After questioning this on the list, I eventually found that the files were
scattered around the disk (thanks to xfs_bmap) and was pointed to the
inode64 option, which I had seen in the FAQ but hadn't realised would make
such a big performance difference.

This wasn't just an idle benchmark: my main application is creating a corpus
of files and then processing that corpus of files (either sequentially, or
with multiple processes each working sequentially through subsections of the
corpus)

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
  2012-04-18  9:00                 ` Brian Candler
@ 2012-04-19 23:12                   ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2012-04-19 23:12 UTC (permalink / raw)
  To: Brian Candler; +Cc: David Fuller, xfs

On Wed, Apr 18, 2012 at 10:00:51AM +0100, Brian Candler wrote:
> On Wed, Apr 18, 2012 at 11:36:07AM +1000, Dave Chinner wrote:
> > And it assumes that inode32 cannot do locality of
> > files at all, when in fact it has tunable locality through a sysctl.
> > 
> > Indeed, here are some of the performance-enhancing games SGI play that
> > can only be achieved by using the inode32 allocator:
> > 
> > http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=bks&fname=/SGI_Admin/LX_XFS_AG/ch07.html
> 
> Ah, that's new to me. So with inode32 and
>   sysctl fs.xfs.rotorstep=255
> you can get roughly the same locality benefit for sequentially-written files
> as inode64?  (Aside: if you have two processes writing files to two different
> directories, will they end up mixing their files in the same AG? That
> could hurt performance at readback time if reading them sequentially)

As Richard pointed out, the filestreams allocator was designed
specifically to avoid this problem (mainly for file-per-frame,
real-time, concurrent ingest/playout of multiple 2k and 4k HD
uncompressed video streams on a single SAN).

> I'm not really complaining about anything here except the dearth of
> readily-accessible information. If I download the whole admin book:
> http://techpubs.sgi.com/library/manuals/4000/007-4273-004/pdf/007-4273-004.pdf
> I see the option inode64 only mentioned once in passing (as being
> incompatible with ibound). So if there's detailled information on
> what exactly inode64 does and when to use it, it must be somewhere else.

It's scattered around the place. The man pages, the FAQ, kernel
documentation, techpubs, user documentation, in the heads of
developers and experienced admins, etc.  There's no single
definitive source of the information you were looking for...

> Here's my user story. As a newbie, my first test was to make a 3TB
> filesystem on a single drive, and I was doing a simple workload consisting
> of writing 1000 files per directory sequentially.  I could achieve a
> sequential write speed of 75MB/s but only a sequential read speed of 25MB/s. 
> After questioning this on the list and eventually finding that files were
> scattered around the disk (thanks to xfs_bmap).  I was pointed to the
> inode64 option, which I had seen in the FAQ but hadn't realised how big a
> performance difference it would make.
> 
> This wasn't just an idle benchmark: my main application is creating a corpus
> of files and then processing that corpus of files (either sequentially, or
> with multiple processes each working sequentially through subsections of the
> corpus)

What you are learning is the reason why we say "use the defaults".
The process of learning what all the different options do is
difficult because it requires understanding how the internals of XFS
work. i.e.  You need to understand what your application does (which
you obviously do) and what the filesystem tunables do before really
being able to understand why a given option helps or hinders.

Given all the issues you had finding out this information, the most
valuable thing you could do right now is update the FAQ on the wiki
to correct the obvious problems, and maybe even add a new "what are
the differences between inode32 and inode64?" page to the wiki to
summarise what we've talked about in this thread. That will make it
much easier for others in future to find the same information....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fragmentation Issue We Are Having
@ 2012-04-19 19:54 Richard Scobie
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Scobie @ 2012-04-19 19:54 UTC (permalink / raw)
  To: xfs; +Cc: b.candler

Brian Candler wrote:

--------------------------

Ah, that's new to me. So with inode32 and
   sysctl fs.xfs.rotorstep=255
you can get roughly the same locality benefit for sequentially-written files
as inode64?  (Aside: if you have two processes writing files to two
different directories, will they end up mixing their files in the same AG? That
could hurt performance at readback time if reading them sequentially)

-----------------------------

The "filestreams" mount option may be of use here, see:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/ch06s16.html

and page 17 of:

http://oss.sgi.com/projects/xfs/training/xfs_slides_06_allocators.pdf
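
For reference it is just a mount option (device and mount point below are
illustrative):

mount -o filestreams /dev/mapper/vg-lv /srv/ingest
# each directory being written into gets its own AG(s) while files
# are being created there, so parallel writers don't interleave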

Regards,

Richard

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-04-19 23:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-12  1:04 Fragmentation Issue We Are Having David Fuller
2012-04-12  2:16 ` Dave Chinner
2012-04-12  2:55   ` David Fuller
2012-04-12  4:24     ` Eric Sandeen
2012-04-12  7:57 ` Brian Candler
2012-04-13  0:09   ` David Fuller
2012-04-13  7:19     ` Brian Candler
2012-04-13  7:56       ` Dave Chinner
2012-04-13  8:17         ` Brian Candler
2012-04-17  0:26           ` Dave Chinner
2012-04-17  8:58             ` Brian Candler
2012-04-18  1:36               ` Dave Chinner
2012-04-18  9:00                 ` Brian Candler
2012-04-19 23:12                   ` Dave Chinner
2012-04-19 19:54 Richard Scobie
