* XFS Preallocation
@ 2011-01-28  2:05 Jef Fox
  2011-01-28  4:52 ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Jef Fox @ 2011-01-28  2:05 UTC (permalink / raw)
  To: xfs


We are having some problems with preallocation of large files.  We have
found that we can preallocate about 500 1GB files on a volume using the
resvsp and truncate commands, but the extents are still showing up as
preallocated.  Is this a problem?  The OS appears to think the files are
allocated and correctly sized. 

 

For reference, we are trying to create files for an external piece of
equipment to write to a SSD with.  The SSD would then be mounted in RHEL
and the data pulled off in the 1G chunks.  Because of the nature of the
data, we need to constantly erase and recreate the files and
preallocation seems to be the fastest option.  We don't really care if
the data gets 0'ed out.  Is there another method - allocsp takes too
long for this application?  Or, does it matter if XFS thinks the extents
are preallocated but unwritten if no other files are written to the
disk?

 

Thanks

Jef

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: XFS Preallocation
  2011-01-28  2:05 XFS Preallocation Jef Fox
@ 2011-01-28  4:52 ` Dave Chinner
  2011-01-28 15:15   ` Jef Fox
  2011-01-28 17:33   ` Jef Fox
  0 siblings, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2011-01-28  4:52 UTC (permalink / raw)
  To: Jef Fox; +Cc: xfs

On Thu, Jan 27, 2011 at 07:05:33PM -0700, Jef Fox wrote:
> We are having some problems with preallocation of large files.  We have
> found that we can preallocate about 500 1GB files on a volume using the
> resvsp and truncate commands, but the extents are still showing up as
> preallocated.  Is this a problem?  The OS appears to think the files are
> allocated and correctly sized. 

That's the way it's supposed to work. Preallocated space stays
preallocated (i.e. reads as zeros) until it is written to, regardless
of whether you change the file size via truncate commands.
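
As a minimal sketch of that behaviour (not part of the original message):
the helper below preallocates a file and checks that every byte reads back
as zero. The function name prealloc_reads_zero is hypothetical, and it uses
fallocate(1) from util-linux rather than the xfs_io resvsp + truncate pair
discussed here, on the assumption that both leave unwritten extents on XFS.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): preallocate SIZE bytes for FILE
# and report whether the whole file reads back as zeros.
prealloc_reads_zero() {
    file=$1
    size=$2
    # fallocate(1) reserves blocks and sets the file size in one call; on XFS
    # the reserved extents are flagged unwritten. The fallback to a sparse
    # truncate is an assumption made here for filesystems without fallocate
    # support: it reserves nothing, but the file reads back the same way.
    fallocate -l "$size" "$file" 2>/dev/null || truncate -s "$size" "$file" || return 1
    # Count the bytes that are not NUL; an untouched preallocation has none.
    nonzero=$(($(tr -d '\000' < "$file" | wc -c)))
    if [ "$nonzero" -eq 0 ]; then
        echo "all zeros"
    else
        echo "$nonzero non-zero bytes"
    fi
}
```

Something like `prealloc_reads_zero /mnt/test/file0 1073741824` would exercise
it on a 1G file (the path is only an example). On XFS, `xfs_bmap -vp` on the
same file should list the extents with the preallocated flag set until
something writes to them through the filesystem.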

> For reference, we are trying to create files for an external piece of
> equipment to write to a SSD with.  The SSD would then be mounted in RHEL
> and the data pulled off in the 1G chunks.  Because of the nature of the
> data, we need to constantly erase and recreate the files and
> preallocation seems to be the fastest option.

What do you mean by "erase and recreate"? Do you mean you rm the
files, then preallocate them again?

If you were running 2.6.37+ and a TOT xfsprogs, there's also the
"zero" command that converts allocated space back to the
preallocated (zeroed) state without doing any IO. It's the
equivalent of unresvsp + resvsp in a single operation.

> We don't really care if
> the data gets 0'ed out.  Is there another method - allocsp takes too
> long for this application?

allocsp is a historical interface, pretty much useless and should
probably be removed. I can't think of any situation where allocsp
would be better than resvsp or zero....

> Or, does it matter if XFS thinks the extents
> are preallocated but unwritten if no other files are written to the
> disk?

I'm not sure what you are asking there...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE: XFS Preallocation
  2011-01-28  4:52 ` Dave Chinner
@ 2011-01-28 15:15   ` Jef Fox
  2011-01-28 17:33   ` Jef Fox
  1 sibling, 0 replies; 11+ messages in thread
From: Jef Fox @ 2011-01-28 15:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

 
> On Thu, Jan 27, 2011 at 07:05:33PM -0700, Jef Fox wrote:
> > We are having some problems with preallocation of large files.  We have
> > found that we can preallocate about 500 1GB files on a volume using the
> > resvsp and truncate commands, but the extents are still showing up as
> > preallocated.  Is this a problem?  The OS appears to think the files are
> > allocated and correctly sized.
> 
> That's the way it's supposed to work. Preallocated space stays
> preallocated (i.e reads as zeros) until it is written to, regardless
> of whether you change the file size via truncate commands.

I guess my main question is whether we (or the OS) care if the space
still shows "preallocated".  If we write to the files/drive/data blocks
outside of RHEL (and the filesystem), does XFS care that the extents
still show preallocated if we never write to them again - but only read
from them?  
 
> > For reference, we are trying to create files for an external piece of
> > equipment to write to a SSD with.  The SSD would then be mounted in RHEL
> > and the data pulled off in the 1G chunks.  Because of the nature of the
> > data, we need to constantly erase and recreate the files and
> > preallocation seems to be the fastest option.
> 
> What do you mean by "erase and recreate"? Do you mean you rm the
> files, then preallocate them again?

We were planning to TRIM/RESET out the SSD (partial or whole) and lay
down the filesystem again.

> If you were running 2.6.37+ and a TOT xfsprogs, there's also the
> "zero" command that converts allocated space back to the
> preallocated (zeroed) state without doing any IO. It's the
> equivalent unresvsp + resvsp in a single operation.
> 
> > We don't really care if
> > the data gets 0'ed out.  Is there another method - allocsp takes too
> > long for this application?
> 
> allocsp is historical interface, pretty much useless and should
> probably be removed. I can't think of any situation where allocsp
> would be better than resvsp or zero....
> 
> > Or, does it matter if XFS thinks the extents
> > are preallocated but unwritten if no other files are written to the
> > disk?

So, we setup the SSDs using preallocation for speed and truncate the
files so that the OS thinks the files are of a set size.  We pull the
drives from the system, and put them in an external device that writes
to our given block offsets (staying within the files and blocks we have
preallocated and defined).  We then put the device back into our system
to read the data back off via NFS/FTP/etc.  Will it matter that XFS will
still show unwritten but preallocated extents when the files actually
have data?  XFS/RHEL/etc will not know that the files were ever written
to, but using truncate appears to give us the necessary "trick" (for
lack of a better term) to get RH to think that the files are a certain
size and can be read from.  If we aren't adding files or reallocating
data on those drives, will it matter that everything shows preallocated
and not written?

 
> I'm not sure what you are asking there...
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


* RE: XFS Preallocation
  2011-01-28  4:52 ` Dave Chinner
  2011-01-28 15:15   ` Jef Fox
@ 2011-01-28 17:33   ` Jef Fox
  2011-01-29  0:17     ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Jef Fox @ 2011-01-28 17:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

I guess, disregard my previous message.  After some further testing of
examining the hard disk blocks, we see what you are saying - the file is
presented as 0s to the user even if the blocks are changed on the hard
disk.  So, we will always see 0s until we write to the extent.

So, I think our only question now is if there is a way to force the
extents to be marked as allocated without writing all of the data?  That
is, is there a fast way to lay down a file(s) of 1G size without
actually writing 1G of info?

> -----Original Message-----
> From: Dave Chinner [mailto:david@fromorbit.com]
> Sent: Thursday, January 27, 2011 9:52 PM
> To: Jef Fox
> Cc: xfs@oss.sgi.com
> Subject: Re: XFS Preallocation
> 
> On Thu, Jan 27, 2011 at 07:05:33PM -0700, Jef Fox wrote:
> > We are having some problems with preallocation of large files.  We have
> > found that we can preallocate about 500 1GB files on a volume using the
> > resvsp and truncate commands, but the extents are still showing up as
> > preallocated.  Is this a problem?  The OS appears to think the files are
> > allocated and correctly sized.
> 
> That's the way it's supposed to work. Preallocated space stays
> preallocated (i.e reads as zeros) until it is written to, regardless
> of whether you change the file size via truncate commands.
> 
> > For reference, we are trying to create files for an external piece of
> > equipment to write to a SSD with.  The SSD would then be mounted in RHEL
> > and the data pulled off in the 1G chunks.  Because of the nature of the
> > data, we need to constantly erase and recreate the files and
> > preallocation seems to be the fastest option.
> 
> What do you mean by "erase and recreate"? Do you mean you rm the
> files, then preallocate them again?
> 
> If you were running 2.6.37+ and a TOT xfsprogs, there's also the
> "zero" command that converts allocated space back to the
> preallocated (zeroed) state without doing any IO. It's the
> equivalent unresvsp + resvsp in a single operation.
> 
> > We don't really care if
> > the data gets 0'ed out.  Is there another method - allocsp takes too
> > long for this application?
> 
> allocsp is historical interface, pretty much useless and should
> probably be removed. I can't think of any situation where allocsp
> would be better than resvsp or zero....
> 
> > Or, does it matter if XFS thinks the extents
> > are preallocated but unwritten if no other files are written to the
> > disk?
> 
> I'm not sure what you are asking there...
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


* Re: XFS Preallocation
  2011-01-28 17:33   ` Jef Fox
@ 2011-01-29  0:17     ` Dave Chinner
  2011-02-01  4:45       ` Peter Vajgel
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2011-01-29  0:17 UTC (permalink / raw)
  To: Jef Fox; +Cc: xfs

On Fri, Jan 28, 2011 at 10:33:03AM -0700, Jef Fox wrote:
> I guess, disregard my previous message.  After some further testing of
> examining the hard disk blocks, we see what you are saying - the file is
> presented at 0s to the user even if the blocks are changed on the hard
> disk.  So, we will always see 0s until we write to the extent.
> 
> So, I think our only question now is if there is a way to force the
> extents to be marked as allocated without writing all of the data?  That
> is, is there a fast way to lay down a file(s) of 1G size without
> actually writing 1G of info.

Preallocation is the only option. Allowing preallocation without
marking extents as unwritten opens a massive security hole (i.e.
exposes stale data) so I say no to any request for addition of such
functionality (and have for years).

You've already demonstrated the workaround you can apply to the
problem for your very specialised application - when you put the
disk back into the original machine you can read the disk blocks
directly to get the data. i.e. use fiemap to map the location of the
file on disk and then read the data directly from the block device
underneath the filesystem...
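
A rough sketch of that workaround (not from the original message): the
helper below reads extent listings in the style printed by `filefrag -v`,
which uses the FIEMAP ioctl, and prints one dd command per extent that
would copy the blocks straight from the device. The name extents_to_dd and
its arguments are hypothetical. It assumes a single-device filesystem, that
the offsets are in filesystem blocks of the size passed in, and that the
filefrag column layout matches what is parsed here.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): turn `filefrag -v` style extent
# lines on stdin into dd commands that read each extent from the raw device.
# Usage: filefrag -v FILE | extents_to_dd DEVICE OUTFILE BLOCKSIZE
extents_to_dd() {
    awk -v dev="$1" -v out="$2" -v bs="$3" -F'[ :.]+' '
    {
        # Drop leading padding so the extent index becomes field 1.
        sub(/^[ \t]+/, "")
    }
    # Extent rows look like: "0:  0..  255:  34816..  35071:  256:  flags"
    # Field 2 = logical start, field 4 = physical start, field 6 = length.
    $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ &&
    $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ {
        # skip= is the offset on the device, seek= the offset in the output;
        # conv=notrunc keeps earlier extents when the next one is written.
        printf "dd if=%s of=%s bs=%s skip=%s seek=%s count=%s conv=notrunc\n",
               dev, out, bs, $4, $2, $6
    }'
}
```

The function only prints the commands, so they can be reviewed before
anything touches the device. Running them requires root, and the result is
only meaningful if nothing on the host has cached or rewritten those blocks
since the external device wrote them.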

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE: XFS Preallocation
  2011-01-29  0:17     ` Dave Chinner
@ 2011-02-01  4:45       ` Peter Vajgel
  2011-02-01  8:03         ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Vajgel @ 2011-02-01  4:45 UTC (permalink / raw)
  To: Dave Chinner, Jef Fox; +Cc: xfs

> Preallocation is the only option. Allowing preallocation without marking extents as unwritten opens a massive security hole (i.e.
> exposes stale data) so I say no to any request for addition of such functionality (and have for years).

How about opening this option to at least root (root can already read the device anyway)? There are cases when creating large files without writing to them is important. A good example is testing xfs overhead when doing a specific workload (like random reads) to large files. In this case we want to hit the disk on every request. Currently we have a workaround (below) but official support would be preferable.

--pv


# create_xfs_files

dev=$1
mntpt=$2
dircount=$3
filecount=$4
size=$5

# Umount.
umount $2

# Create the fs.
mkfs -t xfs -f -d unwritten=0,su=256k,sw=10 -l su=256k -L "/hay" $dev

# Clear unwritten flag - current xfs ignores this flag
typeset -i agcount=$(xfs_db -c "sb" -c "print" $dev | grep agcount)
typeset -i i=0
while [[ $i != $agcount ]]
do
  xfs_db -x -c "sb $i" -c "write versionnum 0xa4a4" $dev
  i=i+1
done

# Mount the filesystem.
mount -t xfs -o nobarrier,noatime,nodiratime,inode64,allocsize=1g $dev $mntpt

i=0
while [[ $i != $dircount ]]
do
  mkdir $mntpt/dir$i
  typeset -i j=0
  while [[ $j != $filecount ]]
  do
    file=$mntpt/dir$i/file$j
    xfs_io -f -c "resvsp 0 $size" $file
    inum=$(ls -i $file | awk '{print $1}')
    umount $mntpt
    xfs_db -x -c "inode $inum" -c "write core.size $size" $dev
    mount -t xfs -o nobarrier,noatime,nodiratime,inode64,allocsize=1g $dev $mntpt
    j=j+1
  done
  i=i+1
done


* Re: XFS Preallocation
  2011-02-01  4:45       ` Peter Vajgel
@ 2011-02-01  8:03         ` Dave Chinner
  2011-02-01 19:20           ` Peter Vajgel
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2011-02-01  8:03 UTC (permalink / raw)
  To: Peter Vajgel; +Cc: Jef Fox, xfs

On Tue, Feb 01, 2011 at 04:45:09AM +0000, Peter Vajgel wrote:
> > Preallocation is the only option. Allowing preallocation without
> > marking extents as unwritten opens a massive security hole (i.e.
> > exposes stale data) so I say no to any request for addition of
> > such functionality (and have for years).
> 
> How about opening this option to at least root (root can already
> read the device anyway)?.

# ls -l foo
-rw-r--r-- 1 dave dave        0 Aug 16 10:44 foo
#
# prealloc_without_unwritten 0 1048576 foo
# ls -l foo
-rw-r--r-- 1 dave dave  1048576 Aug 16 10:44 foo
#

Now user dave can read the stale data exposed by the root only
operation. Any combination of making the file available to a
non-root user after a preallocation-without-unwritten-extents
operation has this problem.  IOWs, just making such a syscall "root
only" doesn't solve the security problem.

To fix it, we have to require inodes have 0600 perms, owned by root,
and cannot be chmod/chowned to anyone else, ever. At that point,
we're requiring applications to run as root to use this
functionality. Same requirement as fiemap + reading from the block
device, which you can do right now without any kernel mods or filesystem
hacks...

> There are cases when creating large
> files without writing to them is important. A good example is
> testing xfs overhead when doing a specific workload (like random
> reads) to large files.

For testing it doesn't matter how long it takes you to write
the file in the first place.

> In this case we want to hit the disk on
> every request. Currently we have a workaround (below) but official
> support would be preferable.

Officially, we _removed_ the unwritten=0 option from mkfs because of
the security problems. Not to mention that it was never, ever
tested...

> 
> --pv
> 
> 
> # create_xfs_files
> 
> dev=$1
> mntpt=$2
> dircount=$3
> filecount=$4
> size=$5
> 
> # Umount.
> umount $2
> 
> # Create the fs.
> mkfs -t xfs -f -d unwritten=0,su=256k,sw=10 -l su=256k -L "/hay" $dev

Which fails due to:

unknown option -d unwritten=0
/* blocksize */         [-b log=n|size=num]
/* data subvol */       [-d agcount=n,agsize=n,file,name=xxx,size=num,
                            (sunit=value,swidth=value|su=num,sw=num),
                            sectlog=n|sectsize=num
.....

> # Clear unwritten flag - current xfs ignores this flag
> typeset -i agcount=$(xfs_db -c "sb" -c "print" $dev | grep agcount)
> typeset -i i=0
> while [[ $i != $agcount ]]
> do
>   xfs_db -x -c "sb $i" -c "write versionnum 0xa4a4" $dev
>   i=i+1
> done
> 
> # Mount the filesystem.
> mount -t xfs -o nobarrier,noatime,nodiratime,inode64,allocsize=1g $dev $mntpt
> 
> i=0
> while [[ $i != $dircount ]]
> do
>   mkdir $mntpt/dir$i
>   typeset -i j=0
>   while [[ $j != $filecount ]]
>   do
>     file=$mntpt/dir$i/file$j
>     xfs_io -f -c "resvsp 0 $size" $file
>     inum=$(ls -i $file | awk '{print $1}')
>     umount $mntpt
>     xfs_db -x -c "inode $inum" -c "write core.size $size" $dev
>     mount -t xfs -o nobarrier,noatime,nodiratime,inode64,allocsize=1g $dev $mntpt

That's quite a hack to work around the EOF zeroing that extending the
file size after allocating would do because the preallocated extents
beyond EOF are not marked unwritten. Perhaps truncating the file
first, then preallocating is what you want:

	xfs_io -f -c "truncate $size" -c "resvsp 0 $size" $file

>     j=j+1
>   done
>   i=i+1
> done

Regardless of all this, perhaps the most important point is that your
proposed use of XFS is fundamentally unsupportable by the linux XFS
community: you've got proprietary software on some external hardware
writing to the disk without going through the linux XFS kernel code.
You're basically in the same boat as people running proprietary
kernel modules - unless you can prove the problem is not caused by
your hw/sw or manual filesystem modifications, then it's a waste of
our (limited) resources to even look at the problem.  That generally
comes down to being able to reproduce the problem on a vanilla kernel
on a filesystem created with a supported mkfs....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE: XFS Preallocation
  2011-02-01  8:03         ` Dave Chinner
@ 2011-02-01 19:20           ` Peter Vajgel
  2011-02-01 20:12             ` Stan Hoeppner
  2011-02-02  0:07             ` Dave Chinner
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Vajgel @ 2011-02-01 19:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jef Fox, xfs


> -----Original Message-----
> From: Dave Chinner [mailto:david@fromorbit.com]
> Sent: Tuesday, February 01, 2011 12:04 AM
> To: Peter Vajgel
> Cc: Jef Fox; xfs@oss.sgi.com
> Subject: Re: XFS Preallocation
> 
> On Tue, Feb 01, 2011 at 04:45:09AM +0000, Peter Vajgel wrote:
> > > Preallocation is the only option. Allowing preallocation without
> > > marking extents as unwritten opens a massive security hole (i.e.
> > > exposes stale data) so I say no to any request for addition of such
> > > functionality (and have for years).
> >
> > How about opening this option to at least root (root can already read
> > the device anyway)?.
> 
> # ls -l foo
> -rw-r--r-- 1 dave dave        0 Aug 16 10:44 foo
> #
> # prealloc_without_unwritten 0 1048576 foo
> # ls -l foo
> -rw-r--r-- 1 dave dave  1048576 Aug 16 10:44 foo
> #
>
> Now user dave can read the stale data exposed by the root only operation. Any
> combination of making the file available to a non-root user after a
> preallocation-without-unwritten-extents operation has this problem.  IOWs, just
> making such a syscall "root only" doesn't solve the security problem.

Correct - if an admin made prealloc_without_unwritten runnable by any user then yes - but I would argue that such an admin should not even have root privileges. Vxfs had this ability since version 1 and I don't remember a single customer complaint about this feature. Most of the time the feature was used by db to preallocate large amounts of space knowing that they won't incur any overhead (even transactional) when doing direct io to the pre-allocated range. It could be that at those times even a transactional overhead was significant enough that we wanted to eliminate it.

> 
> To fix it, we have to require inodes have 0600 perms, owned by root, and cannot be
> chmod/chowned to anyone else, ever. At that point, we're requiring applications to run
> as root to to use this functionality. Same requirement as fiemap + reading from the
> block device, which you can do right without any kernel mods or filesystem hacks...
> 
> > There are cases when creating large
> > files without writing to them is important. A good example is testing
> > xfs overhead when doing a specific workload (like random
> > reads) to large files.
> 
> For testing it doesn't matter how long it takes you to write the file in the first place.

At the scale we operate it does. We have multiple variables so the number of combinations is large. We have hit every single possible hardware and software problem and problem resolution can take months if it takes days to reproduce the problem. Hardware vendors (disk, controller, motherboard manufacturers) are much more responsive when you can reproduce a problem on the fly in seconds (especially in comparative benchmarking). The tests usually run only a couple of minutes. With 12x3TB (possibly multiplied by a factor of X with our new platform) it would be unacceptable to wait for writes to finish.

> 
> > In this case we want to hit the disk on every request. Currently we
> > have a workaround (below) but official support would be preferable.
> 
> Officially, we _removed_ the unwritten=0 option from mkfs because of the security
> problems. Not to mention that it was never, ever tested...
> 
> >
> > --pv
> >
> >
> > # create_xfs_files
> >
> > dev=$1
> > mntpt=$2
> > dircount=$3
> > filecount=$4
> > size=$5
> >
> > # Umount.
> > umount $2
> >
> > # Create the fs.
> > mkfs -t xfs -f -d unwritten=0,su=256k,sw=10 -l su=256k -L "/hay" $dev
> 
> Which fails due to:
> 
> unknown option -d unwritten=0
> /* blocksize */         [-b log=n|size=num]
> /* data subvol */       [-d agcount=n,agsize=n,file,name=xxx,size=num,
>                             (sunit=value,swidth=value|su=num,sw=num),
>                             sectlog=n|sectsize=num .....

It still works for us but we tend to be conservative in moving our releases.

> 
> > # Clear unwritten flag - current xfs ignores this flag
> > typeset -i agcount=$(xfs_db -c "sb" -c "print" $dev | grep agcount)
> > typeset -i i=0
> > while [[ $i != $agcount ]]
> > do
> >   xfs_db -x -c "sb $i" -c "write versionnum 0xa4a4" $dev
> >   i=i+1
> > done
> >
> > # Mount the filesystem.
> > mount -t xfs -o nobarrier,noatime,nodiratime,inode64,allocsize=1g $dev $mntpt
> >
> > i=0
> > while [[ $i != $dircount ]]
> > do
> >   mkdir $mntpt/dir$i
> >   typeset -i j=0
> >   while [[ $j != $filecount ]]
> >   do
> >     file=$mntpt/dir$i/file$j
> >     xfs_io -f -c "resvsp 0 $size" $file
> >     inum=$(ls -i $file | awk '{print $1}')
> >     umount $mntpt
> >     xfs_db -x -c "inode $inum" -c "write core.size $size" $dev
> >     mount -t xfs -o nobarrier,noatime,nodiratime,inode64,allocsize=1g $dev $mntpt
> 
> That's quite a hack to work around the EOF zeroing that extending the file size after
> allocating would do because the preallocated extents beyond EOF are not marked
> unwritten. Perhaps truncating the file first, then preallocating is what you want:
> 
> 	xfs_io -f -c "truncate $size" -c "resvsp 0 $size" $file


I think I had it in reverse before - allocate and truncate but the truncate got stuck in a loop (probably zeroing out the extents?) making the node unresponsive to the point that it was impossible to ssh to it. It eventually returned but it took a while. But that was like 3 years ago. If I get to it I'll try the other order.

> 
> >     j=j+1
> >   done
> >   i=i+1
> > done
> 
> Regardless of all this, perhaps themost important point is that your proposed use of
> XFS is fundamentally unsupportable by the linux XFS
> community: you've got proprietary software on some external hardware writing to the
> disk without going through the linux XFS kernel code.
> You're basically in the same boat as people running proprietary kernel modules -
> unless you can prove the problem is not caused by your hw/sw or manual filesystem
> modifications, then it's a waste of our (limited) resources to even look at the problem.
> That generally comes down to being able to reproduce the problem on a vanilla kernel
> on a filesystem created with a supported mkfs....

Understood. That's why I limit this hack only to testing. I would never even dream of putting this into production. Although one could assume that if xfs_check/xfs_repair bless the filesystem before it's mounted you would be safe. But then you might be exposing yourself to bugs in xfs_check/xfs_repair which might have been overlooked since it's not the usual way of using xfs.

Thank you,

Peter

> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


* Re: XFS Preallocation
  2011-02-01 19:20           ` Peter Vajgel
@ 2011-02-01 20:12             ` Stan Hoeppner
  2011-02-01 22:47               ` Peter Vajgel
  2011-02-02  0:07             ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2011-02-01 20:12 UTC (permalink / raw)
  To: Peter Vajgel; +Cc: Jef Fox, xfs

Peter Vajgel put forth on 2/1/2011 1:20 PM:

> At the scale we operate it does. We have multiple variables so the number of combinations is large. We have hit every single possible hardware and software problem and problem resolution can take months if it takes days to reproduce the problem. Hardware vendors (disk, controller, motherboard manufacturers) are much more responsive when you can reproduce a problem on the fly in seconds (especially in comparative benchmarking). The tests usually run only couple of minutes. With 12x3TB (possibly multiplied by a factor of X with our new platform) it would be unacceptable to wait for writes to finish.

Hi Peter,

When you mention scale, you're referring to the storage back end at
facebook.com, your employer, correct?

-- 
Stan



* RE: XFS Preallocation
  2011-02-01 20:12             ` Stan Hoeppner
@ 2011-02-01 22:47               ` Peter Vajgel
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Vajgel @ 2011-02-01 22:47 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Jef Fox, xfs

Correct. But I've been using similar methods for evaluation and benchmarking purposes at my previous employers as well. I guess it's hard to part with the tools you learn to use. It's not a secret that we use xfs at Facebook. Xfs is the filesystem of choice for the media infrastructure group at the moment. Different projects are free to pick anything they want but I think the database tier uses xfs as well. Our usage tends to create large files (databases, haystacks) so limited fragmentation, preallocation and as-close-to-raw performance are important features for us - all aspects where xfs excels. When I talk about the multiple variables which affect the testing, I mean RAID level, RAID stripe size (same stripe size sometimes produces different results from different controller vendors), IO scheduler, memory available, number of threads, readahead size and other external and application tuning variables.

Peter

> -----Original Message-----
> From: Stan Hoeppner [mailto:stan@hardwarefreak.com]
> Sent: Tuesday, February 01, 2011 12:12 PM
> To: Peter Vajgel
> Cc: Dave Chinner; Jef Fox; xfs@oss.sgi.com
> Subject: Re: XFS Preallocation
> 
> Peter Vajgel put forth on 2/1/2011 1:20 PM:
> 
> > At the scale we operate it does. We have multiple variables so the number of
> > combinations is large. We have hit every single possible hardware and software
> > problem and problem resolution can take months if it takes days to reproduce the
> > problem. Hardware vendors (disk, controller, motherboard manufacturers) are much
> > more responsive when you can reproduce a problem on the fly in seconds (especially
> > in comparative benchmarking). The tests usually run only couple of minutes. With
> > 12x3TB (possibly multiplied by a factor of X with our new platform) it would be
> > unacceptable to wait for writes to finish.
> 
> Hi Peter,
> 
> When you mention scale, you're referring to the storage back end at facebook.com,
> your employer, correct?
> 
> --
> Stan
> 


* Re: XFS Preallocation
  2011-02-01 19:20           ` Peter Vajgel
  2011-02-01 20:12             ` Stan Hoeppner
@ 2011-02-02  0:07             ` Dave Chinner
  1 sibling, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2011-02-02  0:07 UTC (permalink / raw)
  To: Peter Vajgel; +Cc: Jef Fox, xfs

On Tue, Feb 01, 2011 at 07:20:18PM +0000, Peter Vajgel wrote:
> 
> > -----Original Message-----
> > From: Dave Chinner [mailto:david@fromorbit.com]
> > Sent: Tuesday, February 01, 2011 12:04 AM
> > To: Peter Vajgel
> > Cc: Jef Fox; xfs@oss.sgi.com
> > Subject: Re: XFS Preallocation
> > 
> > On Tue, Feb 01, 2011 at 04:45:09AM +0000, Peter Vajgel wrote:
> > > > Preallocation is the only option. Allowing preallocation without
> > > > marking extents as unwritten opens a massive security hole (i.e.
> > > > exposes stale data) so I say no to any request for addition of such
> > > > functionality (and have for years).
> > >
> > > How about opening this option to at least root (root can already read
> > > the device anyway)?.
> > 
> > # ls -l foo
> > -rw-r--r-- 1 dave dave        0 Aug 16 10:44 foo
> > # prealloc_without_unwritten 0 1048576 foo
> > # ls -l foo
> > -rw-r--r-- 1 dave dave  1048576 Aug 16 10:44 foo
> > #
> > 
> > Now user dave can read the stale data exposed by the root only operation. Any
> > combination of making the file available to a non-root user after a preallocation-
> > without-unwritten-extents
> > operation has this problem.  IOWs, just making such a syscall "root only" doesn't
> > solve the security problem.

> Correct - if an admin made prealloc_without_unwritten runnable by
> any user then yes - but I would argue that such an admin should
> not even have root privileges.

Not exactly what I was trying to demonstrate - the above example uses
the convention of "#" as indicating a root shell (like "$" indicates a
user shell). IOWs, it is root preallocating on a file that is already
owned and readable by another user.

As it is, I think this is a likely use case, because not many people
are going to want to run their applications that would use such
functionality (e.g. database servers) as root. My main point is,
though, that if you can do it, people will do it whether they
understand the ramifications or not.

> Vxfs had this ability since version
> 1 and I don't' remember a single customer complaint about this
> feature.

And XFS used to do it, too. Unwritten extents were only implemented
in XFS (in 1997) once customers complained about the security
problems involved with preallocation without zeroing....
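
What unwritten extents guarantee can be observed from userspace: a
freshly preallocated range reads back as zeros rather than as whatever
the blocks last held. A minimal sketch, assuming the util-linux
fallocate tool and a filesystem that supports it; prealloc_reads_zero
is a hypothetical name, and the check is generic rather than
XFS-specific.

```shell
#!/bin/sh
# Hypothetical helper: preallocate SIZE bytes in a new FILE and verify
# that every byte reads back as zero.
# Returns 0 if the file is exactly SIZE zero bytes, 1 if it differs,
# 2 if the check could not be performed (FILE already exists, fallocate
# is missing or unsupported here, or cmp itself failed).
prealloc_reads_zero() {
    file=$1
    size=$2
    [ -e "$file" ] && return 2
    command -v fallocate >/dev/null 2>&1 || return 2
    fallocate -l "$size" "$file" 2>/dev/null || return 2
    # Compare the file against exactly SIZE bytes taken from /dev/zero.
    head -c "$size" /dev/zero | cmp -s - "$file"
}
```

Usage: prealloc_reads_zero /mnt/test/probe 1048576; echo $?
The path and size here are placeholders.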

Further, with the rise of ricer filesystem tuning blogs, an emerging
meme was that you should turn off unwritten extents to make your
bonnie++ benchmark run go faster (without even understanding that
bonnie++ doesn't use preallocation). Search engines then started
throwing these up as good information.

Worse is the fact that they still do.  e.g. the first hit on google
for "XFS performance tweaking" makes this suggestion - it's a blog
entry from 2003 and google still considers it the most relevant hit,
even though it is full of misleading and plain wrong information.
IOWs, we're dealing with mis-information as much as a security
problem here...

> Most of the times the feature was used by db to
> preallocate large amounts of space knowing that they won't incur
> any overhead (even transactional) when doing direct io to the
> pre-allocated range. It could be that at those times even a
> transactional overhead was significant enough that we wanted to
> eliminate it.

You're talking historically about VXFS, right?

BTW, have you measured the overhead of unwritten extent conversion
on XFS recently? Is it actually a performance problem for you in
production?

> > To fix it, we have to require inodes have 0600 perms, owned by root, and cannot be
> > chmod/chowned to anyone else, ever. At that point, we're requiring applications to run
> > as root to use this functionality. Same requirement as fiemap + reading from the
> > block device, which you can do right now without any kernel mods or filesystem hacks...
> > 
> > > There are cases when creating large
> > > files without writing to them is important. A good example is testing
> > > xfs overhead when doing a specific workload (like random
> > > reads) to large files.
> > 
> > For testing it doesn't matter how long it takes you to write the
> > file in the first place.
> 
> At the scale we operate it does. We have multiple variables so the
> number of combinations is large. We have hit every single possible
> hardware and software problem and problem resolution can take
> months if it takes days to reproduce the problem. Hardware vendors
> (disk, controller, motherboard manufacturers) are much more
> responsive when you can reproduce a problem on the fly in seconds
> (especially in comparative benchmarking). The tests usually run
> only a couple of minutes. With 12x3TB (possibly multiplied by a
> factor of X with our new platform) it would be unacceptable to
> wait for writes to finish.

It's still a test environment, and I think you'd agree that you can
do things in test environments that you'd never, ever do in a
production setting.

> > >   while [[ $j != $filecount ]]
> > >   do
> > >     file=$mntpt/dir$i/file$j
> > >     xfs_io -f -c "resvsp 0 $size" $file
> > >     inum=$(ls -i $file | awk '{print $1}')
> > >     umount $mntpt
> > >     xfs_db -x -c "inode $inum" -c "write core.size $size" $dev
> > >     mount -t xfs -o nobarrier,noatime,nodiratime,inode64,allocsize=1g $dev $mntpt
> > 
> > That's quite a hack to work around the EOF zeroing that extending the file size after
> > allocating would do because the preallocated extents beyond EOF are not marked
> > unwritten. Perhaps truncating the file first, then preallocating is what you want:
> > 
> > 	xfs_io -f -c "truncate $size" -c "resvsp 0 $size" $file
> 
> 
> I think I had it in reverse before - allocate and truncate but the
> truncate got stuck in a loop (probably zeroing out the extents?)

*nod*

> making the node unresponsive to the point that it was impossible
> to ssh to it. It eventually returned but it took a while. But that
> was like 3 years ago. If I get to it I'll try the other order.

Yes, that would probably be how a 3yo kernel would react to such
a buffered IO writeback storm....
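
The ordering suggested above (set the size with truncate first, then
reserve the space) extends to the multi-file case. A minimal sketch
that only prints the commands, so they can be reviewed before being
run; gen_prealloc_cmds and the fileN naming are hypothetical, and the
xfs_io invocation is the one quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print one xfs_io command per file, using the
# truncate-then-resvsp order. Nothing is executed here.
# Arguments: target directory, number of files, size per file.
# Assumes DIR contains no whitespace or shell metacharacters, since the
# printed path is not quoted.
gen_prealloc_cmds() {
    dir=$1
    count=$2
    size=$3
    j=0
    while [ "$j" -lt "$count" ]; do
        printf 'xfs_io -f -c "truncate %s" -c "resvsp 0 %s" %s/file%s\n' \
            "$size" "$size" "$dir" "$j"
        j=$((j + 1))
    done
}
```

Usage: gen_prealloc_cmds /mnt/test/dir0 4 1073741824 | sh
The directory and counts are placeholders; dropping the "| sh" gives a
dry run.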

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2011-02-02  0:06 UTC | newest]

Thread overview: 11+ messages
2011-01-28  2:05 XFS Preallocation Jef Fox
2011-01-28  4:52 ` Dave Chinner
2011-01-28 15:15   ` Jef Fox
2011-01-28 17:33   ` Jef Fox
2011-01-29  0:17     ` Dave Chinner
2011-02-01  4:45       ` Peter Vajgel
2011-02-01  8:03         ` Dave Chinner
2011-02-01 19:20           ` Peter Vajgel
2011-02-01 20:12             ` Stan Hoeppner
2011-02-01 22:47               ` Peter Vajgel
2011-02-02  0:07             ` Dave Chinner
