All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/15] xfs: patchset for 3.11
@ 2013-06-27  6:04 Dave Chinner
  2013-06-27  6:04 ` [PATCH 01/15] xfs: update mount options documentation Dave Chinner
                   ` (15 more replies)
  0 siblings, 16 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

Hi Ben,

Here's the current set of patches I have for 3.11. There are 2 new
patches here - the dquot transaction reservation fix, and a simple
one to implement the change count functionality on CRC enabled
filesystems.

Cheers,

Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 01/15] xfs: update mount options documentation
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27 14:48   ` Ben Myers
  2013-06-27  6:04 ` [PATCH 02/15] xfs: add pluging for bulkstat readahead Dave Chinner
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Because it's horribly out of date.

And mark various deprecated options as deprecated and give them a
removal date.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 Documentation/filesystems/xfs.txt |  317 ++++++++++++++++++++++++-------------
 1 file changed, 209 insertions(+), 108 deletions(-)

diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index 83577f0..12525b1 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -18,6 +18,8 @@ Mount Options
 =============
 
 When mounting an XFS filesystem, the following options are accepted.
+For boolean mount options, the names with the (*) suffix is the
+default behaviour.
 
   allocsize=size
 	Sets the buffered I/O end-of-file preallocation size when
@@ -25,97 +27,128 @@ When mounting an XFS filesystem, the following options are accepted.
 	Valid values for this option are page size (typically 4KiB)
 	through to 1GiB, inclusive, in power-of-2 increments.
 
-  attr2/noattr2
-	The options enable/disable (default is disabled for backward
-	compatibility on-disk) an "opportunistic" improvement to be
-	made in the way inline extended attributes are stored on-disk.
-	When the new form is used for the first time (by setting or
-	removing extended attributes) the on-disk superblock feature
-	bit field will be updated to reflect this format being in use.
+	The default behaviour is for dynamic end-of-file
+	preallocation size, which uses a set of heuristics to
+	optimise the preallocation size based on the current
+	allocation patterns within the file and the access patterns
+	to the file. Specifying a fixed allocsize value turns off
+	the dynamic behaviour.
+
+  attr2
+  noattr2
+	The options enable/disable an "opportunistic" improvement to
+	be made in the way inline extended attributes are stored
+	on-disk.  When the new form is used for the first time when
+	attr2 is selected (either when setting or removing extended
+	attributes) the on-disk superblock feature bit field will be
+	updated to reflect this format being in use.
+
+	The default behaviour is determined by the on-disk feature
+	bit indicating that attr2 behaviour is active. If either
+	mount option it set, then that becomes the new default used
+	by the filesystem.
 
 	CRC enabled filesystems always use the attr2 format, and so
 	will reject the noattr2 mount option if it is set.
 
-  barrier
-	Enables the use of block layer write barriers for writes into
-	the journal and unwritten extent conversion.  This allows for
-	drive level write caching to be enabled, for devices that
-	support write barriers.
+  barrier (*)
+  nobarrier
+	Enables/disables the use of block layer write barriers for
+	writes into the journal and for data integrity operations.
+	This allows for drive level write caching to be enabled, for
+	devices that support write barriers.
 
   discard
-	Issue command to let the block device reclaim space freed by the
-	filesystem.  This is useful for SSD devices, thinly provisioned
-	LUNs and virtual machine images, but may have a performance
-	impact.
-
-  dmapi
-	Enable the DMAPI (Data Management API) event callouts.
-	Use with the "mtpt" option.
-
-  grpid/bsdgroups and nogrpid/sysvgroups
-	These options define what group ID a newly created file gets.
-	When grpid is set, it takes the group ID of the directory in
-	which it is created; otherwise (the default) it takes the fsgid
-	of the current process, unless the directory has the setgid bit
-	set, in which case it takes the gid from the parent directory,
-	and also gets the setgid bit set if it is a directory itself.
-
-  ihashsize=value
-	In memory inode hashes have been removed, so this option has
-	no function as of August 2007. Option is deprecated.
-
-  ikeep/noikeep
-	When ikeep is specified, XFS does not delete empty inode clusters
-	and keeps them around on disk. ikeep is the traditional XFS
-	behaviour. When noikeep is specified, empty inode clusters
-	are returned to the free space pool. The default is noikeep for
-	non-DMAPI mounts, while ikeep is the default when DMAPI is in use.
-
-  inode64
-	Indicates that XFS is allowed to create inodes at any location
-	in the filesystem, including those which will result in inode
-	numbers occupying more than 32 bits of significance.  This is
-	the default allocation option. Applications which do not handle
-	inode numbers bigger than 32 bits, should use inode32 option.
+  nodiscard (*)
+	Enable/disable the issuing of commands to let the block
+	device reclaim space freed by the filesystem.  This is
+	useful for SSD devices, thinly provisioned LUNs and virtual
+	machine images, but may have a performance impact.
+
+	Note: It is currently recommended that you use the fstrim
+	application to discard unused blocks rather than the discard
+	mount option because the performance impact of this option
+	is quite severe.
+
+  grpid/bsdgroups
+  nogrpid/sysvgroups (*)
+	These options define what group ID a newly created file
+	gets.  When grpid is set, it takes the group ID of the
+	directory in which it is created; otherwise it takes the
+	fsgid of the current process, unless the directory has the
+	setgid bit set, in which case it takes the gid from the
+	parent directory, and also gets the setgid bit set if it is
+	a directory itself.
+
+  filestreams
+	Make the data allocator use the filestreams allocation mode
+	across the entire filesystem rather than just on directories
+	configured to use it.
+
+  ikeep
+  noikeep (*)
+	When ikeep is specified, XFS does not delete empty inode
+	clusters and keeps them around on disk.  When noikeep is
+	specified, empty inode clusters are returned to the free
+	space pool.
 
   inode32
-	Indicates that XFS is limited to create inodes at locations which
-	will not result in inode numbers with more than 32 bits of
-	significance. This is provided for backwards compatibility, since
-	64 bits inode numbers might cause problems for some applications
-	that cannot handle large inode numbers.
-
-  largeio/nolargeio
+  inode64 (*)
+	When inode32 is specified, it indicates that XFS limits
+	inode creation to locations which will not result in inode
+	numbers with more than 32 bits of significance.
+
+	When inode64 is specified, it indicates that XFS is allowed
+	to create inodes at any location in the filesystem,
+	including those which will result in inode numbers occupying
+	more than 32 bits of significance. 
+
+	inode32 is provided for backwards compatibility with older
+	systems and applications, since 64 bits inode numbers might
+	cause problems for some applications that cannot handle
+	large inode numbers.  If applications are in use which do
+	not handle inode numbers bigger than 32 bits, the inode32
+	option should be specified.
+
+
+  largeio
+  nolargeio (*)
 	If "nolargeio" is specified, the optimal I/O reported in
-	st_blksize by stat(2) will be as small as possible to allow user
-	applications to avoid inefficient read/modify/write I/O.
-	If "largeio" specified, a filesystem that has a "swidth" specified
-	will return the "swidth" value (in bytes) in st_blksize. If the
-	filesystem does not have a "swidth" specified but does specify
-	an "allocsize" then "allocsize" (in bytes) will be returned
-	instead.
-	If neither of these two options are specified, then filesystem
-	will behave as if "nolargeio" was specified.
+	st_blksize by stat(2) will be as small as possible to allow
+	user applications to avoid inefficient read/modify/write
+	I/O.  This is typically the page size of the machine, as
+	this is the granularity of the page cache.
+
+	If "largeio" specified, a filesystem that was created with a
+	"swidth" specified will return the "swidth" value (in bytes)
+	in st_blksize. If the filesystem does not have a "swidth"
+	specified but does specify an "allocsize" then "allocsize"
+	(in bytes) will be returned instead. Otherwise the behaviour
+	is the same as if "nolargeio" was specified.
 
   logbufs=value
-	Set the number of in-memory log buffers.  Valid numbers range
-	from 2-8 inclusive.
-	The default value is 8 buffers for filesystems with a
-	blocksize of 64KiB, 4 buffers for filesystems with a blocksize
-	of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB
-	and 2 buffers for all other configurations.  Increasing the
-	number of buffers may increase performance on some workloads
-	at the cost of the memory used for the additional log buffers
-	and their associated control structures.
+	Set the number of in-memory log buffers.  Valid numbers
+	range from 2-8 inclusive.
+
+	The default value is 8 buffers.
+
+	If the memory cost of 8 log buffers is too high on small
+	systems, then it may be reduced at some cost to performance
+	on metadata intensive workloads. The logbsize option below
+	controls the size of each buffer and so is also relevent to
+	this case.
 
   logbsize=value
-	Set the size of each in-memory log buffer.
-	Size may be specified in bytes, or in kilobytes with a "k" suffix.
-	Valid sizes for version 1 and version 2 logs are 16384 (16k) and
-	32768 (32k).  Valid sizes for version 2 logs also include
-	65536 (64k), 131072 (128k) and 262144 (256k).
-	The default value for machines with more than 32MiB of memory
-	is 32768, machines with less memory use 16384 by default.
+	Set the size of each in-memory log buffer.  The size may be
+	specified in bytes, or in kilobytes with a "k" suffix.
+	Valid sizes for version 1 and version 2 logs are 16384 (16k)
+	and 32768 (32k).  Valid sizes for version 2 logs also
+	include 65536 (64k), 131072 (128k) and 262144 (256k). The
+	logbsize must be an integer multiple of the log
+	stripe unit configured at mkfs time.
+
+	The default value for for version 1 logs is 32768, while the
+	default value for version 2 logs is MAX(32768, log_sunit).
 
   logdev=device and rtdev=device
 	Use an external log (metadata journal) and/or real-time device.
@@ -124,16 +157,11 @@ When mounting an XFS filesystem, the following options are accepted.
 	optional, and the log section can be separate from the data
 	section or contained within it.
 
-  mtpt=mountpoint
-	Use with the "dmapi" option.  The value specified here will be
-	included in the DMAPI mount event, and should be the path of
-	the actual mountpoint that is used.
-
   noalign
-	Data allocations will not be aligned at stripe unit boundaries.
-
-  noatime
-	Access timestamps are not updated when a file is read.
+	Data allocations will not be aligned at stripe unit
+	boundaries. This is only relevant to filesystems created
+	with non-zero data alignment parameters (sunit, swidth) by
+	mkfs.
 
   norecovery
 	The filesystem will be mounted without running log recovery.
@@ -144,8 +172,14 @@ When mounting an XFS filesystem, the following options are accepted.
 	the mount will fail.
 
   nouuid
-	Don't check for double mounted file systems using the file system uuid.
-	This is useful to mount LVM snapshot volumes.
+	Don't check for double mounted file systems using the file
+	system uuid.  This is useful to mount LVM snapshot volumes,
+	and often used in combination with "norecovery" for mounting
+	read-only snapshots.
+
+  noquota
+	Forcibly turns off all quota accounting and enforcement
+	within the filesystem.
 
   uquota/usrquota/uqnoenforce/quota
 	User disk quota accounting enabled, and limits (optionally)
@@ -160,24 +194,64 @@ When mounting an XFS filesystem, the following options are accepted.
 	enforced.  Refer to xfs_quota(8) for further details.
 
   sunit=value and swidth=value
-	Used to specify the stripe unit and width for a RAID device or
-	a stripe volume.  "value" must be specified in 512-byte block
-	units.
-	If this option is not specified and the filesystem was made on
-	a stripe volume or the stripe width or unit were specified for
-	the RAID device at mkfs time, then the mount system call will
-	restore the value from the superblock.  For filesystems that
-	are made directly on RAID devices, these options can be used
-	to override the information in the superblock if the underlying
-	disk layout changes after the filesystem has been created.
-	The "swidth" option is required if the "sunit" option has been
-	specified, and must be a multiple of the "sunit" value.
+	Used to specify the stripe unit and width for a RAID device
+	or a stripe volume.  "value" must be specified in 512-byte
+	block units. These options are only relevant to filesystems
+	that were created with non-zero data alignment parameters.
+
+	The sunit and swidth parameters specified must be compatible
+	with the existing filesystem alignment characteristics.  In
+	general, that means the only valid changes to sunit are
+	increasing it by a power-of-2 multiple. Valid swidth values
+	are any integer multiple of a valid sunit value.
+
+	Typically the only time these mount options are necessary if
+	after an underlying RAID device has had it's geometry
+	modified, such as adding a new disk to a RAID5 lun and
+	reshaping it.
 
   swalloc
 	Data allocations will be rounded up to stripe width boundaries
 	when the current end of file is being extended and the file
 	size is larger than the stripe width size.
 
+  wsync
+	When specified, all filesystem namespace operations are
+	executed synchronously. This ensures that when the namespace
+	operation (create, unlink, etc) completes, the change to the
+	namespace is on stable storage. This is useful in HA setups
+	where failover must not result in clients seeing
+	inconsistent namespace presentation during or after a
+	failover event.
+
+
+Deprecated Mount Options
+========================
+
+  delaylog/nodelaylog
+	Delayed logging is the only logging method that XFS supports
+	now, so these mount options are now ignored.
+
+	Due for removal in 3.12.
+
+  ihashsize=value
+	In memory inode hashes have been removed, so this option has
+	no function as of August 2007. Option is deprecated.
+
+	Due for removal in 3.12.
+
+  irixsgid
+	This behaviour is now controlled by a sysctl, so the mount
+	option is ignored.
+
+	Due for removal in 3.12.
+
+  osyncisdsync
+  osyncisosync
+	O_SYNC and O_DSYNC are fully supported, so there is no need
+	for these options any more.
+
+	Due for removal in 3.12.
 
 sysctls
 =======
@@ -189,15 +263,20 @@ The following sysctls are available for the XFS filesystem:
 	in /proc/fs/xfs/stat.  It then immediately resets to "0".
 
   fs.xfs.xfssyncd_centisecs	(Min: 100  Default: 3000  Max: 720000)
-  	The interval at which the xfssyncd thread flushes metadata
-  	out to disk.  This thread will flush log activity out, and
-  	do some processing on unlinked inodes.
+	The interval at which the filesystem flushes metadata
+	out to disk and runs internal cache cleanup routines.
 
-  fs.xfs.xfsbufd_centisecs	(Min: 50  Default: 100	Max: 3000)
-	The interval at which xfsbufd scans the dirty metadata buffers list.
+  fs.xfs.filestream_centisecs	(Min: 1  Default: 3000  Max: 360000)
+	The interval at which the filesystem ages filestreams cache
+	references and returns timed-out AGs back to the free stream
+	pool.
 
-  fs.xfs.age_buffer_centisecs	(Min: 100  Default: 1500  Max: 720000)
-	The age at which xfsbufd flushes dirty metadata buffers to disk.
+  fs.xfs.speculative_prealloc_lifetime
+		(Units: seconds   Min: 1  Default: 300  Max: 86400)
+	The interval at which the background scanning for inodes
+	with unused speculative preallocation runs. The scan
+	removes unused preallocation from clean inodes and releases
+	the unused space back to the free pool.
 
   fs.xfs.error_level		(Min: 0  Default: 3  Max: 11)
 	A volume knob for error reporting when internal errors occur.
@@ -254,9 +333,31 @@ The following sysctls are available for the XFS filesystem:
 	by the xfs_io(8) chattr command on a directory to be
 	inherited by files in that directory.
 
+  fs.xfs.inherit_nodefrag	(Min: 0  Default: 1  Max: 1)
+	Setting this to "1" will cause the "nodefrag" flag set
+	by the xfs_io(8) chattr command on a directory to be
+	inherited by files in that directory.
+
   fs.xfs.rotorstep		(Min: 1  Default: 1  Max: 256)
 	In "inode32" allocation mode, this option determines how many
 	files the allocator attempts to allocate in the same allocation
 	group before moving to the next allocation group.  The intent
 	is to control the rate at which the allocator moves between
 	allocation groups when allocating extents for new files.
+
+Deprecated Sysctls
+==================
+
+  fs.xfs.xfsbufd_centisecs	(Min: 50  Default: 100	Max: 3000)
+	Dirty metadata is now tracked by the log subsystem and
+	flushing is driven by log space and idling demands. The
+	xfsbufd no longer exists, so this syctl does nothing.
+
+	Due for removal in 3.14.
+
+  fs.xfs.age_buffer_centisecs	(Min: 100  Default: 1500  Max: 720000)
+	Dirty metadata is now tracked by the log subsystem and
+	flushing is driven by log space and idling demands. The
+	xfsbufd no longer exists, so this syctl does nothing.
+
+	Due for removal in 3.14.
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 02/15] xfs: add pluging for bulkstat readahead
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
  2013-06-27  6:04 ` [PATCH 01/15] xfs: update mount options documentation Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 03/15] xfs: plug directory buffer readahead Dave Chinner
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

I was running some tests on bulkstat on CRC enabled filesystems when
I noticed that all the IO being issued was 8k in size, regardless of
the fact taht we are issuing sequential 8k buffers for inodes
clusters. The IO size shoul dbe 16k for 256 byte inodes, and 32k for
512 byte inodes, but this wasn't happening.

blktrace showed that there was an explict plug and unplug happening
around each readahead IO from _xfs_buf_ioapply, and the unplug was
causing the IO to be issued immediately. Hence no opportunity was
being given to the elevator to merge adjacent readahead requests and
dispatch them as a single IO.

Add plugging around the inode chunk readahead dispatch loop tin
bulkstat to ensure that we don't unplug the queue between adjacent
inode buffer readahead IOs and so we get fewer, larger IO requests
hitting the storage subsystemi for bulkstat.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/xfs_itable.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 2ea7d40..06d004d 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -383,11 +383,13 @@ xfs_bulkstat(
 			 * Also start read-ahead now for this chunk.
 			 */
 			if (r.ir_freecount < XFS_INODES_PER_CHUNK) {
+				struct blk_plug	plug;
 				/*
 				 * Loop over all clusters in the next chunk.
 				 * Do a readahead if there are any allocated
 				 * inodes in that cluster.
 				 */
+				blk_start_plug(&plug);
 				agbno = XFS_AGINO_TO_AGBNO(mp, r.ir_startino);
 				for (chunkidx = 0;
 				     chunkidx < XFS_INODES_PER_CHUNK;
@@ -399,6 +401,7 @@ xfs_bulkstat(
 							agbno, nbcluster,
 							&xfs_inode_buf_ops);
 				}
+				blk_finish_plug(&plug);
 				irbp->ir_startino = r.ir_startino;
 				irbp->ir_freecount = r.ir_freecount;
 				irbp->ir_free = r.ir_free;
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 03/15] xfs: plug directory buffer readahead
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
  2013-06-27  6:04 ` [PATCH 01/15] xfs: update mount options documentation Dave Chinner
  2013-06-27  6:04 ` [PATCH 02/15] xfs: add pluging for bulkstat readahead Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 04/15] xfs: don't use speculative prealloc for small files Dave Chinner
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Similar to bulkstat inode chunk readahead, we need to plug directory
data buffer readahead during getdents to ensure that we can merge
adjacent readahead requests and sort out of order requests optimally
before they are dispatched. This improves the readahead efficiency
and reduces the IO load it generates as the IO patterns are
significantly better for both contiguous and fragmented directories.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/xfs_dir2_leaf.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/xfs/xfs_dir2_leaf.c b/fs/xfs/xfs_dir2_leaf.c
index da71a18..bb22aac 100644
--- a/fs/xfs/xfs_dir2_leaf.c
+++ b/fs/xfs/xfs_dir2_leaf.c
@@ -1108,6 +1108,7 @@ xfs_dir2_leaf_readbuf(
 	struct xfs_mount	*mp = dp->i_mount;
 	struct xfs_buf		*bp = *bpp;
 	struct xfs_bmbt_irec	*map = mip->map;
+	struct blk_plug		plug;
 	int			error = 0;
 	int			length;
 	int			i;
@@ -1236,6 +1237,7 @@ xfs_dir2_leaf_readbuf(
 	/*
 	 * Do we need more readahead?
 	 */
+	blk_start_plug(&plug);
 	for (mip->ra_index = mip->ra_offset = i = 0;
 	     mip->ra_want > mip->ra_current && i < mip->map_blocks;
 	     i += mp->m_dirblkfsbs) {
@@ -1287,6 +1289,7 @@ xfs_dir2_leaf_readbuf(
 			}
 		}
 	}
+	blk_finish_plug(&plug);
 
 out:
 	*bpp = bp;
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 04/15] xfs: don't use speculative prealloc for small files
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (2 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 03/15] xfs: plug directory buffer readahead Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 05/15] xfs: don't do IO when creating an new inode Dave Chinner
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Dedicated small file workloads have been seeing significant free
space fragmentation causing premature inode allocation failure
when large inode sizes are in use. A particular test case showed
that a workload that runs to a real ENOSPC on 256 byte inodes would
fail inode allocation with ENOSPC about about 80% full with 512 byte
inodes, and at about 50% full with 1024 byte inodes.

The same workload, when run with -o allocsize=4096 on 1024 byte
inodes would run to being 100% full before giving ENOSPC. That is,
no freespace fragmentation at all.

The issue was caused by the specific IO pattern the application had
- the framework it was using did not support direct IO, and so it
was emulating it by using fadvise(DONT_NEED). The result was that
the data was getting written back before the speculative prealloc
had been trimmed from memory by the close(), and so small single
block files were being allocated with 2 blocks, and then having one
truncated away. The result was lots of small 4k free space extents,
and hence each new 8k allocation would take another 8k from
contiguous free space and turn it into 4k of allocated space and 4k
of free space.

Hence inode allocation, which requires contiguous, aligned
allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k
(1024 byte inodes) can fail to find sufficiently large freespace and
hence fail while there is still lots of free space available.

There's a simple fix for this, and one that has precendence in the
allocator code already - don't do speculative allocation unless the
size of the file is larger than a certain size. In this case, that
size is the minimum default preallocation size:
mp->m_writeio_blocks. And to keep with the concept of being nice to
people when the files are still relatively small, cap the prealloc
to mp->m_writeio_blocks until the file goes over a stripe unit is
size, at which point we'll fall back to the current behaviour based
on the last extent size.

This will effectively turn off speculative prealloc for very small
files, keep preallocation low for small files, and behave as it
currently does for any file larger than a stripe unit. This
completely avoids the freespace fragmentation problem this
particular IO pattern was causing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/xfs_iomap.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 8f8aaee..6a70964 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -284,6 +284,15 @@ xfs_iomap_eof_want_preallocate(
 		return 0;
 
 	/*
+	 * If the file is smaller than the minimum prealloc and we are using
+	 * dynamic preallocation, don't do any preallocation at all as it is
+	 * likely this is the only write to the file that is going to be done.
+	 */
+	if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE) &&
+	    XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_writeio_blocks))
+		return 0;
+
+	/*
 	 * If there are any real blocks past eof, then don't
 	 * do any speculative allocation.
 	 */
@@ -345,6 +354,10 @@ xfs_iomap_eof_prealloc_initial_size(
 	if (mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)
 		return 0;
 
+	/* If the file is small, then use the minimum prealloc */
+	if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign))
+		return 0;
+
 	/*
 	 * As we write multiple pages, the offset will always align to the
 	 * start of a page and hence point to a hole at EOF. i.e. if the size is
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 05/15] xfs: don't do IO when creating an new inode
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (3 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 04/15] xfs: don't use speculative prealloc for small files Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 06/15] xfs: xfs_ifree doesn't need to modify the inode buffer Dave Chinner
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When we are allocating a new inode, we read the inode cluster off
disk to increment the generation number. We are already using a
random generation number for newly allocated inodes, so if we are not
using the ikeep mode, we can just generate a new generation number
when we initialise the newly allocated inode.

This avoids the need for reading the inode buffer during inode
creation. This will speed up allocation of inodes in cold, partially
allocated clusters as they will no longer need to be read from disk
during allocation. It will also reduce the CPU overhead of inode
allocation by not having the process the buffer read, even on cache
hits.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/xfs_inode.c |   36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7f7be5f..d1f76da 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1028,6 +1028,11 @@ xfs_dinode_calc_crc(
 
 /*
  * Read the disk inode attributes into the in-core inode structure.
+ *
+ * If we are initialising a new inode and we are not utilising the
+ * XFS_MOUNT_IKEEP inode cluster mode, we can simple build the new inode core
+ * with a random generation number. If we are keeping inodes around, we need to
+ * read the inode cluster to get the existing generation number off disk.
  */
 int
 xfs_iread(
@@ -1047,6 +1052,22 @@ xfs_iread(
 	if (error)
 		return error;
 
+	/* shortcut IO on inode allocation if possible */
+	if ((iget_flags & XFS_IGET_CREATE) &&
+	    !(mp->m_flags & XFS_MOUNT_IKEEP)) {
+		/* initialise the on-disk inode core */
+		memset(&ip->i_d, 0, sizeof(ip->i_d));
+		ip->i_d.di_magic = XFS_DINODE_MAGIC;
+		ip->i_d.di_gen = prandom_u32();
+		if (xfs_sb_version_hascrc(&mp->m_sb)) {
+			ip->i_d.di_version = 3;
+			ip->i_d.di_ino = ip->i_ino;
+			uuid_copy(&ip->i_d.di_uuid, &mp->m_sb.sb_uuid);
+		} else
+			ip->i_d.di_version = 2;
+		return 0;
+	}
+
 	/*
 	 * Get pointers to the on-disk inode and the buffer containing it.
 	 */
@@ -1133,17 +1154,16 @@ xfs_iread(
 	xfs_buf_set_ref(bp, XFS_INO_REF);
 
 	/*
-	 * Use xfs_trans_brelse() to release the buffer containing the
-	 * on-disk inode, because it was acquired with xfs_trans_read_buf()
-	 * in xfs_imap_to_bp() above.  If tp is NULL, this is just a normal
+	 * Use xfs_trans_brelse() to release the buffer containing the on-disk
+	 * inode, because it was acquired with xfs_trans_read_buf() in
+	 * xfs_imap_to_bp() above.  If tp is NULL, this is just a normal
 	 * brelse().  If we're within a transaction, then xfs_trans_brelse()
 	 * will only release the buffer if it is not dirty within the
 	 * transaction.  It will be OK to release the buffer in this case,
-	 * because inodes on disk are never destroyed and we will be
-	 * locking the new in-core inode before putting it in the hash
-	 * table where other processes can find it.  Thus we don't have
-	 * to worry about the inode being changed just because we released
-	 * the buffer.
+	 * because inodes on disk are never destroyed and we will be locking the
+	 * new in-core inode before putting it in the cache where other
+	 * processes can find it.  Thus we don't have to worry about the inode
+	 * being changed just because we released the buffer.
 	 */
  out_brelse:
 	xfs_trans_brelse(tp, bp);
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 06/15] xfs: xfs_ifree doesn't need to modify the inode buffer
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (4 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 05/15] xfs: don't do IO when creating an new inode Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 07/15] xfs: Introduce ordered log vector support Dave Chinner
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Long ago, bulkstat used to read inodes directly fromteh backing
buffer for speed. This had the unfortunate problem of being cache
incoherent with unlinks, and so xfs_ifree() had to mark the inode
as free directly in the backing buffer. bulkstat was changed some
time ago to use inode cache coherent lookups, and so will never see
unlinked inodes in it's lookups. Hence xfs_ifree() does not need to
touch the inode backing buffer anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/xfs_inode.c |   32 ++++----------------------------
 1 file changed, 4 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d1f76da..9ecfe1e 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2048,8 +2048,6 @@ xfs_ifree(
 	int			error;
 	int			delete;
 	xfs_ino_t		first_ino;
-	xfs_dinode_t    	*dip;
-	xfs_buf_t       	*ibp;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(ip->i_d.di_nlink == 0);
@@ -2062,14 +2060,13 @@ xfs_ifree(
 	 * Pull the on-disk inode from the AGI unlinked list.
 	 */
 	error = xfs_iunlink_remove(tp, ip);
-	if (error != 0) {
+	if (error)
 		return error;
-	}
 
 	error = xfs_difree(tp, ip->i_ino, flist, &delete, &first_ino);
-	if (error != 0) {
+	if (error)
 		return error;
-	}
+
 	ip->i_d.di_mode = 0;		/* mark incore inode as free */
 	ip->i_d.di_flags = 0;
 	ip->i_d.di_dmevmask = 0;
@@ -2081,31 +2078,10 @@ xfs_ifree(
 	 * by reincarnations of this inode.
 	 */
 	ip->i_d.di_gen++;
-
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 
-	error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &dip, &ibp,
-			       0, 0);
-	if (error)
-		return error;
-
-        /*
-	* Clear the on-disk di_mode. This is to prevent xfs_bulkstat
-	* from picking up this inode when it is reclaimed (its incore state
-	* initialzed but not flushed to disk yet). The in-core di_mode is
-	* already cleared  and a corresponding transaction logged.
-	* The hack here just synchronizes the in-core to on-disk
-	* di_mode value in advance before the actual inode sync to disk.
-	* This is OK because the inode is already unlinked and would never
-	* change its di_mode again for this inode generation.
-	* This is a temporary hack that would require a proper fix
-	* in the future.
-	*/
-	dip->di_mode = 0;
-
-	if (delete) {
+	if (delete)
 		error = xfs_ifree_cluster(ip, tp, first_ino);
-	}
 
 	return error;
 }
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 07/15] xfs: Introduce ordered log vector support
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (5 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 06/15] xfs: xfs_ifree doesn't need to modify the inode buffer Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 08/15] xfs: Introduce an ordered buffer item Dave Chinner
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

And "ordered log vector" is a log vector that is used for
tracking a log item through the CIL and into the AIL as part of the
log checkpointing. These ordered log vectors are special in that
they are not written to to journal in any way, and are not accounted
to the checkpoint being written.

The reason for this behaviour is to allow operations to attach items
to transactions and have them follow the normal transactional
lifecycle without actually having to write them to the journal. This
allows logging of items that track high level logical changes and
writing them to the log, while the physical items being modified
pass through into the AIL and pin the tail of the log (and therefore
the logical item in the log) until all the modified items are
physically written to disk.

IOWs, it allows us to write metadata without physically logging
every individual change but still maintain the full transactional
integrity guarantees we currently have w.r.t. crash recovery.

This change modifies some of the CIL item insertion loops, as
ordered log vectors introduce some new constraints as they don't
track any data. One advantage of this change is that it combines
two log vector chain walks into a single pass, so there is less
overhead in the transaction commit pass as well. It also kills some
unused code in the log vector walk loop when committing the CIL.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/xfs_log.c     |   22 ++++++++++++---
 fs/xfs/xfs_log.h     |    2 ++
 fs/xfs/xfs_log_cil.c |   75 ++++++++++++++++++++++++++++++++++----------------
 3 files changed, 71 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index b345a7c..d852a2b 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1963,6 +1963,10 @@ xlog_write_calc_vec_length(
 		headers++;
 
 	for (lv = log_vector; lv; lv = lv->lv_next) {
+		/* we don't write ordered log vectors */
+		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
+			continue;
+
 		headers += lv->lv_niovecs;
 
 		for (i = 0; i < lv->lv_niovecs; i++) {
@@ -2216,7 +2220,7 @@ xlog_write(
 	index = 0;
 	lv = log_vector;
 	vecp = lv->lv_iovecp;
-	while (lv && index < lv->lv_niovecs) {
+	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
 		void		*ptr;
 		int		log_offset;
 
@@ -2236,13 +2240,22 @@ xlog_write(
 		 * This loop writes out as many regions as can fit in the amount
 		 * of space which was allocated by xlog_state_get_iclog_space().
 		 */
-		while (lv && index < lv->lv_niovecs) {
-			struct xfs_log_iovec	*reg = &vecp[index];
+		while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
+			struct xfs_log_iovec	*reg;
 			struct xlog_op_header	*ophdr;
 			int			start_rec_copy;
 			int			copy_len;
 			int			copy_off;
+			bool			ordered = false;
+
+			/* ordered log vectors have no regions to write */
+			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
+				ASSERT(lv->lv_niovecs == 0);
+				ordered = true;
+				goto next_lv;
+			}
 
+			reg = &vecp[index];
 			ASSERT(reg->i_len % sizeof(__int32_t) == 0);
 			ASSERT((unsigned long)ptr % sizeof(__int32_t) == 0);
 
@@ -2302,12 +2315,13 @@ xlog_write(
 				break;
 
 			if (++index == lv->lv_niovecs) {
+next_lv:
 				lv = lv->lv_next;
 				index = 0;
 				if (lv)
 					vecp = lv->lv_iovecp;
 			}
-			if (record_cnt == 0) {
+			if (record_cnt == 0 && ordered == false) {
 				if (!lv)
 					return 0;
 				break;
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 5caee96..b20918c 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -105,6 +105,8 @@ struct xfs_log_vec {
 	int			lv_buf_len;	/* size of formatted buffer */
 };
 
+#define XFS_LOG_VEC_ORDERED	(-1)
+
 /*
  * Structure used to pass callback function and the function's argument
  * to the log manager.
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index d0833b5..02b9cf3 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -127,6 +127,7 @@ xlog_cil_prepare_log_vecs(
 		int	index;
 		int	len = 0;
 		uint	niovecs;
+		bool	ordered = false;
 
 		/* Skip items which aren't dirty in this transaction. */
 		if (!(lidp->lid_flags & XFS_LID_DIRTY))
@@ -137,14 +138,30 @@ xlog_cil_prepare_log_vecs(
 		if (!niovecs)
 			continue;
 
+		/*
+		 * Ordered items need to be tracked but we do not wish to write
+		 * them. We need a logvec to track the object, but we do not
+		 * need an iovec or buffer to be allocated for copying data.
+		 */
+		if (niovecs == XFS_LOG_VEC_ORDERED) {
+			ordered = true;
+			niovecs = 0;
+		}
+
 		new_lv = kmem_zalloc(sizeof(*new_lv) +
 				niovecs * sizeof(struct xfs_log_iovec),
 				KM_SLEEP|KM_NOFS);
 
+		new_lv->lv_item = lidp->lid_item;
+		new_lv->lv_niovecs = niovecs;
+		if (ordered) {
+			/* track as an ordered logvec */
+			new_lv->lv_buf_len = XFS_LOG_VEC_ORDERED;
+			goto next;
+		}
+
 		/* The allocated iovec region lies beyond the log vector. */
 		new_lv->lv_iovecp = (struct xfs_log_iovec *)&new_lv[1];
-		new_lv->lv_niovecs = niovecs;
-		new_lv->lv_item = lidp->lid_item;
 
 		/* build the vector array and calculate it's length */
 		IOP_FORMAT(new_lv->lv_item, new_lv->lv_iovecp);
@@ -165,6 +182,7 @@ xlog_cil_prepare_log_vecs(
 		}
 		ASSERT(ptr == new_lv->lv_buf + new_lv->lv_buf_len);
 
+next:
 		if (!ret_lv)
 			ret_lv = new_lv;
 		else
@@ -191,8 +209,18 @@ xfs_cil_prepare_item(
 
 	if (old) {
 		/* existing lv on log item, space used is a delta */
-		ASSERT(!list_empty(&lv->lv_item->li_cil));
-		ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs);
+		ASSERT((old->lv_buf && old->lv_buf_len && old->lv_niovecs) ||
+			old->lv_buf_len == XFS_LOG_VEC_ORDERED);
+
+		/*
+		 * If the new item is ordered, keep the old one that is already
+		 * tracking dirty or ordered regions
+		 */
+		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
+			ASSERT(!lv->lv_buf);
+			kmem_free(lv);
+			return;
+		}
 
 		*len += lv->lv_buf_len - old->lv_buf_len;
 		*diff_iovecs += lv->lv_niovecs - old->lv_niovecs;
@@ -201,10 +229,11 @@ xfs_cil_prepare_item(
 	} else {
 		/* new lv, must pin the log item */
 		ASSERT(!lv->lv_item->li_lv);
-		ASSERT(list_empty(&lv->lv_item->li_cil));
 
-		*len += lv->lv_buf_len;
-		*diff_iovecs += lv->lv_niovecs;
+		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) {
+			*len += lv->lv_buf_len;
+			*diff_iovecs += lv->lv_niovecs;
+		}
 		IOP_PIN(lv->lv_item);
 
 	}
@@ -259,18 +288,24 @@ xlog_cil_insert_items(
 	 * We can do this safely because the context can't checkpoint until we
 	 * are done so it doesn't matter exactly how we update the CIL.
 	 */
-	for (lv = log_vector; lv; lv = lv->lv_next)
-		xfs_cil_prepare_item(log, lv, &len, &diff_iovecs);
-
-	/* account for space used by new iovec headers  */
-	len += diff_iovecs * sizeof(xlog_op_header_t);
-
 	spin_lock(&cil->xc_cil_lock);
+	for (lv = log_vector; lv; ) {
+		struct xfs_log_vec *next = lv->lv_next;
 
-	/* move the items to the tail of the CIL */
-	for (lv = log_vector; lv; lv = lv->lv_next)
+		ASSERT(lv->lv_item->li_lv || list_empty(&lv->lv_item->li_cil));
+		lv->lv_next = NULL;
+
+		/*
+		 * xfs_cil_prepare_item() may free the lv, so move the item on
+		 * the CIL first.
+		 */
 		list_move_tail(&lv->lv_item->li_cil, &cil->xc_cil);
+		xfs_cil_prepare_item(log, lv, &len, &diff_iovecs);
+		lv = next;
+	}
 
+	/* account for space used by new iovec headers  */
+	len += diff_iovecs * sizeof(xlog_op_header_t);
 	ctx->nvecs += diff_iovecs;
 
 	/*
@@ -381,9 +416,7 @@ xlog_cil_push(
 	struct xfs_cil_ctx	*new_ctx;
 	struct xlog_in_core	*commit_iclog;
 	struct xlog_ticket	*tic;
-	int			num_lv;
 	int			num_iovecs;
-	int			len;
 	int			error = 0;
 	struct xfs_trans_header thdr;
 	struct xfs_log_iovec	lhdr;
@@ -428,12 +461,9 @@ xlog_cil_push(
 	 * side which is currently locked out by the flush lock.
 	 */
 	lv = NULL;
-	num_lv = 0;
 	num_iovecs = 0;
-	len = 0;
 	while (!list_empty(&cil->xc_cil)) {
 		struct xfs_log_item	*item;
-		int			i;
 
 		item = list_first_entry(&cil->xc_cil,
 					struct xfs_log_item, li_cil);
@@ -444,11 +474,7 @@ xlog_cil_push(
 			lv->lv_next = item->li_lv;
 		lv = item->li_lv;
 		item->li_lv = NULL;
-
-		num_lv++;
 		num_iovecs += lv->lv_niovecs;
-		for (i = 0; i < lv->lv_niovecs; i++)
-			len += lv->lv_iovecp[i].i_len;
 	}
 
 	/*
@@ -701,6 +727,7 @@ xfs_log_commit_cil(
 	if (commit_lsn)
 		*commit_lsn = log->l_cilp->xc_ctx->sequence;
 
+	/* xlog_cil_insert_items() destroys log_vector list */
 	xlog_cil_insert_items(log, log_vector, tp->t_ticket);
 
 	/* check we didn't blow the reservation */
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 08/15] xfs: Introduce an ordered buffer item
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (6 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 07/15] xfs: Introduce ordered log vector support Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 09/15] xfs: Inode create log items Dave Chinner
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

If we have a buffer that we have modified but we do not wish to
physically log in a transaction (e.g. we've logged a logical
change), we still need to ensure that transactional integrity is
maintained. Hence we must not move the tail of the log past the
transaction that the buffer is associated with before the buffer is
written to disk.

This means these special buffers still need to be included in the
transaction and added to the AIL just like a normal buffer, but we
do not want the modifications to the buffer written into the
transaction. IOWs, what we want is an "ordered buffer" that
maintains the same transactional life cycle as a physically logged
buffer, just without the transcribing of the modifications to the
log.

Hence we need to flag the buffer as an "ordered buffer" to avoid
including it in vector size calculations or formatting during the
transaction. Once the transaction is committed, the buffer appears
for all intents to be the same as a physically logged buffer as it
transitions through the log and AIL.

Relogging will also work just fine for such an ordered buffer - the
logical transaction will be replayed before the subsequent
modifications that relog the buffer, so everything will be
reconstructed correctly by recovery.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/xfs_buf_item.c  |   75 +++++++++++++++++++++++++++++++-----------------
 fs/xfs/xfs_buf_item.h  |    4 ++-
 fs/xfs/xfs_trace.h     |    4 +++
 fs/xfs/xfs_trans.h     |    1 +
 fs/xfs/xfs_trans_buf.c |   34 ++++++++++++++++++++--
 5 files changed, 87 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 4ec4317..61f6876 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -140,6 +140,16 @@ xfs_buf_item_size(
 
 	ASSERT(bip->bli_flags & XFS_BLI_LOGGED);
 
+	if (bip->bli_flags & XFS_BLI_ORDERED) {
+		/*
+		 * The buffer has been logged just to order it.
+		 * It is not being included in the transaction
+		 * commit, so no vectors are used at all.
+		 */
+		trace_xfs_buf_item_size_ordered(bip);
+		return XFS_LOG_VEC_ORDERED;
+	}
+
 	/*
 	 * the vector count is based on the number of buffer vectors we have
 	 * dirty bits in. This will only be greater than one when we have a
@@ -212,6 +222,7 @@ xfs_buf_item_format_segment(
 		goto out;
 	}
 
+
 	/*
 	 * Fill in an iovec for each set of contiguous chunks.
 	 */
@@ -311,6 +322,16 @@ xfs_buf_item_format(
 		bip->bli_flags &= ~XFS_BLI_INODE_BUF;
 	}
 
+	if ((bip->bli_flags & (XFS_BLI_ORDERED|XFS_BLI_STALE)) ==
+							XFS_BLI_ORDERED) {
+		/*
+		 * The buffer has been logged just to order it.  It is not being
+		 * included in the transaction commit, so don't format it.
+		 */
+		trace_xfs_buf_item_format_ordered(bip);
+		return;
+	}
+
 	for (i = 0; i < bip->bli_format_count; i++) {
 		vecp = xfs_buf_item_format_segment(bip, vecp, offset,
 						&bip->bli_formats[i]);
@@ -340,6 +361,7 @@ xfs_buf_item_pin(
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	ASSERT((bip->bli_flags & XFS_BLI_LOGGED) ||
+	       (bip->bli_flags & XFS_BLI_ORDERED) ||
 	       (bip->bli_flags & XFS_BLI_STALE));
 
 	trace_xfs_buf_item_pin(bip);
@@ -512,8 +534,9 @@ xfs_buf_item_unlock(
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
 	struct xfs_buf		*bp = bip->bli_buf;
-	int			aborted, clean, i;
-	uint			hold;
+	bool			clean;
+	bool			aborted;
+	int			flags;
 
 	/* Clear the buffer's association with this transaction. */
 	bp->b_transp = NULL;
@@ -524,23 +547,21 @@ xfs_buf_item_unlock(
 	 * (cancelled) buffers at unpin time, but we'll never go through the
 	 * pin/unpin cycle if we abort inside commit.
 	 */
-	aborted = (lip->li_flags & XFS_LI_ABORTED) != 0;
-
+	aborted = (lip->li_flags & XFS_LI_ABORTED) ? true : false;
 	/*
-	 * Before possibly freeing the buf item, determine if we should
-	 * release the buffer at the end of this routine.
+	 * Before possibly freeing the buf item, copy the per-transaction state
+	 * so we can reference it safely later after clearing it from the
+	 * buffer log item.
 	 */
-	hold = bip->bli_flags & XFS_BLI_HOLD;
-
-	/* Clear the per transaction state. */
-	bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD);
+	flags = bip->bli_flags;
+	bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED);
 
 	/*
 	 * If the buf item is marked stale, then don't do anything.  We'll
 	 * unlock the buffer and free the buf item when the buffer is unpinned
 	 * for the last time.
 	 */
-	if (bip->bli_flags & XFS_BLI_STALE) {
+	if (flags & XFS_BLI_STALE) {
 		trace_xfs_buf_item_unlock_stale(bip);
 		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
 		if (!aborted) {
@@ -557,13 +578,19 @@ xfs_buf_item_unlock(
 	 * be the only reference to the buf item, so we free it anyway
 	 * regardless of whether it is dirty or not. A dirty abort implies a
 	 * shutdown, anyway.
+	 *
+	 * Ordered buffers are dirty but may have no recorded changes, so ensure
+	 * we only release clean items here.
 	 */
-	clean = 1;
-	for (i = 0; i < bip->bli_format_count; i++) {
-		if (!xfs_bitmap_empty(bip->bli_formats[i].blf_data_map,
-			     bip->bli_formats[i].blf_map_size)) {
-			clean = 0;
-			break;
+	clean = (flags & XFS_BLI_DIRTY) ? false : true;
+	if (clean) {
+		int i;
+		for (i = 0; i < bip->bli_format_count; i++) {
+			if (!xfs_bitmap_empty(bip->bli_formats[i].blf_data_map,
+				     bip->bli_formats[i].blf_map_size)) {
+				clean = false;
+				break;
+			}
 		}
 	}
 	if (clean)
@@ -576,7 +603,7 @@ xfs_buf_item_unlock(
 	} else
 		atomic_dec(&bip->bli_refcount);
 
-	if (!hold)
+	if (!(flags & XFS_BLI_HOLD))
 		xfs_buf_relse(bp);
 }
 
@@ -842,12 +869,6 @@ xfs_buf_item_log(
 	struct xfs_buf		*bp = bip->bli_buf;
 
 	/*
-	 * Mark the item as having some dirty data for
-	 * quick reference in xfs_buf_item_dirty.
-	 */
-	bip->bli_flags |= XFS_BLI_DIRTY;
-
-	/*
 	 * walk each buffer segment and mark them dirty appropriately.
 	 */
 	start = 0;
@@ -873,7 +894,7 @@ xfs_buf_item_log(
 
 
 /*
- * Return 1 if the buffer has some data that has been logged (at any
+ * Return 1 if the buffer has been logged or ordered in a transaction (at any
  * point, not just the current transaction) and 0 if not.
  */
 uint
@@ -907,11 +928,11 @@ void
 xfs_buf_item_relse(
 	xfs_buf_t	*bp)
 {
-	xfs_buf_log_item_t	*bip;
+	xfs_buf_log_item_t	*bip = bp->b_fspriv;
 
 	trace_xfs_buf_item_relse(bp, _RET_IP_);
+	ASSERT(!(bip->bli_item.li_flags & XFS_LI_IN_AIL));
 
-	bip = bp->b_fspriv;
 	bp->b_fspriv = bip->bli_item.li_bio_list;
 	if (bp->b_fspriv == NULL)
 		bp->b_iodone = NULL;
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 2573d2a..0f1c247 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -120,6 +120,7 @@ xfs_blft_from_flags(struct xfs_buf_log_format *blf)
 #define	XFS_BLI_INODE_ALLOC_BUF	0x10
 #define XFS_BLI_STALE_INODE	0x20
 #define	XFS_BLI_INODE_BUF	0x40
+#define	XFS_BLI_ORDERED		0x80
 
 #define XFS_BLI_FLAGS \
 	{ XFS_BLI_HOLD,		"HOLD" }, \
@@ -128,7 +129,8 @@ xfs_blft_from_flags(struct xfs_buf_log_format *blf)
 	{ XFS_BLI_LOGGED,	"LOGGED" }, \
 	{ XFS_BLI_INODE_ALLOC_BUF, "INODE_ALLOC" }, \
 	{ XFS_BLI_STALE_INODE,	"STALE_INODE" }, \
-	{ XFS_BLI_INODE_BUF,	"INODE_BUF" }
+	{ XFS_BLI_INODE_BUF,	"INODE_BUF" }, \
+	{ XFS_BLI_ORDERED,	"ORDERED" }
 
 
 #ifdef __KERNEL__
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index e318672..ee8b3a3 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -486,9 +486,12 @@ DEFINE_EVENT(xfs_buf_item_class, name, \
 	TP_PROTO(struct xfs_buf_log_item *bip), \
 	TP_ARGS(bip))
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_size);
+DEFINE_BUF_ITEM_EVENT(xfs_buf_item_size_ordered);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_size_stale);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_format);
+DEFINE_BUF_ITEM_EVENT(xfs_buf_item_format_ordered);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_format_stale);
+DEFINE_BUF_ITEM_EVENT(xfs_buf_item_ordered);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_pin);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unpin);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unpin_stale);
@@ -508,6 +511,7 @@ DEFINE_BUF_ITEM_EVENT(xfs_trans_bjoin);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_bhold);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_bhold_release);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_binval);
+DEFINE_BUF_ITEM_EVENT(xfs_trans_buf_ordered);
 
 DECLARE_EVENT_CLASS(xfs_lock_class,
 	TP_PROTO(struct xfs_inode *ip, unsigned lock_flags,
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 6d52656..822570e 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -498,6 +498,7 @@ void		xfs_trans_bhold_release(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_binval(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_inode_buf(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_stale_inode_buf(xfs_trans_t *, struct xfs_buf *);
+void		xfs_trans_ordered_buf(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_dquot_buf(xfs_trans_t *, struct xfs_buf *, uint);
 void		xfs_trans_inode_alloc_buf(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_ichgtime(struct xfs_trans *, struct xfs_inode *, int);
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 73a5fa4..aa5a04b 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -397,7 +397,6 @@ shutdown_abort:
 	return XFS_ERROR(EIO);
 }
 
-
 /*
  * Release the buffer bp which was previously acquired with one of the
  * xfs_trans_... buffer allocation routines if the buffer has not
@@ -603,8 +602,14 @@ xfs_trans_log_buf(xfs_trans_t	*tp,
 
 	tp->t_flags |= XFS_TRANS_DIRTY;
 	bip->bli_item.li_desc->lid_flags |= XFS_LID_DIRTY;
-	bip->bli_flags |= XFS_BLI_LOGGED;
-	xfs_buf_item_log(bip, first, last);
+
+	/*
+	 * If we have an ordered buffer we are not logging any dirty range but
+	 * it still needs to be marked dirty and that it has been logged.
+	 */
+	bip->bli_flags |= XFS_BLI_DIRTY | XFS_BLI_LOGGED;
+	if (!(bip->bli_flags & XFS_BLI_ORDERED))
+		xfs_buf_item_log(bip, first, last);
 }
 
 
@@ -757,6 +762,29 @@ xfs_trans_inode_alloc_buf(
 }
 
 /*
+ * Mark the buffer as ordered for this transaction. This means
+ * that the contents of the buffer are not recorded in the transaction
+ * but it is tracked in the AIL as though it was. This allows us
+ * to record logical changes in transactions rather than the physical
+ * changes we make to the buffer without changing writeback ordering
+ * constraints of metadata buffers.
+ */
+void
+xfs_trans_ordered_buf(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf_log_item	*bip = bp->b_fspriv;
+
+	ASSERT(bp->b_transp == tp);
+	ASSERT(bip != NULL);
+	ASSERT(atomic_read(&bip->bli_refcount) > 0);
+
+	bip->bli_flags |= XFS_BLI_ORDERED;
+	trace_xfs_buf_item_ordered(bip);
+}
+
+/*
  * Set the type of the buffer for log recovery so that it can correctly identify
  * and hence attach the correct buffer ops to the buffer after replay.
  */
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 09/15] xfs: Inode create log items
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (7 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 08/15] xfs: Introduce an ordered buffer item Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 10/15] xfs: Inode create transaction reservations Dave Chinner
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

Introduce the inode create log item type for logical inode create logging.
Instead of logging the changes in buffers, pass the range to be
initialised through the log by a new transaction type.  This reduces
the amount of log space required to record initialisation during
allocation from about 128 bytes per inode to a small fixed amount
per inode extent to be initialised.

This requires a new log item type to track it through the log
and the AIL. This is a relatively simple item - most callbacks are
noops as this item has the same life cycle as the transaction.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/Makefile           |    1 +
 fs/xfs/xfs_icreate_item.c |  195 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_icreate_item.h |   52 ++++++++++++
 fs/xfs/xfs_log.h          |    3 +-
 fs/xfs/xfs_super.c        |    8 ++
 fs/xfs/xfs_trans.h        |    4 +-
 6 files changed, 261 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/xfs_icreate_item.c
 create mode 100644 fs/xfs/xfs_icreate_item.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 6313b69..4a45080 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -71,6 +71,7 @@ xfs-y				+= xfs_alloc.o \
 				   xfs_dir2_sf.o \
 				   xfs_ialloc.o \
 				   xfs_ialloc_btree.o \
+				   xfs_icreate_item.o \
 				   xfs_inode.o \
 				   xfs_log_recover.o \
 				   xfs_mount.o \
diff --git a/fs/xfs/xfs_icreate_item.c b/fs/xfs/xfs_icreate_item.c
new file mode 100644
index 0000000..7716a4e
--- /dev/null
+++ b/fs/xfs/xfs_icreate_item.c
@@ -0,0 +1,195 @@
+/*
+ * Copyright (c) 2008-2010, 2013 Dave Chinner
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_types.h"
+#include "xfs_bit.h"
+#include "xfs_log.h"
+#include "xfs_inum.h"
+#include "xfs_trans.h"
+#include "xfs_buf_item.h"
+#include "xfs_sb.h"
+#include "xfs_ag.h"
+#include "xfs_dir2.h"
+#include "xfs_mount.h"
+#include "xfs_trans_priv.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_attr_sf.h"
+#include "xfs_dinode.h"
+#include "xfs_inode.h"
+#include "xfs_inode_item.h"
+#include "xfs_btree.h"
+#include "xfs_ialloc.h"
+#include "xfs_error.h"
+#include "xfs_icreate_item.h"
+
+kmem_zone_t	*xfs_icreate_zone;		/* inode create item zone */
+
+static inline struct xfs_icreate_item *ICR_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_icreate_item, ic_item);
+}
+
+/*
+ * This returns the number of iovecs needed to log the given inode item.
+ *
+ * We only need one iovec for the icreate log structure.
+ */
+STATIC uint
+xfs_icreate_item_size(
+	struct xfs_log_item	*lip)
+{
+	return 1;
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given inode create log item.
+ */
+STATIC void
+xfs_icreate_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_iovec	*log_vector)
+{
+	struct xfs_icreate_item	*icp = ICR_ITEM(lip);
+
+	log_vector->i_addr = (xfs_caddr_t)&icp->ic_format;
+	log_vector->i_len  = sizeof(struct xfs_icreate_log);
+	log_vector->i_type = XLOG_REG_TYPE_ICREATE;
+}
+
+
+/* Pinning has no meaning for the create item, so just return. */
+STATIC void
+xfs_icreate_item_pin(
+	struct xfs_log_item	*lip)
+{
+}
+
+
+/* pinning has no meaning for the create item, so just return. */
+STATIC void
+xfs_icreate_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+}
+
+STATIC void
+xfs_icreate_item_unlock(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_icreate_item	*icp = ICR_ITEM(lip);
+
+	if (icp->ic_item.li_flags & XFS_LI_ABORTED)
+		kmem_zone_free(xfs_icreate_zone, icp);
+	return;
+}
+
+/*
+ * Because we have ordered buffers being tracked in the AIL for the inode
+ * creation, we don't need the create item after this. Hence we can free
+ * the log item and return -1 to tell the caller we're done with the item.
+ */
+STATIC xfs_lsn_t
+xfs_icreate_item_committed(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+	struct xfs_icreate_item	*icp = ICR_ITEM(lip);
+
+	kmem_zone_free(xfs_icreate_zone, icp);
+	return (xfs_lsn_t)-1;
+}
+
+/* item can never get into the AIL */
+STATIC uint
+xfs_icreate_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	ASSERT(0);
+	return XFS_ITEM_SUCCESS;
+}
+
+/* Ordered buffers do the dependency tracking here, so this does nothing. */
+STATIC void
+xfs_icreate_item_committing(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+}
+
+/*
+ * This is the ops vector shared by all buf log items.
+ */
+static struct xfs_item_ops xfs_icreate_item_ops = {
+	.iop_size	= xfs_icreate_item_size,
+	.iop_format	= xfs_icreate_item_format,
+	.iop_pin	= xfs_icreate_item_pin,
+	.iop_unpin	= xfs_icreate_item_unpin,
+	.iop_push	= xfs_icreate_item_push,
+	.iop_unlock	= xfs_icreate_item_unlock,
+	.iop_committed	= xfs_icreate_item_committed,
+	.iop_committing = xfs_icreate_item_committing,
+};
+
+
+/*
+ * Initialize the inode log item for a newly allocated (in-core) inode.
+ *
+ * Inode extents can only reside within an AG. Hence specify the starting
+ * block for the inode chunk by offset within an AG as well as the
+ * length of the allocated extent.
+ *
+ * This joins the item to the transaction and marks it dirty so
+ * that we don't need a separate call to do this, nor does the
+ * caller need to know anything about the icreate item.
+ */
+void
+xfs_icreate_log(
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		agbno,
+	unsigned int		count,
+	unsigned int		inode_size,
+	xfs_agblock_t		length,
+	unsigned int		generation)
+{
+	struct xfs_icreate_item	*icp;
+
+	icp = kmem_zone_zalloc(xfs_icreate_zone, KM_SLEEP);
+
+	xfs_log_item_init(tp->t_mountp, &icp->ic_item, XFS_LI_ICREATE,
+			  &xfs_icreate_item_ops);
+
+	icp->ic_format.icl_type = XFS_LI_ICREATE;
+	icp->ic_format.icl_size = 1;	/* single vector */
+	icp->ic_format.icl_ag = cpu_to_be32(agno);
+	icp->ic_format.icl_agbno = cpu_to_be32(agbno);
+	icp->ic_format.icl_count = cpu_to_be32(count);
+	icp->ic_format.icl_isize = cpu_to_be32(inode_size);
+	icp->ic_format.icl_length = cpu_to_be32(length);
+	icp->ic_format.icl_gen = cpu_to_be32(generation);
+
+	xfs_trans_add_item(tp, &icp->ic_item);
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	icp->ic_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+}
diff --git a/fs/xfs/xfs_icreate_item.h b/fs/xfs/xfs_icreate_item.h
new file mode 100644
index 0000000..88ba8aa
--- /dev/null
+++ b/fs/xfs/xfs_icreate_item.h
@@ -0,0 +1,52 @@
+/*
+ * Copyright (c) 2008-2010, Dave Chinner
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#ifndef XFS_ICREATE_ITEM_H
+#define XFS_ICREATE_ITEM_H	1
+
+/*
+ * on disk log item structure
+ *
+ * Log recovery assumes the first two entries are the type and size and they fit
+ * in 32 bits. Also in host order (ugh) so they have to be 32 bit aligned so
+ * decoding can be done correctly.
+ */
+struct xfs_icreate_log {
+	__uint16_t	icl_type;	/* type of log format structure */
+	__uint16_t	icl_size;	/* size of log format structure */
+	__be32		icl_ag;		/* ag being allocated in */
+	__be32		icl_agbno;	/* start block of inode range */
+	__be32		icl_count;	/* number of inodes to initialise */
+	__be32		icl_isize;	/* size of inodes */
+	__be32		icl_length;	/* length of extent to initialise */
+	__be32		icl_gen;	/* inode generation number to use */
+};
+
+/* in memory log item structure */
+struct xfs_icreate_item {
+	struct xfs_log_item	ic_item;
+	struct xfs_icreate_log	ic_format;
+};
+
+extern kmem_zone_t *xfs_icreate_zone;	/* inode create item zone */
+
+void xfs_icreate_log(struct xfs_trans *tp, xfs_agnumber_t agno,
+			xfs_agblock_t agbno, unsigned int count,
+			unsigned int inode_size, xfs_agblock_t length,
+			unsigned int generation);
+
+#endif	/* XFS_ICREATE_ITEM_H */
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index b20918c..fb630e4 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -88,7 +88,8 @@ static inline xfs_lsn_t	_lsn_cmp(xfs_lsn_t lsn1, xfs_lsn_t lsn2)
 #define XLOG_REG_TYPE_UNMOUNT		17
 #define XLOG_REG_TYPE_COMMIT		18
 #define XLOG_REG_TYPE_TRANSHDR		19
-#define XLOG_REG_TYPE_MAX		19
+#define XLOG_REG_TYPE_ICREATE		20
+#define XLOG_REG_TYPE_MAX		20
 
 typedef struct xfs_log_iovec {
 	void		*i_addr;	/* beginning address of region */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 3033ba5..f59e27f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -51,6 +51,7 @@
 #include "xfs_inode_item.h"
 #include "xfs_icache.h"
 #include "xfs_trace.h"
+#include "xfs_icreate_item.h"
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -1655,9 +1656,15 @@ xfs_init_zones(void)
 					KM_ZONE_SPREAD, NULL);
 	if (!xfs_ili_zone)
 		goto out_destroy_inode_zone;
+	xfs_icreate_zone = kmem_zone_init(sizeof(struct xfs_icreate_item),
+					"xfs_icr");
+	if (!xfs_icreate_zone)
+		goto out_destroy_ili_zone;
 
 	return 0;
 
+ out_destroy_ili_zone:
+	kmem_zone_destroy(xfs_ili_zone);
  out_destroy_inode_zone:
 	kmem_zone_destroy(xfs_inode_zone);
  out_destroy_efi_zone:
@@ -1696,6 +1703,7 @@ xfs_destroy_zones(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_zone_destroy(xfs_icreate_zone);
 	kmem_zone_destroy(xfs_ili_zone);
 	kmem_zone_destroy(xfs_inode_zone);
 	kmem_zone_destroy(xfs_efi_zone);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 822570e..2b49463 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -48,6 +48,7 @@ typedef struct xfs_trans_header {
 #define	XFS_LI_BUF		0x123c	/* v2 bufs, variable sized inode bufs */
 #define	XFS_LI_DQUOT		0x123d
 #define	XFS_LI_QUOTAOFF		0x123e
+#define	XFS_LI_ICREATE		0x123f
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -107,7 +108,8 @@ typedef struct xfs_trans_header {
 #define	XFS_TRANS_SWAPEXT		40
 #define	XFS_TRANS_SB_COUNT		41
 #define	XFS_TRANS_CHECKPOINT		42
-#define	XFS_TRANS_TYPE_MAX		42
+#define	XFS_TRANS_ICREATE		43
+#define	XFS_TRANS_TYPE_MAX		43
 /* new transaction types need to be reflected in xfs_logprint(8) */
 
 #define XFS_TRANS_TYPES \
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 10/15] xfs: Inode create transaction reservations
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (8 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 09/15] xfs: Inode create log items Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 11/15] xfs: Inode create item recovery Dave Chinner
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

Define the log and space transaction sizes. Factor the current
create log reservation macro into the two logical halves and reuse
one half for the new icreate transactions. The icreate transaction
is transparent to all the high level create code - the
pre-calculated reservations will correctly set the reservations
dependent on whether the filesystem supports the icreate
transaction.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/xfs_trans.c |  118 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 77 insertions(+), 41 deletions(-)

diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 2fd7c1f..35a2299 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -234,71 +234,93 @@ xfs_calc_remove_reservation(
 }
 
 /*
- * For symlink we can modify:
+ * For create, break it in to the two cases that the transaction
+ * covers. We start with the modify case - allocation done by modification
+ * of the state of existing inodes - and the allocation case.
+ */
+
+/*
+ * For create we can modify:
  *    the parent directory inode: inode size
  *    the new inode: inode size
- *    the inode btree entry: 1 block
+ *    the inode btree entry: block size
+ *    the superblock for the nlink flag: sector size
  *    the directory btree: (max depth + v2) * dir block size
  *    the directory inode's bmap btree: (max depth + v2) * block size
- *    the blocks for the symlink: 1 kB
- * Or in the first xact we allocate some inodes giving:
+ */
+STATIC uint
+xfs_calc_create_resv_modify(
+	struct xfs_mount	*mp)
+{
+	return xfs_calc_buf_res(2, mp->m_sb.sb_inodesize) +
+		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+		(uint)XFS_FSB_TO_B(mp, 1) +
+		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
+}
+
+/*
+ * For create we can allocate some inodes giving:
  *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
+ *    the superblock for the nlink flag: sector size
  *    the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
  *    the inode btree: max depth * blocksize
- *    the allocation btrees: 2 trees * (2 * max depth - 1) * block size
+ *    the allocation btrees: 2 trees * (max depth - 1) * block size
  */
 STATIC uint
-xfs_calc_symlink_reservation(
+xfs_calc_create_resv_alloc(
+	struct xfs_mount	*mp)
+{
+	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
+		mp->m_sb.sb_sectsize +
+		xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp), XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
+				 XFS_FSB_TO_B(mp, 1));
+}
+
+STATIC uint
+__xfs_calc_create_reservation(
 	struct xfs_mount	*mp)
 {
 	return XFS_DQUOT_LOGRES(mp) +
-		MAX((xfs_calc_buf_res(2, mp->m_sb.sb_inodesize) +
-		     xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
-		     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp),
-				      XFS_FSB_TO_B(mp, 1)) +
-		     xfs_calc_buf_res(1, 1024)),
-		    (xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp),
-				      XFS_FSB_TO_B(mp, 1)) +
-		     xfs_calc_buf_res(mp->m_in_maxlevels,
-				      XFS_FSB_TO_B(mp, 1)) +
-		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
-				      XFS_FSB_TO_B(mp, 1))));
+		MAX(xfs_calc_create_resv_alloc(mp),
+		    xfs_calc_create_resv_modify(mp));
 }
 
 /*
- * For create we can modify:
- *    the parent directory inode: inode size
- *    the new inode: inode size
- *    the inode btree entry: block size
- *    the superblock for the nlink flag: sector size
- *    the directory btree: (max depth + v2) * dir block size
- *    the directory inode's bmap btree: (max depth + v2) * block size
- * Or in the first xact we allocate some inodes giving:
+ * For icreate we can allocate some inodes giving:
  *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
  *    the superblock for the nlink flag: sector size
- *    the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
  *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
  */
 STATIC uint
-xfs_calc_create_reservation(
+xfs_calc_icreate_resv_alloc(
 	struct xfs_mount	*mp)
 {
+	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
+		mp->m_sb.sb_sectsize +
+		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
+				 XFS_FSB_TO_B(mp, 1));
+}
+
+STATIC uint
+xfs_calc_icreate_reservation(xfs_mount_t *mp)
+{
 	return XFS_DQUOT_LOGRES(mp) +
-		MAX((xfs_calc_buf_res(2, mp->m_sb.sb_inodesize) +
-		     xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-		     (uint)XFS_FSB_TO_B(mp, 1) +
-		     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp),
-				      XFS_FSB_TO_B(mp, 1))),
-		    (xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
-		     mp->m_sb.sb_sectsize +
-		     xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp),
-				      XFS_FSB_TO_B(mp, 1)) +
-		     xfs_calc_buf_res(mp->m_in_maxlevels,
-				      XFS_FSB_TO_B(mp, 1)) +
-		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
-				      XFS_FSB_TO_B(mp, 1))));
+		MAX(xfs_calc_icreate_resv_alloc(mp),
+		    xfs_calc_create_resv_modify(mp));
+}
+
+STATIC uint
+xfs_calc_create_reservation(
+	struct xfs_mount	*mp)
+{
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		return xfs_calc_icreate_reservation(mp);
+	return __xfs_calc_create_reservation(mp);
+
 }
 
 /*
@@ -311,6 +333,20 @@ xfs_calc_mkdir_reservation(
 	return xfs_calc_create_reservation(mp);
 }
 
+
+/*
+ * Making a new symplink is the same as creating a new file, but
+ * with the added blocks for remote symlink data which can be up to 1kB in
+ * length (MAXPATHLEN).
+ */
+STATIC uint
+xfs_calc_symlink_reservation(
+	struct xfs_mount	*mp)
+{
+	return xfs_calc_create_reservation(mp) +
+	       xfs_calc_buf_res(1, MAXPATHLEN);
+}
+
 /*
  * In freeing an inode we can modify:
  *    the inode being freed: inode size
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 11/15] xfs: Inode create item recovery
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (9 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 10/15] xfs: Inode create transaction reservations Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 12/15] xfs: Use inode create transaction Dave Chinner
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

When we find a icreate transaction, we need to get and initialise
the buffers in the range that has been passed. Extract and verify
the information in the item record, then loop over the range
initialising and issuing the buffer writes delayed.

Support an arbitrary size range to initialise so that in
future when we allocate inodes in much larger chunks all kernels
that understand this transaction can still recover them.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/xfs_ialloc.c      |   37 +++++++++++----
 fs/xfs/xfs_ialloc.h      |    8 ++++
 fs/xfs/xfs_log_recover.c |  114 ++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 145 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index c8f5ae1..d5ef81a 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -150,12 +150,16 @@ xfs_check_agi_freecount(
 #endif
 
 /*
- * Initialise a new set of inodes.
+ * Initialise a new set of inodes. When called without a transaction context
+ * (e.g. from recovery) we initiate a delayed write of the inode buffers rather
+ * than logging them (which in a transaction context puts them into the AIL
+ * for writeback rather than the xfsbufd queue).
  */
 STATIC int
 xfs_ialloc_inode_init(
 	struct xfs_mount	*mp,
 	struct xfs_trans	*tp,
+	struct list_head	*buffer_list,
 	xfs_agnumber_t		agno,
 	xfs_agblock_t		agbno,
 	xfs_agblock_t		length,
@@ -247,18 +251,33 @@ xfs_ialloc_inode_init(
 				ino++;
 				uuid_copy(&free->di_uuid, &mp->m_sb.sb_uuid);
 				xfs_dinode_calc_crc(mp, free);
-			} else {
+			} else if (tp) {
 				/* just log the inode core */
 				xfs_trans_log_buf(tp, fbuf, ioffset,
 						  ioffset + isize - 1);
 			}
 		}
-		if (version == 3) {
-			/* need to log the entire buffer */
-			xfs_trans_log_buf(tp, fbuf, 0,
-					  BBTOB(fbuf->b_length) - 1);
+
+		if (tp) {
+			/*
+			 * Mark the buffer as an inode allocation buffer so it
+			 * sticks in AIL at the point of this allocation
+			 * transaction. This ensures the they are on disk before
+			 * the tail of the log can be moved past this
+			 * transaction (i.e. by preventing relogging from moving
+			 * it forward in the log).
+			 */
+			xfs_trans_inode_alloc_buf(tp, fbuf);
+			if (version == 3) {
+				/* need to log the entire buffer */
+				xfs_trans_log_buf(tp, fbuf, 0,
+						  BBTOB(fbuf->b_length) - 1);
+			}
+		} else {
+			fbuf->b_flags |= XBF_DONE;
+			xfs_buf_delwri_queue(fbuf, buffer_list);
+			xfs_buf_relse(fbuf);
 		}
-		xfs_trans_inode_alloc_buf(tp, fbuf);
 	}
 	return 0;
 }
@@ -303,7 +322,7 @@ xfs_ialloc_ag_alloc(
 	 * First try to allocate inodes contiguous with the last-allocated
 	 * chunk of inodes.  If the filesystem is striped, this will fill
 	 * an entire stripe unit with inodes.
- 	 */
+	 */
 	agi = XFS_BUF_TO_AGI(agbp);
 	newino = be32_to_cpu(agi->agi_newino);
 	agno = be32_to_cpu(agi->agi_seqno);
@@ -402,7 +421,7 @@ xfs_ialloc_ag_alloc(
 	 * rather than a linear progression to prevent the next generation
 	 * number from being easily guessable.
 	 */
-	error = xfs_ialloc_inode_init(args.mp, tp, agno, args.agbno,
+	error = xfs_ialloc_inode_init(args.mp, tp, NULL, agno, args.agbno,
 			args.len, prandom_u32());
 
 	if (error)
diff --git a/fs/xfs/xfs_ialloc.h b/fs/xfs/xfs_ialloc.h
index c8da3df..68c0732 100644
--- a/fs/xfs/xfs_ialloc.h
+++ b/fs/xfs/xfs_ialloc.h
@@ -150,6 +150,14 @@ int xfs_inobt_lookup(struct xfs_btree_cur *cur, xfs_agino_t ino,
 int xfs_inobt_get_rec(struct xfs_btree_cur *cur,
 		xfs_inobt_rec_incore_t *rec, int *stat);
 
+/*
+ * Inode chunk initialisation routine
+ */
+int xfs_ialloc_inode_init(struct xfs_mount *mp, struct xfs_trans *tp,
+			  struct list_head *buffer_list,
+			  xfs_agnumber_t agno, xfs_agblock_t agbno,
+			  xfs_agblock_t length, unsigned int gen);
+
 extern const struct xfs_buf_ops xfs_agi_buf_ops;
 
 #endif	/* __XFS_IALLOC_H__ */
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 7cf5e4e..6fcc910a 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -45,6 +45,7 @@
 #include "xfs_cksum.h"
 #include "xfs_trace.h"
 #include "xfs_icache.h"
+#include "xfs_icreate_item.h"
 
 /* Need all the magic numbers and buffer ops structures from these headers */
 #include "xfs_symlink.h"
@@ -1617,7 +1618,10 @@ xlog_recover_add_to_trans(
  *	   form the cancelled buffer table. Hence they have tobe done last.
  *
  *	3. Inode allocation buffers must be replayed before inode items that
- *	   read the buffer and replay changes into it.
+ *	   read the buffer and replay changes into it. For filesystems using the
+ *	   ICREATE transactions, this means XFS_LI_ICREATE objects need to get
+ *	   treated the same as inode allocation buffers as they create and
+ *	   initialise the buffers directly.
  *
  *	4. Inode unlink buffers must be replayed after inode items are replayed.
  *	   This ensures that inodes are completely flushed to the inode buffer
@@ -1632,10 +1636,17 @@ xlog_recover_add_to_trans(
  * from all the other buffers and move them to last.
  *
  * Hence, 4 lists, in order from head to tail:
- * 	- buffer_list for all buffers except cancelled/inode unlink buffers
- * 	- item_list for all non-buffer items
- * 	- inode_buffer_list for inode unlink buffers
- * 	- cancel_list for the cancelled buffers
+ *	- buffer_list for all buffers except cancelled/inode unlink buffers
+ *	- item_list for all non-buffer items
+ *	- inode_buffer_list for inode unlink buffers
+ *	- cancel_list for the cancelled buffers
+ *
+ * Note that we add objects to the tail of the lists so that first-to-last
+ * ordering is preserved within the lists. Adding objects to the head of the
+ * list means when we traverse from the head we walk them in last-to-first
+ * order. For cancelled buffers and inode unlink buffers this doesn't matter,
+ * but for all other items there may be specific ordering that we need to
+ * preserve.
  */
 STATIC int
 xlog_recover_reorder_trans(
@@ -1655,6 +1666,9 @@ xlog_recover_reorder_trans(
 		xfs_buf_log_format_t	*buf_f = item->ri_buf[0].i_addr;
 
 		switch (ITEM_TYPE(item)) {
+		case XFS_LI_ICREATE:
+			list_move_tail(&item->ri_list, &buffer_list);
+			break;
 		case XFS_LI_BUF:
 			if (buf_f->blf_flags & XFS_BLF_CANCEL) {
 				trace_xfs_log_recover_item_reorder_head(log,
@@ -2982,6 +2996,93 @@ xlog_recover_efd_pass2(
 }
 
 /*
+ * This routine is called when an inode create format structure is found in a
+ * committed transaction in the log.  It's purpose is to initialise the inodes
+ * being allocated on disk. This requires us to get inode cluster buffers that
+ * match the range to be intialised, stamped with inode templates and written
+ * by delayed write so that subsequent modifications will hit the cached buffer
+ * and only need writing out at the end of recovery.
+ */
+STATIC int
+xlog_recover_do_icreate_pass2(
+	struct xlog		*log,
+	struct list_head	*buffer_list,
+	xlog_recover_item_t	*item)
+{
+	struct xfs_mount	*mp = log->l_mp;
+	struct xfs_icreate_log	*icl;
+	xfs_agnumber_t		agno;
+	xfs_agblock_t		agbno;
+	unsigned int		count;
+	unsigned int		isize;
+	xfs_agblock_t		length;
+
+	icl = (struct xfs_icreate_log *)item->ri_buf[0].i_addr;
+	if (icl->icl_type != XFS_LI_ICREATE) {
+		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad type");
+		return EINVAL;
+	}
+
+	if (icl->icl_size != 1) {
+		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad icl size");
+		return EINVAL;
+	}
+
+	agno = be32_to_cpu(icl->icl_ag);
+	if (agno >= mp->m_sb.sb_agcount) {
+		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad agno");
+		return EINVAL;
+	}
+	agbno = be32_to_cpu(icl->icl_agbno);
+	if (!agbno || agbno == NULLAGBLOCK || agbno >= mp->m_sb.sb_agblocks) {
+		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad agbno");
+		return EINVAL;
+	}
+	isize = be32_to_cpu(icl->icl_isize);
+	if (isize != mp->m_sb.sb_inodesize) {
+		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad isize");
+		return EINVAL;
+	}
+	count = be32_to_cpu(icl->icl_count);
+	if (!count) {
+		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad count");
+		return EINVAL;
+	}
+	length = be32_to_cpu(icl->icl_length);
+	if (!length || length >= mp->m_sb.sb_agblocks) {
+		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad length");
+		return EINVAL;
+	}
+
+	/* existing allocation is fixed value */
+	ASSERT(count == XFS_IALLOC_INODES(mp));
+	ASSERT(length == XFS_IALLOC_BLOCKS(mp));
+	if (count != XFS_IALLOC_INODES(mp) ||
+	     length != XFS_IALLOC_BLOCKS(mp)) {
+		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad count 2");
+		return EINVAL;
+	}
+
+	/*
+	 * Inode buffers can be freed. Do not replay the inode initialisation as
+	 * we could be overwriting something written after this inode buffer was
+	 * cancelled.
+	 *
+	 * XXX: we need to iterate all buffers and only init those that are not
+	 * cancelled. I think that a more fine grained factoring of
+	 * xfs_ialloc_inode_init may be appropriate here to enable this to be
+	 * done easily.
+	 */
+	if (xlog_check_buffer_cancelled(log,
+			XFS_AGB_TO_DADDR(mp, agno, agbno), length, 0))
+		return 0;
+
+	xfs_ialloc_inode_init(mp, NULL, buffer_list, agno, agbno, length,
+					be32_to_cpu(icl->icl_gen));
+	return 0;
+}
+
+/*
  * Free up any resources allocated by the transaction
  *
  * Remember that EFIs, EFDs, and IUNLINKs are handled later.
@@ -3023,6 +3124,7 @@ xlog_recover_commit_pass1(
 	case XFS_LI_EFI:
 	case XFS_LI_EFD:
 	case XFS_LI_DQUOT:
+	case XFS_LI_ICREATE:
 		/* nothing to do in pass 1 */
 		return 0;
 	default:
@@ -3053,6 +3155,8 @@ xlog_recover_commit_pass2(
 		return xlog_recover_efd_pass2(log, item);
 	case XFS_LI_DQUOT:
 		return xlog_recover_dquot_pass2(log, buffer_list, item);
+	case XFS_LI_ICREATE:
+		return xlog_recover_do_icreate_pass2(log, buffer_list, item);
 	case XFS_LI_QUOTAOFF:
 		/* nothing to do in pass2 */
 		return 0;
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 12/15] xfs: Use inode create transaction
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (10 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 11/15] xfs: Inode create item recovery Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 13/15] xfs: remove local fork format handling from xfs_bmapi_write() Dave Chinner
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

Replace the use of buffer based logging of inode initialisation,
uses the new logical form to describe the range to be initialised
in recovery. We continue to "log" the inode buffers to push them
into the AIL and ensure that the inode create transaction is not
removed from the log before the inode buffers are written to disk.

Update the transaction identifier and reservations to match the
changed implementation.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/xfs_buf_item.c |   12 ++++++++++--
 fs/xfs/xfs_ialloc.c   |   32 +++++++++++++++++++++++---------
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 61f6876..bfc4e0c 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -310,13 +310,21 @@ xfs_buf_item_format(
 
 	/*
 	 * If it is an inode buffer, transfer the in-memory state to the
-	 * format flags and clear the in-memory state. We do not transfer
+	 * format flags and clear the in-memory state.
+	 *
+	 * For buffer based inode allocation, we do not transfer
 	 * this state if the inode buffer allocation has not yet been committed
 	 * to the log as setting the XFS_BLI_INODE_BUF flag will prevent
 	 * correct replay of the inode allocation.
+	 *
+	 * For icreate item based inode allocation, the buffers aren't written
+	 * to the journal during allocation, and hence we should always tag the
+	 * buffer as an inode buffer so that the correct unlinked list replay
+	 * occurs during recovery.
 	 */
 	if (bip->bli_flags & XFS_BLI_INODE_BUF) {
-		if (!((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) &&
+		if (xfs_sb_version_hascrc(&lip->li_mountp->m_sb) ||
+		    !((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) &&
 		      xfs_log_item_in_current_chkpt(lip)))
 			bip->__bli_format.blf_flags |= XFS_BLF_INODE_BUF;
 		bip->bli_flags &= ~XFS_BLI_INODE_BUF;
diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index d5ef81a..319d9e4 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -38,6 +38,7 @@
 #include "xfs_bmap.h"
 #include "xfs_cksum.h"
 #include "xfs_buf_item.h"
+#include "xfs_icreate_item.h"
 
 
 /*
@@ -155,7 +156,7 @@ xfs_check_agi_freecount(
  * than logging them (which in a transaction context puts them into the AIL
  * for writeback rather than the xfsbufd queue).
  */
-STATIC int
+int
 xfs_ialloc_inode_init(
 	struct xfs_mount	*mp,
 	struct xfs_trans	*tp,
@@ -212,6 +213,18 @@ xfs_ialloc_inode_init(
 		version = 3;
 		ino = XFS_AGINO_TO_INO(mp, agno,
 				       XFS_OFFBNO_TO_AGINO(mp, agbno, 0));
+
+		/*
+		 * log the initialisation that is about to take place as an
+		 * logical operation. This means the transaction does not
+		 * need to log the physical changes to the inode buffers as log
+		 * recovery will know what initialisation is actually needed.
+		 * Hence we only need to log the buffers as "ordered" buffers so
+		 * they track in the AIL as if they were physically logged.
+		 */
+		if (tp)
+			xfs_icreate_log(tp, agno, agbno, XFS_IALLOC_INODES(mp),
+					mp->m_sb.sb_inodesize, length, gen);
 	} else if (xfs_sb_version_hasnlink(&mp->m_sb))
 		version = 2;
 	else
@@ -227,13 +240,8 @@ xfs_ialloc_inode_init(
 					 XBF_UNMAPPED);
 		if (!fbuf)
 			return ENOMEM;
-		/*
-		 * Initialize all inodes in this buffer and then log them.
-		 *
-		 * XXX: It would be much better if we had just one transaction
-		 *	to log a whole cluster of inodes instead of all the
-		 *	individual transactions causing a lot of log traffic.
-		 */
+
+		/* Initialize the inode buffers and log them appropriately. */
 		fbuf->b_ops = &xfs_inode_buf_ops;
 		xfs_buf_zero(fbuf, 0, BBTOB(fbuf->b_length));
 		for (i = 0; i < ninodes; i++) {
@@ -269,7 +277,13 @@ xfs_ialloc_inode_init(
 			 */
 			xfs_trans_inode_alloc_buf(tp, fbuf);
 			if (version == 3) {
-				/* need to log the entire buffer */
+				/*
+				 * Mark the buffer as ordered so that they are
+				 * not physically logged in the transaction but
+				 * still tracked in the AIL as part of the
+				 * transaction and pin the log appropriately.
+				 */
+				xfs_trans_ordered_buf(tp, fbuf);
 				xfs_trans_log_buf(tp, fbuf, 0,
 						  BBTOB(fbuf->b_length) - 1);
 			}
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 13/15] xfs: remove local fork format handling from xfs_bmapi_write()
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (11 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 12/15] xfs: Use inode create transaction Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27  6:04 ` [PATCH 14/15] xfs: dquot log reservations are too small Dave Chinner
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The conversion from local format to extent format requires
interpretation of the data in the fork being converted, so it cannot
be done in a generic way. It is up to the caller to convert the fork
format to extent format before calling into xfs_bmapi_write() so
format conversion can be done correctly.

The code in xfs_bmapi_write() to convert the format is used
implicitly by the attribute and directory code, but they
specifically zero the fork size so that the conversion does not do
any allocation or manipulation. Move this conversion into the
shortform to leaf functions for the dir/attr code so the conversions
are explicitly controlled by all callers.

Now we can remove the conversion code in xfs_bmapi_write.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_attr_leaf.c  |    2 +
 fs/xfs/xfs_bmap.c       |  199 ++++++++++++++++++++---------------------------
 fs/xfs/xfs_bmap.h       |    1 +
 fs/xfs/xfs_dir2_block.c |   20 +++--
 4 files changed, 98 insertions(+), 124 deletions(-)

diff --git a/fs/xfs/xfs_attr_leaf.c b/fs/xfs/xfs_attr_leaf.c
index 31d3cd1..b800fbc 100644
--- a/fs/xfs/xfs_attr_leaf.c
+++ b/fs/xfs/xfs_attr_leaf.c
@@ -690,6 +690,8 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 	sf = (xfs_attr_shortform_t *)tmpbuffer;
 
 	xfs_idata_realloc(dp, -size, XFS_ATTR_FORK);
+	xfs_bmap_local_to_extents_empty(dp, XFS_ATTR_FORK);
+
 	bp = NULL;
 	error = xfs_da_grow_inode(args, &blkno);
 	if (error) {
diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 8904284..05c698c 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -1161,6 +1161,24 @@ xfs_bmap_extents_to_btree(
  * since the file data needs to get logged so things will stay consistent.
  * (The bmap-level manipulations are ok, though).
  */
+void
+xfs_bmap_local_to_extents_empty(
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+
+	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_LOCAL);
+	ASSERT(ifp->if_bytes == 0);
+	ASSERT(XFS_IFORK_NEXTENTS(ip, whichfork) == 0);
+
+	xfs_bmap_forkoff_reset(ip->i_mount, ip, whichfork);
+	ifp->if_flags &= ~XFS_IFINLINE;
+	ifp->if_flags |= XFS_IFEXTENTS;
+	XFS_IFORK_FMT_SET(ip, whichfork, XFS_DINODE_FMT_EXTENTS);
+}
+
+
 STATIC int				/* error */
 xfs_bmap_local_to_extents(
 	xfs_trans_t	*tp,		/* transaction pointer */
@@ -1174,9 +1192,12 @@ xfs_bmap_local_to_extents(
 				   struct xfs_inode *ip,
 				   struct xfs_ifork *ifp))
 {
-	int		error;		/* error return value */
+	int		error = 0;
 	int		flags;		/* logging flags returned */
 	xfs_ifork_t	*ifp;		/* inode fork pointer */
+	xfs_alloc_arg_t	args;		/* allocation arguments */
+	xfs_buf_t	*bp;		/* buffer for extent block */
+	xfs_bmbt_rec_host_t *ep;	/* extent record pointer */
 
 	/*
 	 * We don't want to deal with the case of keeping inode data inline yet.
@@ -1185,68 +1206,65 @@ xfs_bmap_local_to_extents(
 	ASSERT(!(S_ISREG(ip->i_d.di_mode) && whichfork == XFS_DATA_FORK));
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_LOCAL);
+
+	if (!ifp->if_bytes) {
+		xfs_bmap_local_to_extents_empty(ip, whichfork);
+		flags = XFS_ILOG_CORE;
+		goto done;
+	}
+
 	flags = 0;
 	error = 0;
-	if (ifp->if_bytes) {
-		xfs_alloc_arg_t	args;	/* allocation arguments */
-		xfs_buf_t	*bp;	/* buffer for extent block */
-		xfs_bmbt_rec_host_t *ep;/* extent record pointer */
-
-		ASSERT((ifp->if_flags &
-			(XFS_IFINLINE|XFS_IFEXTENTS|XFS_IFEXTIREC)) == XFS_IFINLINE);
-		memset(&args, 0, sizeof(args));
-		args.tp = tp;
-		args.mp = ip->i_mount;
-		args.firstblock = *firstblock;
-		/*
-		 * Allocate a block.  We know we need only one, since the
-		 * file currently fits in an inode.
-		 */
-		if (*firstblock == NULLFSBLOCK) {
-			args.fsbno = XFS_INO_TO_FSB(args.mp, ip->i_ino);
-			args.type = XFS_ALLOCTYPE_START_BNO;
-		} else {
-			args.fsbno = *firstblock;
-			args.type = XFS_ALLOCTYPE_NEAR_BNO;
-		}
-		args.total = total;
-		args.minlen = args.maxlen = args.prod = 1;
-		error = xfs_alloc_vextent(&args);
-		if (error)
-			goto done;
-
-		/* Can't fail, the space was reserved. */
-		ASSERT(args.fsbno != NULLFSBLOCK);
-		ASSERT(args.len == 1);
-		*firstblock = args.fsbno;
-		bp = xfs_btree_get_bufl(args.mp, tp, args.fsbno, 0);
-
-		/* initialise the block and copy the data */
-		init_fn(tp, bp, ip, ifp);
-
-		/* account for the change in fork size and log everything */
-		xfs_trans_log_buf(tp, bp, 0, ifp->if_bytes - 1);
-		xfs_bmap_forkoff_reset(args.mp, ip, whichfork);
-		xfs_idata_realloc(ip, -ifp->if_bytes, whichfork);
-		xfs_iext_add(ifp, 0, 1);
-		ep = xfs_iext_get_ext(ifp, 0);
-		xfs_bmbt_set_allf(ep, 0, args.fsbno, 1, XFS_EXT_NORM);
-		trace_xfs_bmap_post_update(ip, 0,
-				whichfork == XFS_ATTR_FORK ? BMAP_ATTRFORK : 0,
-				_THIS_IP_);
-		XFS_IFORK_NEXT_SET(ip, whichfork, 1);
-		ip->i_d.di_nblocks = 1;
-		xfs_trans_mod_dquot_byino(tp, ip,
-			XFS_TRANS_DQ_BCOUNT, 1L);
-		flags |= xfs_ilog_fext(whichfork);
+	ASSERT((ifp->if_flags & (XFS_IFINLINE|XFS_IFEXTENTS|XFS_IFEXTIREC)) ==
+								XFS_IFINLINE);
+	memset(&args, 0, sizeof(args));
+	args.tp = tp;
+	args.mp = ip->i_mount;
+	args.firstblock = *firstblock;
+	/*
+	 * Allocate a block.  We know we need only one, since the
+	 * file currently fits in an inode.
+	 */
+	if (*firstblock == NULLFSBLOCK) {
+		args.fsbno = XFS_INO_TO_FSB(args.mp, ip->i_ino);
+		args.type = XFS_ALLOCTYPE_START_BNO;
 	} else {
-		ASSERT(XFS_IFORK_NEXTENTS(ip, whichfork) == 0);
-		xfs_bmap_forkoff_reset(ip->i_mount, ip, whichfork);
+		args.fsbno = *firstblock;
+		args.type = XFS_ALLOCTYPE_NEAR_BNO;
 	}
-	ifp->if_flags &= ~XFS_IFINLINE;
-	ifp->if_flags |= XFS_IFEXTENTS;
-	XFS_IFORK_FMT_SET(ip, whichfork, XFS_DINODE_FMT_EXTENTS);
+	args.total = total;
+	args.minlen = args.maxlen = args.prod = 1;
+	error = xfs_alloc_vextent(&args);
+	if (error)
+		goto done;
+
+	/* Can't fail, the space was reserved. */
+	ASSERT(args.fsbno != NULLFSBLOCK);
+	ASSERT(args.len == 1);
+	*firstblock = args.fsbno;
+	bp = xfs_btree_get_bufl(args.mp, tp, args.fsbno, 0);
+
+	/* initialise the block and copy the data */
+	init_fn(tp, bp, ip, ifp);
+
+	/* account for the change in fork size and log everything */
+	xfs_trans_log_buf(tp, bp, 0, ifp->if_bytes - 1);
+	xfs_idata_realloc(ip, -ifp->if_bytes, whichfork);
+	xfs_bmap_local_to_extents_empty(ip, whichfork);
 	flags |= XFS_ILOG_CORE;
+
+	xfs_iext_add(ifp, 0, 1);
+	ep = xfs_iext_get_ext(ifp, 0);
+	xfs_bmbt_set_allf(ep, 0, args.fsbno, 1, XFS_EXT_NORM);
+	trace_xfs_bmap_post_update(ip, 0,
+			whichfork == XFS_ATTR_FORK ? BMAP_ATTRFORK : 0,
+			_THIS_IP_);
+	XFS_IFORK_NEXT_SET(ip, whichfork, 1);
+	ip->i_d.di_nblocks = 1;
+	xfs_trans_mod_dquot_byino(tp, ip,
+		XFS_TRANS_DQ_BCOUNT, 1L);
+	flags |= xfs_ilog_fext(whichfork);
+
 done:
 	*logflagsp = flags;
 	return error;
@@ -1323,25 +1341,6 @@ xfs_bmap_add_attrfork_extents(
 }
 
 /*
- * Block initialisation function for local to extent format conversion.
- *
- * This shouldn't actually be called by anyone, so make sure debug kernels cause
- * a noticable failure.
- */
-STATIC void
-xfs_bmap_local_to_extents_init_fn(
-	struct xfs_trans	*tp,
-	struct xfs_buf		*bp,
-	struct xfs_inode	*ip,
-	struct xfs_ifork	*ifp)
-{
-	ASSERT(0);
-	bp->b_ops = &xfs_bmbt_buf_ops;
-	memcpy(bp->b_addr, ifp->if_u1.if_data, ifp->if_bytes);
-	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_BTREE_BUF);
-}
-
-/*
  * Called from xfs_bmap_add_attrfork to handle local format files. Each
  * different data fork content type needs a different callout to do the
  * conversion. Some are basic and only require special block initialisation
@@ -1381,9 +1380,9 @@ xfs_bmap_add_attrfork_local(
 						 flags, XFS_DATA_FORK,
 						 xfs_symlink_local_to_remote);
 
-	return xfs_bmap_local_to_extents(tp, ip, firstblock, 1, flags,
-					 XFS_DATA_FORK,
-					 xfs_bmap_local_to_extents_init_fn);
+	/* should only be called for types that support local format data */
+	ASSERT(0);
+	return EFSCORRUPTED;
 }
 
 /*
@@ -4907,20 +4906,19 @@ xfs_bmapi_write(
 	orig_mval = mval;
 	orig_nmap = *nmap;
 #endif
+	whichfork = (flags & XFS_BMAPI_ATTRFORK) ?
+		XFS_ATTR_FORK : XFS_DATA_FORK;
 
 	ASSERT(*nmap >= 1);
 	ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
 	ASSERT(!(flags & XFS_BMAPI_IGSTATE));
 	ASSERT(tp != NULL);
 	ASSERT(len > 0);
-
-	whichfork = (flags & XFS_BMAPI_ATTRFORK) ?
-		XFS_ATTR_FORK : XFS_DATA_FORK;
+	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
 
 	if (unlikely(XFS_TEST_ERROR(
 	    (XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
-	     XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_BTREE &&
-	     XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL),
+	     XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_BTREE),
 	     mp, XFS_ERRTAG_BMAPIFORMAT, XFS_RANDOM_BMAPIFORMAT))) {
 		XFS_ERROR_REPORT("xfs_bmapi_write", XFS_ERRLEVEL_LOW, mp);
 		return XFS_ERROR(EFSCORRUPTED);
@@ -4933,37 +4931,6 @@ xfs_bmapi_write(
 
 	XFS_STATS_INC(xs_blk_mapw);
 
-	if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_LOCAL) {
-		/*
-		 * XXX (dgc): This assumes we are only called for inodes that
-		 * contain content neutral data in local format. Anything that
-		 * contains caller-specific data in local format that needs
-		 * transformation to move to a block format needs to do the
-		 * conversion to extent format itself.
-		 *
-		 * Directory data forks and attribute forks handle this
-		 * themselves, but with the addition of metadata verifiers every
-		 * data fork in local format now contains caller specific data
-		 * and as such conversion through this function is likely to be
-		 * broken.
-		 *
-		 * The only likely user of this branch is for remote symlinks,
-		 * but we cannot overwrite the data fork contents of the symlink
-		 * (EEXIST occurs higher up the stack) and so it will never go
-		 * from local format to extent format here. Hence I don't think
-		 * this branch is ever executed intentionally and we should
-		 * consider removing it and asserting that xfs_bmapi_write()
-		 * cannot be called directly on local format forks. i.e. callers
-		 * are completely responsible for local to extent format
-		 * conversion, not xfs_bmapi_write().
-		 */
-		error = xfs_bmap_local_to_extents(tp, ip, firstblock, total,
-					&bma.logflags, whichfork,
-					xfs_bmap_local_to_extents_init_fn);
-		if (error)
-			goto error0;
-	}
-
 	if (*firstblock == NULLFSBLOCK) {
 		if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
 			bma.minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
diff --git a/fs/xfs/xfs_bmap.h b/fs/xfs/xfs_bmap.h
index 5f469c3..1cf1292 100644
--- a/fs/xfs/xfs_bmap.h
+++ b/fs/xfs/xfs_bmap.h
@@ -172,6 +172,7 @@ void	xfs_bmap_trace_exlist(struct xfs_inode *ip, xfs_extnum_t cnt,
 #endif
 
 int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
+void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
 void	xfs_bmap_add_free(xfs_fsblock_t bno, xfs_filblks_t len,
 		struct xfs_bmap_free *flist, struct xfs_mount *mp);
 void	xfs_bmap_cancel(struct xfs_bmap_free *flist);
diff --git a/fs/xfs/xfs_dir2_block.c b/fs/xfs/xfs_dir2_block.c
index e59f5fc..53b9aa2 100644
--- a/fs/xfs/xfs_dir2_block.c
+++ b/fs/xfs/xfs_dir2_block.c
@@ -29,6 +29,7 @@
 #include "xfs_dinode.h"
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
+#include "xfs_bmap.h"
 #include "xfs_buf_item.h"
 #include "xfs_dir2.h"
 #include "xfs_dir2_format.h"
@@ -1167,13 +1168,15 @@ xfs_dir2_sf_to_block(
 	__be16			*tagp;		/* end of data entry */
 	xfs_trans_t		*tp;		/* transaction pointer */
 	struct xfs_name		name;
+	struct xfs_ifork	*ifp;
 
 	trace_xfs_dir2_sf_to_block(args);
 
 	dp = args->dp;
 	tp = args->trans;
 	mp = dp->i_mount;
-	ASSERT(dp->i_df.if_flags & XFS_IFINLINE);
+	ifp = XFS_IFORK_PTR(dp, XFS_DATA_FORK);
+	ASSERT(ifp->if_flags & XFS_IFINLINE);
 	/*
 	 * Bomb out if the shortform directory is way too short.
 	 */
@@ -1182,22 +1185,23 @@ xfs_dir2_sf_to_block(
 		return XFS_ERROR(EIO);
 	}
 
-	oldsfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
+	oldsfp = (xfs_dir2_sf_hdr_t *)ifp->if_u1.if_data;
 
-	ASSERT(dp->i_df.if_bytes == dp->i_d.di_size);
-	ASSERT(dp->i_df.if_u1.if_data != NULL);
+	ASSERT(ifp->if_bytes == dp->i_d.di_size);
+	ASSERT(ifp->if_u1.if_data != NULL);
 	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(oldsfp->i8count));
+	ASSERT(dp->i_d.di_nextents == 0);
 
 	/*
 	 * Copy the directory into a temporary buffer.
 	 * Then pitch the incore inode data so we can make extents.
 	 */
-	sfp = kmem_alloc(dp->i_df.if_bytes, KM_SLEEP);
-	memcpy(sfp, oldsfp, dp->i_df.if_bytes);
+	sfp = kmem_alloc(ifp->if_bytes, KM_SLEEP);
+	memcpy(sfp, oldsfp, ifp->if_bytes);
 
-	xfs_idata_realloc(dp, -dp->i_df.if_bytes, XFS_DATA_FORK);
+	xfs_idata_realloc(dp, -ifp->if_bytes, XFS_DATA_FORK);
+	xfs_bmap_local_to_extents_empty(dp, XFS_DATA_FORK);
 	dp->i_d.di_size = 0;
-	xfs_trans_log_inode(tp, dp, XFS_ILOG_CORE);
 
 	/*
 	 * Add block 0 to the inode.
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 14/15] xfs: dquot log reservations are too small
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (12 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 13/15] xfs: remove local fork format handling from xfs_bmapi_write() Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27 14:38   ` Mark Tinguely
  2013-06-28 17:18   ` Chandra Seetharaman
  2013-06-27  6:04 ` [PATCH 15/15] xfs: implement inode change count Dave Chinner
  2013-06-27 19:48 ` [PATCH 00/15] xfs: patchset for 3.11 Ben Myers
  15 siblings, 2 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

During review of the separate project quota inode patches, it bcame
obvious that the dquot log reservation calculation underestimated
the number dquots that can be modified in a transaction. This has
it's roots way back in the Irix quota implementation.

That is, when quotas were first implemented in XFS, it only
supported user and project quotas as Irix did not have group quotas.
Hence the worst case operation involving dquot modification was
calculated to involve 2 user dquots and 1 project dquot or 1 user
dequot and 2 project dquots. i.e. 3 dquots. This was determined back
in 1996, and has remained unchanged ever since.

However, back in 2001, the Linux XFS port dropped all support for
project quota and implmented group quotas over the top. This was
effectively done with a search-and-replace of project with group,
and as such the log reservation was not changed. However, with the
advent of group quotas, chmod and rename now could modify more than
3 dquots in a single transaction - both could modify 4 dquots. Hence
this log reservation has been wrong for a long time.

In 2005, project quotas were reintroduced into Linux, but they were
implemented to be mutually exclusive to group quotas, and so this
didn't add any new changes to the dquot log reservation. hence when
project quotas were in use, everything was still fine, just like
in the Irix days.

Now, with the addition of the separate project quota inode, group
and project quotas are no longer mutually exclusive, and hence
operations can now modify three dquots per inode where previously it
was only two. The worst case here is the rename transaction, which
can allocate/free space on two different directory inodes, and if
they have different uid/gid/prid configurations and are world
writeable, then rename can actually modify 6 different dquots now.

Further, the dquot log reservation doesn't take into account the
space used by the dquot log format structure that preceeds the dquot
that is logged, and hence further underestimates the worst case
log space required by dquots during a transaction.

Hence the worst case log reservation needs to be increased from 3 to
6, and it needs to take into account a log format header for each of
those dquots.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_quota.h       |   26 ++++++++++++++++++++++----
 fs/xfs/xfs_trans_dquot.c |    9 ++++-----
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index c38068f..aa4ec5a 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -108,11 +108,29 @@ typedef struct xfs_dqblk {
 	{ XFS_DQ_FREEING,	"FREEING" }
 
 /*
- * In the worst case, when both user and group quotas are on,
- * we can have a max of three dquots changing in a single transaction.
+ * We have the possibility of all three quota types being active at once,
+ * and hence free space modification requires modify all three current dquots in
+ * a single transaction.  For this case, we need to have a reservation of at
+ * least 3 dquots.
+ *
+ * However, a chmod operation can change both UID and GID in the one
+ * transaction, resulting in requiring {old, new} x {uid, gid} dquots to be
+ * modified. This means we need to reserve space for at least 4 dquots are the
+ * worst case reservation.
+ *
+ * And in the worst case, there's a rename operation that can be modifying up to
+ * 4 inodes with dquots attached to them. In reality, the only inodes that can
+ * have their dquots modified are the source and destination directory inodes
+ * due to directory name creation and removal. That can require space allocation
+ * and/or freeing on both directory inodes, and hence all three dquots on each
+ * inode can be modified. And if the directories are world writeable, all the
+ * dquots can be unique and so 6 dquots can be modified....
+ *
+ * And, of course, we also need to take into account the dquot log format item
+ * used to describe each dquot.
  */
-#define XFS_DQUOT_LOGRES(mp)	(sizeof(xfs_disk_dquot_t) * 3)
-
+#define XFS_DQUOT_LOGRES(mp)	\
+	((sizeof(struct xfs_dq_logformat) + sizeof(struct xfs_disk_dquot)) * 6)
 
 /*
  * These are the structures used to lay out dquots and quotaoff
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index fec75d0..076cc25 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -292,11 +292,10 @@ xfs_trans_mod_dquot(
 
 
 /*
- * Given an array of dqtrx structures, lock all the dquots associated
- * and join them to the transaction, provided they have been modified.
- * We know that the highest number of dquots (of one type - usr OR grp),
- * involved in a transaction is 2 and that both usr and grp combined - 3.
- * So, we don't attempt to make this very generic.
+ * Given an array of dqtrx structures, lock all the dquots associated and join
+ * them to the transaction, provided they have been modified.  We know that the
+ * highest number of dquots of one type - usr, grp OR prj - involved in a
+ * transaction is 2 so we don't need to make this very generic.
  */
 STATIC void
 xfs_trans_dqlockedjoin(
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 15/15] xfs: implement inode change count
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (13 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 14/15] xfs: dquot log reservations are too small Dave Chinner
@ 2013-06-27  6:04 ` Dave Chinner
  2013-06-27 15:06   ` Mark Tinguely
                     ` (2 more replies)
  2013-06-27 19:48 ` [PATCH 00/15] xfs: patchset for 3.11 Ben Myers
  15 siblings, 3 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-27  6:04 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

For CRC enabled filesystems, add support for the monotonic inode
version change counter that is needed by protocols like NFSv4 for
determining if the inode has changed in any way at all between two
unrelated operations on the inode.

This bumps the change count the first time an inode is dirtied in a
transaction. Since all modifications to the inode are logged, this
will catch all changes that are made to the inode, including
timestamp updates that occur during data writes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_super.c       |    4 ++++
 fs/xfs/xfs_trans_inode.c |   11 +++++++++++
 2 files changed, 15 insertions(+)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f59e27f..a1587f9 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1482,6 +1482,10 @@ xfs_fs_fill_super(
 	sb->s_time_gran = 1;
 	set_posix_acl_flag(sb);
 
+	/* version 5 superblocks support inode version counters. */
+	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
+		sb->s_flags |= MS_I_VERSION;
+
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index ac6d567..53dfe46 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -112,6 +112,17 @@ xfs_trans_log_inode(
 	ASSERT(ip->i_itemp != NULL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
+	/*
+	 * First time we log the inode in a transaction, bump the inode change
+	 * counter if it is configured for this to occur.
+	 */
+	if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
+	    IS_I_VERSION(VFS_I(ip))) {
+		inode_inc_iversion(VFS_I(ip));
+		ip->i_d.di_changecount = VFS_I(ip)->i_version;
+		flags |= XFS_ILOG_CORE;
+	}
+
 	tp->t_flags |= XFS_TRANS_DIRTY;
 	ip->i_itemp->ili_item.li_desc->lid_flags |= XFS_LID_DIRTY;
 
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] xfs: dquot log reservations are too small
  2013-06-27  6:04 ` [PATCH 14/15] xfs: dquot log reservations are too small Dave Chinner
@ 2013-06-27 14:38   ` Mark Tinguely
  2013-06-28 17:18   ` Chandra Seetharaman
  1 sibling, 0 replies; 38+ messages in thread
From: Mark Tinguely @ 2013-06-27 14:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 06/27/13 01:04, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> During review of the separate project quota inode patches, it bcame
> obvious that the dquot log reservation calculation underestimated
> the number dquots that can be modified in a transaction. This has
> it's roots way back in the Irix quota implementation.
>
> That is, when quotas were first implemented in XFS, it only
> supported user and project quotas as Irix did not have group quotas.
> Hence the worst case operation involving dquot modification was
> calculated to involve 2 user dquots and 1 project dquot or 1 user
> dequot and 2 project dquots. i.e. 3 dquots. This was determined back
> in 1996, and has remained unchanged ever since.
>
> However, back in 2001, the Linux XFS port dropped all support for
> project quota and implmented group quotas over the top. This was
> effectively done with a search-and-replace of project with group,
> and as such the log reservation was not changed. However, with the
> advent of group quotas, chmod and rename now could modify more than
> 3 dquots in a single transaction - both could modify 4 dquots. Hence
> this log reservation has been wrong for a long time.
>
> In 2005, project quotas were reintroduced into Linux, but they were
> implemented to be mutually exclusive to group quotas, and so this
> didn't add any new changes to the dquot log reservation. hence when
> project quotas were in use, everything was still fine, just like
> in the Irix days.
>
> Now, with the addition of the separate project quota inode, group
> and project quotas are no longer mutually exclusive, and hence
> operations can now modify three dquots per inode where previously it
> was only two. The worst case here is the rename transaction, which
> can allocate/free space on two different directory inodes, and if
> they have different uid/gid/prid configurations and are world
> writeable, then rename can actually modify 6 different dquots now.
>
> Further, the dquot log reservation doesn't take into account the
> space used by the dquot log format structure that preceeds the dquot
> that is logged, and hence further underestimates the worst case
> log space required by dquots during a transaction.
>
> Hence the worst case log reservation needs to be increased from 3 to
> 6, and it needs to take into account a log format header for each of
> those dquots.
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> ---

Looks good.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-27  6:04 ` [PATCH 01/15] xfs: update mount options documentation Dave Chinner
@ 2013-06-27 14:48   ` Ben Myers
  2013-06-27 19:08     ` Ben Myers
  0 siblings, 1 reply; 38+ messages in thread
From: Ben Myers @ 2013-06-27 14:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Jun 27, 2013 at 04:04:45PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because it's horribly out of date.
> 
> And mark various deprecated options as deprecated and give them a
> removal date.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Mark Tinguely <tinguely@sgi.com>

Regarding removal of these mount options and sysctls:  initially these all look
pretty reasonable but we need to be very careful here.  I've read some
discussions on lkml that seem to suggest that such interfaces which have been
exported to userspace shouldn't be removed at all.  Not that I want to keep
around a bunch of worn out interfaces...

Applied.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 15/15] xfs: implement inode change count
  2013-06-27  6:04 ` [PATCH 15/15] xfs: implement inode change count Dave Chinner
@ 2013-06-27 15:06   ` Mark Tinguely
  2013-06-28 16:07   ` Chandra Seetharaman
  2013-06-28 18:00   ` Ben Myers
  2 siblings, 0 replies; 38+ messages in thread
From: Mark Tinguely @ 2013-06-27 15:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 06/27/13 01:04, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> For CRC enabled filesystems, add support for the monotonic inode
> version change counter that is needed by protocols like NFSv4 for
> determining if the inode has changed in any way at all between two
> unrelated operations on the inode.
>
> This bumps the change count the first time an inode is dirtied in a
> transaction. Since all modifications to the inode are logged, this
> will catch all changes that are made to the inode, including
> timestamp updates that occur during data writes.
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> ---

Looks good.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-27 14:48   ` Ben Myers
@ 2013-06-27 19:08     ` Ben Myers
  2013-06-28  2:09       ` Dave Chinner
  2013-06-28  2:18       ` Eric Sandeen
  0 siblings, 2 replies; 38+ messages in thread
From: Ben Myers @ 2013-06-27 19:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hey Dave,

On Thu, Jun 27, 2013 at 09:48:14AM -0500, Ben Myers wrote:
> On Thu, Jun 27, 2013 at 04:04:45PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because it's horribly out of date.
> > 
> > And mark various deprecated options as deprecated and give them a
> > removal date.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Mark Tinguely <tinguely@sgi.com>
> 
> Regarding removal of these mount options and sysctls:  initially these all look
> pretty reasonable but we need to be very careful here.  I've read some
> discussions on lkml that seem to suggest that such interfaces which have been
> exported to userspace shouldn't be removed at all.  Not that I want to keep
> around a bunch of worn out interfaces...
> 
> Applied.

On second thought... Not pushed.

I'm going to hold off on pushing this one to oss for now because I'm just not
comfortable with it yet.  I can pull this in sans the removal notices if you
want.  Lets discuss whether the removal of deprecated mount options and sysctls
is acceptable before announcing an intention to remove them.  I'm trending no,
but I can be flexible if this really is ok.

I'm thinking of the 3.3 glusterfs and 3.8 pulseaudio reakeage.  And I would
really like to have a nice holiday weekend. ;)

Thanks,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/15] xfs: patchset for 3.11
  2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
                   ` (14 preceding siblings ...)
  2013-06-27  6:04 ` [PATCH 15/15] xfs: implement inode change count Dave Chinner
@ 2013-06-27 19:48 ` Ben Myers
  15 siblings, 0 replies; 38+ messages in thread
From: Ben Myers @ 2013-06-27 19:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Jun 27, 2013 at 04:04:44PM +1000, Dave Chinner wrote:
> Here's the current set of patches I have for 3.11. There are 2 new
> patches here - the dquot transaction reservation fix, and a simple
> one to implement the change count functionality on CRC enabled
> filesystems.

Pulled in patches 2 through 12.

-Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-27 19:08     ` Ben Myers
@ 2013-06-28  2:09       ` Dave Chinner
  2013-06-28  2:32         ` Dave Chinner
  2013-06-28  2:18       ` Eric Sandeen
  1 sibling, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2013-06-28  2:09 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Thu, Jun 27, 2013 at 02:08:31PM -0500, Ben Myers wrote:
> Hey Dave,
> 
> On Thu, Jun 27, 2013 at 09:48:14AM -0500, Ben Myers wrote:
> > On Thu, Jun 27, 2013 at 04:04:45PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Because it's horribly out of date.
> > > 
> > > And mark various deprecated options as deprecated and give them a
> > > removal date.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Reviewed-by: Mark Tinguely <tinguely@sgi.com>
> > 
> > Regarding removal of these mount options and sysctls:  initially these all look
> > pretty reasonable but we need to be very careful here.  I've read some
> > discussions on lkml that seem to suggest that such interfaces which have been
> > exported to userspace shouldn't be removed at all.  Not that I want to keep
> > around a bunch of worn out interfaces...
> > 
> > Applied.
> 
> On second thought... Not pushed.
> 
> I'm going to hold off on pushing this one to oss for now because I'm just not
> comfortable with it yet.  I can pull this in sans the removal notices if you
> want.  Lets discuss whether the removal of deprecated mount options and sysctls
> is acceptable before announcing an intention to remove them.  I'm trending no,
> but I can be flexible if this really is ok.

Mount options are perfectly fine to be removed - they've been given
deprecated warnings for quite some time now (the most recent is the
delaylog which has been doing that since 3.1 IIRC). So they are all
fine to actually remove - 12 months warning is usually considered
sufficient.

As to the sysctls - they haven't had any effect since 3.5 when the
xfsbufd was removed, so it's time to mark them deprecated so we can
remove them in a year's time. That gives anyone using them
(including distros) plenty of time to fix whatever is using them
before they get removed.

> I'm thinking of the 3.3 glusterfs and 3.8 pulseaudio reakeage.  And I would
> really like to have a nice holiday weekend. ;)

I think you're being overly paranoid here - I'm simply following the
normal deprecation protocol here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-27 19:08     ` Ben Myers
  2013-06-28  2:09       ` Dave Chinner
@ 2013-06-28  2:18       ` Eric Sandeen
  2013-06-28 20:46         ` Ben Myers
  1 sibling, 1 reply; 38+ messages in thread
From: Eric Sandeen @ 2013-06-28  2:18 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On 6/27/13 3:08 PM, Ben Myers wrote:
> Hey Dave,
> 
> On Thu, Jun 27, 2013 at 09:48:14AM -0500, Ben Myers wrote:
>> On Thu, Jun 27, 2013 at 04:04:45PM +1000, Dave Chinner wrote:
>>> From: Dave Chinner <dchinner@redhat.com>
>>>
>>> Because it's horribly out of date.
>>>
>>> And mark various deprecated options as deprecated and give them a
>>> removal date.
>>>
>>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>>> Reviewed-by: Mark Tinguely <tinguely@sgi.com>
>>
>> Regarding removal of these mount options and sysctls:  initially these all look
>> pretty reasonable but we need to be very careful here.  I've read some
>> discussions on lkml that seem to suggest that such interfaces which have been
>> exported to userspace shouldn't be removed at all.  Not that I want to keep
>> around a bunch of worn out interfaces...
>>
>> Applied.
> 
> On second thought... Not pushed.
> 
> I'm going to hold off on pushing this one to oss for now because I'm just not
> comfortable with it yet.  I can pull this in sans the removal notices if you
> want.  Lets discuss whether the removal of deprecated mount options and sysctls
> is acceptable before announcing an intention to remove them.  I'm trending no,
> but I can be flexible if this really is ok.

There's a long history of deprecating mount options - indeed, even entire
filesystems (smbfs)!  Here's a sampling:

5e2a4b25da232a2f4ce264a4b2ae113d0b2a799c btrfs: deprecate subvolrootid mount option
1b359204901e182b27aa9da432095cbe2cfd2512 cifs: remove support for deprecated "forcedirectio" and "strictcache" mount options
67c1f5295150eb86d065d57b4515a472ecbf008f cifs: add deprecation warning to sockopt=TCP_NODELAY option
43829731dd372d04d6706c51052b9dabab9ca356 workqueue: deprecate flush[_delayed]_work_sync()
09983b2fab80fa037b1dcf9a11de5a70df59ef7f cifs: add deprecation warnings to strictcache and forcedirectio
4d61cd6ec764368689fab3bd19e78d76c1e6b176 cifs: add a deprecation warning to CIFS_IOC_CHECKUMOUNT ioctl
f70486055ee351158bd6999f3965ad378b52c694 ext4: try to deprecate noacl and noxattr_user mount options
87f26807e91102f2526d59c0232bf020479c8d0c ext4: remove deprecation warnings for minix_df and grpid
789b4588da40cf572ef982bdc5d590ec1b0386fe cifs: warn about impending deprecation of legacy MultiuserMount code
93b8a5854f247138e401471a9c3b82ccb62ff608 xfs: remove the deprecated nodelaylog option
8bc4392a1e50f346e97f8777aaefd9cfc3d45c9f cifs: warn about deprecation of /proc/fs/cifs/OplockEnabled interface
4113c4caa4f355b8ff8b7ff0510c29c9d00d30b3 ext4: remove deprecated oldalloc
242d621964dd8641df53f7d51d4c6ead655cc5a6 xfs: deprecate the nodelaylog mount option
fbc854027c91fa2813ae7f9de43cc0b5c1119f41 ext3: remove deprecated oldalloc
be8f684d73d8d916847e996bf69cef14352872c6 oom: make deprecated use of oom_adj more verbose
49b28684fdba2c84a3b8e54aaa0faa9ce2e4f140 nfsd: Remove deprecated nfsctl system call and related code.
52ca0e84b05595cf74f1ff772b3f9807256b1b27 hugetlbfs: lessen the impact of a deprecation warning
e52eec13cd6b7f30ab19081b387813e03e592ae5 SYSFS: Allow boot time switching between deprecated and modern sysfs layout
1e1405673e4e40a94ed7620553eb440a21040402 nfsd: allow deprecated interface to be compiled out.
c67874f942e30039442d925b03793e0a46ddcddd nfsd: formally deprecate legacy nfsd syscall interface
51b1bd2ace1595b72956224deda349efa880b693 oom: deprecate oom_adj tunable
a4432345352c2be157ed844603147ac2c82f209c NFSv41: Deprecate nfs_client->cl_minorversion
437ca0fda3b442dff9e591581b5e1ffdfec24660 ext4: deprecate obsoleted mount options
ed58f8027945f1cf415bfe3805e1fa3fe8ed9edf fs/smbfs/inode.c: fix warning message deprecating smbfs
c773633916c66f8362ca01983d97bd33e35b743f deprecate smbfs in favour of cifs
8e9073ed027771bcdee4033eb900a3c09ac90a19 Deprecate a.out ELF interpreters
e10a4437cb37c85f2df95432025b392d98aac2aa [PATCH] Remove final references to deprecated "MAP_ANON" page protection flag
f1ddcaf3393b7a3871809b97fae90fac841a1f39 [CRYPTO] api: Remove deprecated interface
1ec320afdc9552c92191d5f89fcd1ebe588334ca [PATCH] add process_session() helper routine: deprecate old field

Marking as deprecated, and putting back into play if people REALLY care?
That's been done too:

87f26807e91102f2526d59c0232bf020479c8d0c ext4: remove deprecation warnings for minix_df and grpid

> I'm thinking of the 3.3 glusterfs and 3.8 pulseaudio reakeage.  And I would
> really like to have a nice holiday weekend. ;)

Is "reakage" a freudian slip?  :)  I dunno about PA, but the gluster thing
wasn't related to deprecation.

-Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-28  2:09       ` Dave Chinner
@ 2013-06-28  2:32         ` Dave Chinner
  2013-06-28 15:39           ` Geoffrey Wehrman
  2013-06-28 19:39           ` Ben Myers
  0 siblings, 2 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-28  2:32 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Fri, Jun 28, 2013 at 12:09:12PM +1000, Dave Chinner wrote:
> On Thu, Jun 27, 2013 at 02:08:31PM -0500, Ben Myers wrote:
> > Hey Dave,
> > 
> > On Thu, Jun 27, 2013 at 09:48:14AM -0500, Ben Myers wrote:
> > > On Thu, Jun 27, 2013 at 04:04:45PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Because it's horribly out of date.
> > > > 
> > > > And mark various deprecated options as deprecated and give them a
> > > > removal date.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > Reviewed-by: Mark Tinguely <tinguely@sgi.com>
> > > 
> > > Regarding removal of these mount options and sysctls:  initially these all look
> > > pretty reasonable but we need to be very careful here.  I've read some
> > > discussions on lkml that seem to suggest that such interfaces which have been
> > > exported to userspace shouldn't be removed at all.  Not that I want to keep
> > > around a bunch of worn out interfaces...
> > > 
> > > Applied.
> > 
> > On second thought... Not pushed.
> > 
> > I'm going to hold off on pushing this one to oss for now because I'm just not
> > comfortable with it yet.  I can pull this in sans the removal notices if you
> > want.  Lets discuss whether the removal of deprecated mount options and sysctls
> > is acceptable before announcing an intention to remove them.  I'm trending no,
> > but I can be flexible if this really is ok.
> 
> Mount options are perfectly fine to be removed - they've been given
> deprecated warnings for quite some time now (the most recent is the
> delaylog which has been doing that since 3.1 IIRC). So they are all
> fine to actually remove - 12 months warning is usually considered
> sufficient.
> 
> As to the sysctls - they haven't had any effect since 3.5 when the
> xfsbufd was removed, so it's time to mark them deprecated so we can
> remove them in a year's time. That gives anyone using them
> (including distros) plenty of time to fix whatever is using them
> before they get removed.
> 
> > I'm thinking of the 3.3 glusterfs and 3.8 pulseaudio reakeage.  And I would
> > really like to have a nice holiday weekend. ;)
> 
> I think you're being overly paranoid here - I'm simply following the
> normal deprecation protocol here....

Documenation/ABI/README:

We have four different levels of ABI stability, as shown by the four
different subdirectories in this location.  Interfaces may chang levels
of stability according to the rules described below.
....
 obsolete/
         This directory documents interfaces that are still remaining in
	 the kernel, but are marked to be removed at some later point in
	 time.  The description of the interface will document the reason
	 why it is obsolete and when it can be expected to be removed.

I think you'll find that what I done follows this policy. If you
really want, I'll move them to Documenation/ABI/obsolete.  And, of
course, if removing them proves to be a problem, as Eric said we can
always reinstate them or remove the deprecation notices.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-28  2:32         ` Dave Chinner
@ 2013-06-28 15:39           ` Geoffrey Wehrman
  2013-06-28 16:49             ` Eric Sandeen
  2013-06-28 17:27             ` Ric Wheeler
  2013-06-28 19:39           ` Ben Myers
  1 sibling, 2 replies; 38+ messages in thread
From: Geoffrey Wehrman @ 2013-06-28 15:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ben Myers, xfs

On Fri, Jun 28, 2013 at 12:32:04PM +1000, Dave Chinner wrote:
| On Fri, Jun 28, 2013 at 12:09:12PM +1000, Dave Chinner wrote:
| > Mount options are perfectly fine to be removed - they've been given
| > deprecated warnings for quite some time now (the most recent is the
| > delaylog which has been doing that since 3.1 IIRC). So they are all
| > fine to actually remove - 12 months warning is usually considered
| > sufficient.

I hardly consider 12 years to be sufficient.  I have no problem with
deprecating and disabling mount options so that they are ineffective,
but removing them so that an administrator gets an error when upgrading
his system is irresponsible product management, especially when it
requires almost no effort to keep the deprecated, disabled interface.

You move to newer kernels much faster than most people.  Doesn't Red Hat
still support Red Hat 5?  How old is that kernel?  One of the reasons I
and others dread upgrading systems is because there are always
interfaces that change, always data conversions that have to be run,
always new processes to learn.  I realize that XFS is still an evolving
filesystem, by historically one of its greatest achievements has been
that of backward compatibility.  When XFS was ported from IRIX to Linux,
the same filesystem could be used without any conversion.  Why force a
user to modify his fstab just because he has upgraded his kernel?

| > As to the sysctls - they haven't had any effect since 3.5 when the
| > xfsbufd was removed, so it's time to mark them deprecated so we can
| > remove them in a year's time. That gives anyone using them
| > (including distros) plenty of time to fix whatever is using them
| > before they get removed.
| > 
| > > I'm thinking of the 3.3 glusterfs and 3.8 pulseaudio reakeage.  And I would
| > > really like to have a nice holiday weekend. ;)
| > 
| > I think you're being overly paranoid here - I'm simply following the
| > normal deprecation protocol here....
| 
| Documenation/ABI/README:
| 
| We have four different levels of ABI stability, as shown by the four
| different subdirectories in this location.  Interfaces may change levels
| of stability according to the rules described below.
| ....
|  obsolete/
|          This directory documents interfaces that are still remaining in
| 	 the kernel, but are marked to be removed at some later point in
| 	 time.  The description of the interface will document the reason
| 	 why it is obsolete and when it can be expected to be removed.


| I think you'll find that what I done follows this policy. If you
| really want, I'll move them to Documenation/ABI/obsolete.  And, of
| course, if removing them proves to be a problem, as Eric said we can
| always reinstate them or remove the deprecation notices.

It is great that Linux has a documented life cycle for kernel to userspace
interfaces.  These are guidelines for the minimum requirements.  Move the
mount options to obsolete.  I have no problems with making mount options
obsolete.  Remove them and people will make a big fuss.


-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 15/15] xfs: implement inode change count
  2013-06-27  6:04 ` [PATCH 15/15] xfs: implement inode change count Dave Chinner
  2013-06-27 15:06   ` Mark Tinguely
@ 2013-06-28 16:07   ` Chandra Seetharaman
  2013-06-28 18:00   ` Ben Myers
  2 siblings, 0 replies; 38+ messages in thread
From: Chandra Seetharaman @ 2013-06-28 16:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, 2013-06-27 at 16:04 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For CRC enabled filesystems, add support for the monotonic inode
> version change counter that is needed by protocols like NFSv4 for
> determining if the inode has changed in any way at all between two
> unrelated operations on the inode.
> 
> This bumps the change count the first time an inode is dirtied in a
> transaction. Since all modifications to the inode are logged, this
> will catch all changes that are made to the inode, including
> timestamp updates that occur during data writes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com>

> ---
>  fs/xfs/xfs_super.c       |    4 ++++
>  fs/xfs/xfs_trans_inode.c |   11 +++++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index f59e27f..a1587f9 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1482,6 +1482,10 @@ xfs_fs_fill_super(
>  	sb->s_time_gran = 1;
>  	set_posix_acl_flag(sb);
> 
> +	/* version 5 superblocks support inode version counters. */
> +	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
> +		sb->s_flags |= MS_I_VERSION;
> +
>  	error = xfs_mountfs(mp);
>  	if (error)
>  		goto out_filestream_unmount;
> diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> index ac6d567..53dfe46 100644
> --- a/fs/xfs/xfs_trans_inode.c
> +++ b/fs/xfs/xfs_trans_inode.c
> @@ -112,6 +112,17 @@ xfs_trans_log_inode(
>  	ASSERT(ip->i_itemp != NULL);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> 
> +	/*
> +	 * First time we log the inode in a transaction, bump the inode change
> +	 * counter if it is configured for this to occur.
> +	 */
> +	if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
> +	    IS_I_VERSION(VFS_I(ip))) {
> +		inode_inc_iversion(VFS_I(ip));
> +		ip->i_d.di_changecount = VFS_I(ip)->i_version;
> +		flags |= XFS_ILOG_CORE;
> +	}
> +
>  	tp->t_flags |= XFS_TRANS_DIRTY;
>  	ip->i_itemp->ili_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> 


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-28 15:39           ` Geoffrey Wehrman
@ 2013-06-28 16:49             ` Eric Sandeen
  2013-06-28 19:58               ` Geoffrey Wehrman
  2013-06-28 17:27             ` Ric Wheeler
  1 sibling, 1 reply; 38+ messages in thread
From: Eric Sandeen @ 2013-06-28 16:49 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: Ben Myers, xfs

On 6/28/13 11:39 AM, Geoffrey Wehrman wrote:
> On Fri, Jun 28, 2013 at 12:32:04PM +1000, Dave Chinner wrote:
> | On Fri, Jun 28, 2013 at 12:09:12PM +1000, Dave Chinner wrote:
> | > Mount options are perfectly fine to be removed - they've been given
> | > deprecated warnings for quite some time now (the most recent is the
> | > delaylog which has been doing that since 3.1 IIRC). So they are all
> | > fine to actually remove - 12 months warning is usually considered
> | > sufficient.
> 
> I hardly consider 12 years to be sufficient.  I have no problem with

12 years, really?  Maybe you meant months.  I hope you meant months!

> deprecating and disabling mount options so that they are ineffective,
> but removing them so that an administrator gets an error when upgrading
> his system is irresponsible product management, especially when it
> requires almost no effort to keep the deprecated, disabled interface.
> 
> You move to newer kernels much faster than most people.  Doesn't Red Hat
> still support Red Hat 5?  How old is that kernel?  

Yes, of course we do - it's based on 2.6.18, which was released in 2006.
But I'm not sure how that's relevant?

> One of the reasons I
> and others dread upgrading systems is because there are always
> interfaces that change, always data conversions that have to be run,
> always new processes to learn.  I realize that XFS is still an evolving
> filesystem, by historically one of its greatest achievements has been
> that of backward compatibility.  When XFS was ported from IRIX to Linux,
> the same filesystem could be used without any conversion.  Why force a
> user to modify his fstab just because he has upgraded his kernel?

That's quite a leap; year-long, orderly deprecation of mount options is
hardly comparable breaking on-disk formats during a port.

But of course we (RHEL) take migration seriously too, which is why we have
migration guides to help admins make the transition, documenting the many
necessary changes when doing a large upgrade.

For the big jumps that come from distro upgrades, they can document these
changes along with myriad others.

For people rolling their own kernels & updating every now and then, they'll
have a 12 month window of warnings to give them time to adapt and/or
request that the option be kept.

If you're upgrading a kernel which is older than 12 months, quite frankly
you'll have a fair bit more to do than worry about than "irixsgid" mount
options when you make that big a jump.  Updating fstab will be the least
of your worries.

There is no need to accumulate dead, no-op code in xfs.  It's one
of the reasons closed-source proprietary code ends up being so crufty
and hard to maintain - years and years of accumulated gunk nobody dares
touch.  We can do better than that.

Deprecating old code in an orderly & documented fashion is a longstanding
best practice in the linux kernel.  And we've already removed other XFS
mount options, and the world didn't end:

f538d4da8d521746ca5ebf8c1a8105eb49bfb45e (removed nologflush mount option)
288699fecaffa1ef8f75f92020cbb593a772e487 xfs: drop dmapi hooks (removed dmapi mount options)
a64afb057b607c04383ab5fb53c51421ba18c434 xfs: remove obsolete osyncisosync mount option
a19d9f887d81106d52cacbc9930207b487e07e0e xfs: kill ino64 mount option

We've removed sysctls too:

e0b8e8b65d578f5d5538465dff8392cf02e1cc5d [XFS] remove restricted chown parameter from xfs linux

Seriously - this is how we maintain code.

> | > As to the sysctls - they haven't had any effect since 3.5 when the
> | > xfsbufd was removed, so it's time to mark them deprecated so we can
> | > remove them in a year's time. That gives anyone using them
> | > (including distros) plenty of time to fix whatever is using them
> | > before they get removed.
> | > 
> | > > I'm thinking of the 3.3 glusterfs and 3.8 pulseaudio reakeage.  And I would
> | > > really like to have a nice holiday weekend. ;)
> | > 
> | > I think you're being overly paranoid here - I'm simply following the
> | > normal deprecation protocol here....
> | 
> | Documenation/ABI/README:
> | 
> | We have four different levels of ABI stability, as shown by the four
> | different subdirectories in this location.  Interfaces may change levels
> | of stability according to the rules described below.
> | ....
> |  obsolete/
> |          This directory documents interfaces that are still remaining in
> | 	 the kernel, but are marked to be removed at some later point in
> | 	 time.  The description of the interface will document the reason
> | 	 why it is obsolete and when it can be expected to be removed.
> 
> 
> | I think you'll find that what I done follows this policy. If you
> | really want, I'll move them to Documenation/ABI/obsolete.  And, of
> | course, if removing them proves to be a problem, as Eric said we can
> | always reinstate them or remove the deprecation notices.
> 
> It is great that Linux has a documented life cycle for kernel to userspace	
> interfaces.  These are guidelines for the minimum requirements.  Move the
> mount options to obsolete.  I have no problems with making mount options
> obsolete.  Remove them and people will make a big fuss.

Well, AFAIK nobody fussed over nodelaylog, ino64, osyncisosync, or the
dmapi options.  Nobody fussed over the restrict_chown sysctl.

And as I mentioned earlier, the whole point of the deprecation period is to let
people speak up if they need it.  Ext4 put options back when people fussed; we
can do that too as needed.  (Unless you consider this a pre-emptive fuss.  ;)
The spirit of the fuss-rule, though, is that people actually *using* it can
fuss; meta-fussing hasn't traditionally counted.)

I understand that you don't want to surprise or inconvenience users; that
should be balanced with keeping the code current, orderly & un-crufty.
A 12-month notification & RFC period strikes that balance well, I think.

-Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] xfs: dquot log reservations are too small
  2013-06-27  6:04 ` [PATCH 14/15] xfs: dquot log reservations are too small Dave Chinner
  2013-06-27 14:38   ` Mark Tinguely
@ 2013-06-28 17:18   ` Chandra Seetharaman
  2013-06-29  2:42     ` Dave Chinner
  1 sibling, 1 reply; 38+ messages in thread
From: Chandra Seetharaman @ 2013-06-28 17:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs


some nits below on commit log and comments.

otherwise consider

Reviewed-by: Chandra Seethraman <sekharan@us.ibm.com>

On Thu, 2013-06-27 at 16:04 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> During review of the separate project quota inode patches, it bcame
became(typo)

> obvious that the dquot log reservation calculation underestimated
> the number dquots that can be modified in a transaction. This has
> it's roots way back in the Irix quota implementation.
> 
> That is, when quotas were first implemented in XFS, it only
> supported user and project quotas as Irix did not have group quotas.
> Hence the worst case operation involving dquot modification was
> calculated to involve 2 user dquots and 1 project dquot or 1 user
> dequot and 2 project dquots. i.e. 3 dquots. This was determined back
> in 1996, and has remained unchanged ever since.

How about the missing reservation for the log format header ? It is a
day one problem, right ?

> 
> However, back in 2001, the Linux XFS port dropped all support for
> project quota and implmented group quotas over the top. This was

implemented(typo)

> effectively done with a search-and-replace of project with group,
> and as such the log reservation was not changed. However, with the
> advent of group quotas, chmod and rename now could modify more than
> 3 dquots in a single transaction - both could modify 4 dquots. Hence
> this log reservation has been wrong for a long time.
> 
> In 2005, project quotas were reintroduced into Linux, but they were

introduced ? (it was mentioned above that 2001 Linux port removed
project quota, so it never existed in Linux ?!)

> implemented to be mutually exclusive to group quotas, and so this
> didn't add any new changes to the dquot log reservation. hence when
> project quotas were in use, everything was still fine, just like
> in the Irix days.

you can say that... but, when the group quota is used the problem
mentioned above exists.

> 
> Now, with the addition of the separate project quota inode, group
> and project quotas are no longer mutually exclusive, and hence
> operations can now modify three dquots per inode where previously it
> was only two. The worst case here is the rename transaction, which
> can allocate/free space on two different directory inodes, and if
> they have different uid/gid/prid configurations and are world
> writeable, then rename can actually modify 6 different dquots now.
> 
> Further, the dquot log reservation doesn't take into account the
> space used by the dquot log format structure that preceeds the dquot

precedes(typo)

> that is logged, and hence further underestimates the worst case
> log space required by dquots during a transaction.
> 
> Hence the worst case log reservation needs to be increased from 3 to
> 6, and it needs to take into account a log format header for each of
> those dquots.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_quota.h       |   26 ++++++++++++++++++++++----
>  fs/xfs/xfs_trans_dquot.c |    9 ++++-----
>  2 files changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
> index c38068f..aa4ec5a 100644
> --- a/fs/xfs/xfs_quota.h
> +++ b/fs/xfs/xfs_quota.h
> @@ -108,11 +108,29 @@ typedef struct xfs_dqblk {
>  	{ XFS_DQ_FREEING,	"FREEING" }
> 
>  /*
> - * In the worst case, when both user and group quotas are on,
> - * we can have a max of three dquots changing in a single transaction.
> + * We have the possibility of all three quota types being active at once,
> + * and hence free space modification requires modify all three current dquots in

s/requires modify all/requires modification of all/ ?

> + * a single transaction.  For this case, we need to have a reservation of at
> + * least 3 dquots.
> + *
> + * However, a chmod operation can change both UID and GID in the one

s/the// ?

> + * transaction, resulting in requiring {old, new} x {uid, gid} dquots to be
> + * modified. This means we need to reserve space for at least 4 dquots are the

s/are/for/ ?
> + * worst case reservation.

This not the worst case (since the next case is the worst case),
correct ?

We can just say "for this case". what do you think ?

> + *
> + * And in the worst case, there's a rename operation that can be modifying up to
> + * 4 inodes with dquots attached to them. In reality, the only inodes that can
> + * have their dquots modified are the source and destination directory inodes
> + * due to directory name creation and removal. That can require space allocation
> + * and/or freeing on both directory inodes, and hence all three dquots on each
> + * inode can be modified. And if the directories are world writeable, all the
> + * dquots can be unique and so 6 dquots can be modified....
> + *
> + * And, of course, we also need to take into account the dquot log format item
> + * used to describe each dquot.
>   */
> -#define XFS_DQUOT_LOGRES(mp)	(sizeof(xfs_disk_dquot_t) * 3)
> -
> +#define XFS_DQUOT_LOGRES(mp)	\
> +	((sizeof(struct xfs_dq_logformat) + sizeof(struct xfs_disk_dquot)) * 6)
> 
>  /*
>   * These are the structures used to lay out dquots and quotaoff
> diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> index fec75d0..076cc25 100644
> --- a/fs/xfs/xfs_trans_dquot.c
> +++ b/fs/xfs/xfs_trans_dquot.c
> @@ -292,11 +292,10 @@ xfs_trans_mod_dquot(
> 
> 
>  /*
> - * Given an array of dqtrx structures, lock all the dquots associated
> - * and join them to the transaction, provided they have been modified.
> - * We know that the highest number of dquots (of one type - usr OR grp),
> - * involved in a transaction is 2 and that both usr and grp combined - 3.
> - * So, we don't attempt to make this very generic.
> + * Given an array of dqtrx structures, lock all the dquots associated and join
> + * them to the transaction, provided they have been modified.  We know that the
> + * highest number of dquots of one type - usr, grp OR prj - involved in a
> + * transaction is 2 so we don't need to make this very generic.

much clear now.
>   */
>  STATIC void
>  xfs_trans_dqlockedjoin(


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-28 15:39           ` Geoffrey Wehrman
  2013-06-28 16:49             ` Eric Sandeen
@ 2013-06-28 17:27             ` Ric Wheeler
  1 sibling, 0 replies; 38+ messages in thread
From: Ric Wheeler @ 2013-06-28 17:27 UTC (permalink / raw)
  To: xfs

On 06/28/2013 11:39 AM, Geoffrey Wehrman wrote:
> On Fri, Jun 28, 2013 at 12:32:04PM +1000, Dave Chinner wrote:
> | On Fri, Jun 28, 2013 at 12:09:12PM +1000, Dave Chinner wrote:
> | > Mount options are perfectly fine to be removed - they've been given
> | > deprecated warnings for quite some time now (the most recent is the
> | > delaylog which has been doing that since 3.1 IIRC). So they are all
> | > fine to actually remove - 12 months warning is usually considered
> | > sufficient.
>
> I hardly consider 12 years to be sufficient.  I have no problem with
> deprecating and disabling mount options so that they are ineffective,
> but removing them so that an administrator gets an error when upgrading
> his system is irresponsible product management, especially when it
> requires almost no effort to keep the deprecated, disabled interface.

In my opinion, 12 years is way more than enough time to let a user know that a 
mount option no longer works.

I think that you need to keep a clear and crisp difference in your mind between 
XFS as an upstream project and something commercially supported.

What you do as a vendor is entirely your call. Feel free to keep things 
supported for 12 years if you like, but that is unreasonable to do to an open 
source project that is shared by multiple users (individual and corporate).

Of course, if you dislike a patch that upstream does or a whole feature set, it 
is usually pretty easy to carry vendor specific patches to disable it.

Regards,

Ric

>
> You move to newer kernels much faster than most people.  Doesn't Red Hat
> still support Red Hat 5?  How old is that kernel?  One of the reasons I
> and others dread upgrading systems is because there are always
> interfaces that change, always data conversions that have to be run,
> always new processes to learn.  I realize that XFS is still an evolving
> filesystem, by historically one of its greatest achievements has been
> that of backward compatibility.  When XFS was ported from IRIX to Linux,
> the same filesystem could be used without any conversion.  Why force a
> user to modify his fstab just because he has upgraded his kernel?
>
> | > As to the sysctls - they haven't had any effect since 3.5 when the
> | > xfsbufd was removed, so it's time to mark them deprecated so we can
> | > remove them in a year's time. That gives anyone using them
> | > (including distros) plenty of time to fix whatever is using them
> | > before they get removed.
> | >
> | > > I'm thinking of the 3.3 glusterfs and 3.8 pulseaudio reakeage.  And I would
> | > > really like to have a nice holiday weekend. ;)
> | >
> | > I think you're being overly paranoid here - I'm simply following the
> | > normal deprecation protocol here....
> |
> | Documenation/ABI/README:
> |
> | We have four different levels of ABI stability, as shown by the four
> | different subdirectories in this location.  Interfaces may change levels
> | of stability according to the rules described below.
> | ....
> |  obsolete/
> |          This directory documents interfaces that are still remaining in
> | 	 the kernel, but are marked to be removed at some later point in
> | 	 time.  The description of the interface will document the reason
> | 	 why it is obsolete and when it can be expected to be removed.
>
>
> | I think you'll find that what I done follows this policy. If you
> | really want, I'll move them to Documenation/ABI/obsolete.  And, of
> | course, if removing them proves to be a problem, as Eric said we can
> | always reinstate them or remove the deprecation notices.
>
> It is great that Linux has a documented life cycle for kernel to userspace
> interfaces.  These are guidelines for the minimum requirements.  Move the
> mount options to obsolete.  I have no problems with making mount options
> obsolete.  Remove them and people will make a big fuss.
>
>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 15/15] xfs: implement inode change count
  2013-06-27  6:04 ` [PATCH 15/15] xfs: implement inode change count Dave Chinner
  2013-06-27 15:06   ` Mark Tinguely
  2013-06-28 16:07   ` Chandra Seetharaman
@ 2013-06-28 18:00   ` Ben Myers
  2 siblings, 0 replies; 38+ messages in thread
From: Ben Myers @ 2013-06-28 18:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Jun 27, 2013 at 04:04:59PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For CRC enabled filesystems, add support for the monotonic inode
> version change counter that is needed by protocols like NFSv4 for
> determining if the inode has changed in any way at all between two
> unrelated operations on the inode.
> 
> This bumps the change count the first time an inode is dirtied in a
> transaction. Since all modifications to the inode are logged, this
> will catch all changes that are made to the inode, including
> timestamp updates that occur during data writes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Applied.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-28  2:32         ` Dave Chinner
  2013-06-28 15:39           ` Geoffrey Wehrman
@ 2013-06-28 19:39           ` Ben Myers
  2013-06-29  2:38             ` Dave Chinner
  1 sibling, 1 reply; 38+ messages in thread
From: Ben Myers @ 2013-06-28 19:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hey Dave,

On Fri, Jun 28, 2013 at 12:32:04PM +1000, Dave Chinner wrote:
> On Fri, Jun 28, 2013 at 12:09:12PM +1000, Dave Chinner wrote:
> > On Thu, Jun 27, 2013 at 02:08:31PM -0500, Ben Myers wrote:
> > > Hey Dave,
> > > 
> > > On Thu, Jun 27, 2013 at 09:48:14AM -0500, Ben Myers wrote:
> > > > On Thu, Jun 27, 2013 at 04:04:45PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > Because it's horribly out of date.
> > > > > 
> > > > > And mark various deprecated options as deprecated and give them a
> > > > > removal date.
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > Reviewed-by: Mark Tinguely <tinguely@sgi.com>
> > > > 
> > > > Regarding removal of these mount options and sysctls:  initially these all look
> > > > pretty reasonable but we need to be very careful here.  I've read some
> > > > discussions on lkml that seem to suggest that such interfaces which have been
> > > > exported to userspace shouldn't be removed at all.  Not that I want to keep
> > > > around a bunch of worn out interfaces...
> > > > 
> > > > Applied.
> > > 
> > > On second thought... Not pushed.
> > > 
> > > I'm going to hold off on pushing this one to oss for now because I'm just not
> > > comfortable with it yet.  I can pull this in sans the removal notices if you
> > > want.  Lets discuss whether the removal of deprecated mount options and sysctls
> > > is acceptable before announcing an intention to remove them.  I'm trending no,
> > > but I can be flexible if this really is ok.
> > 
> > Mount options are perfectly fine to be removed - they've been given
> > deprecated warnings for quite some time now (the most recent is the
> > delaylog which has been doing that since 3.1 IIRC). So they are all
> > fine to actually remove - 12 months warning is usually considered
> > sufficient.
> > 
> > As to the sysctls - they haven't had any effect since 3.5 when the
> > xfsbufd was removed, so it's time to mark them deprecated so we can
> > remove them in a year's time. That gives anyone using them
> > (including distros) plenty of time to fix whatever is using them
> > before they get removed.
> > 
> > > I'm thinking of the 3.3 glusterfs and 3.8 pulseaudio reakeage.  And I would
> > > really like to have a nice holiday weekend. ;)
> > 
> > I think you're being overly paranoid here - I'm simply following the
> > normal deprecation protocol here....
> 
> Documenation/ABI/README:
> 
> We have four different levels of ABI stability, as shown by the four
> different subdirectories in this location.  Interfaces may chang levels
> of stability according to the rules described below.
> ....
>  obsolete/
>          This directory documents interfaces that are still remaining in
> 	 the kernel, but are marked to be removed at some later point in
> 	 time.  The description of the interface will document the reason
> 	 why it is obsolete and when it can be expected to be removed.
> 
> I think you'll find that what I done follows this policy.

Thanks.  That's exactly the sort of doc I am looking for.  I'll check it out.
I really just want to make sure that we're not going to be breaking userspace
by removing these...

> If you really want, I'll move them to Documenation/ABI/obsolete.  And, of
> course, if removing them proves to be a problem, as Eric said we can always
> reinstate them or remove the deprecation notices.

I forgot to mention that noatime seems to be missing now.  Was that intentional?

-Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-28 16:49             ` Eric Sandeen
@ 2013-06-28 19:58               ` Geoffrey Wehrman
  0 siblings, 0 replies; 38+ messages in thread
From: Geoffrey Wehrman @ 2013-06-28 19:58 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Ben Myers, xfs

On Fri, Jun 28, 2013 at 12:49:20PM -0400, Eric Sandeen wrote:
| On 6/28/13 11:39 AM, Geoffrey Wehrman wrote:
| > On Fri, Jun 28, 2013 at 12:32:04PM +1000, Dave Chinner wrote:
| > | On Fri, Jun 28, 2013 at 12:09:12PM +1000, Dave Chinner wrote:
| > | > Mount options are perfectly fine to be removed - they've been given
| > | > deprecated warnings for quite some time now (the most recent is the
| > | > delaylog which has been doing that since 3.1 IIRC). So they are all
| > | > fine to actually remove - 12 months warning is usually considered
| > | > sufficient.
| > 
| > I hardly consider 12 years to be sufficient.  I have no problem with
| 
| 12 years, really?  Maybe you meant months.  I hope you meant months!

I really did mean 12 years.  About as long as my dog lived.

| > | > I think you're being overly paranoid here - I'm simply following the
| > | > normal deprecation protocol here....
| > | 
| > | Documenation/ABI/README:
| > | 
| > | We have four different levels of ABI stability, as shown by the four
| > | different subdirectories in this location.  Interfaces may change levels
| > | of stability according to the rules described below.
| > | ....
| > |  obsolete/
| > |          This directory documents interfaces that are still remaining in
| > | 	 the kernel, but are marked to be removed at some later point in
| > | 	 time.  The description of the interface will document the reason
| > | 	 why it is obsolete and when it can be expected to be removed.
| > 
| > 
| > | I think you'll find that what I done follows this policy. If you
| > | really want, I'll move them to Documenation/ABI/obsolete.  And, of
| > | course, if removing them proves to be a problem, as Eric said we can
| > | always reinstate them or remove the deprecation notices.

Please move the mount options and sysctl interfaces removed to
Documenation/ABI/obsolete.  In return, I'll refrain from ranting further
about the ABI instability of Linux.


-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-28  2:18       ` Eric Sandeen
@ 2013-06-28 20:46         ` Ben Myers
  0 siblings, 0 replies; 38+ messages in thread
From: Ben Myers @ 2013-06-28 20:46 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

Eric,

On Thu, Jun 27, 2013 at 10:18:20PM -0400, Eric Sandeen wrote:
> On 6/27/13 3:08 PM, Ben Myers wrote:
> > Hey Dave,
> > 
> > On Thu, Jun 27, 2013 at 09:48:14AM -0500, Ben Myers wrote:
> >> On Thu, Jun 27, 2013 at 04:04:45PM +1000, Dave Chinner wrote:
> >>> From: Dave Chinner <dchinner@redhat.com>
> >>>
> >>> Because it's horribly out of date.
> >>>
> >>> And mark various deprecated options as deprecated and give them a
> >>> removal date.
> >>>
> >>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> >>> Reviewed-by: Mark Tinguely <tinguely@sgi.com>
> >>
> >> Regarding removal of these mount options and sysctls:  initially these all look
> >> pretty reasonable but we need to be very careful here.  I've read some
> >> discussions on lkml that seem to suggest that such interfaces which have been
> >> exported to userspace shouldn't be removed at all.  Not that I want to keep
> >> around a bunch of worn out interfaces...
> >>
> >> Applied.
> > 
> > On second thought... Not pushed.
> > 
> > I'm going to hold off on pushing this one to oss for now because I'm just not
> > comfortable with it yet.  I can pull this in sans the removal notices if you
> > want.  Lets discuss whether the removal of deprecated mount options and sysctls
> > is acceptable before announcing an intention to remove them.  I'm trending no,
> > but I can be flexible if this really is ok.
> 
> There's a long history of deprecating mount options - indeed, even entire
> filesystems (smbfs)!  Here's a sampling:
> 
> 5e2a4b25da232a2f4ce264a4b2ae113d0b2a799c btrfs: deprecate subvolrootid mount option
> 1b359204901e182b27aa9da432095cbe2cfd2512 cifs: remove support for deprecated "forcedirectio" and "strictcache" mount options
> 67c1f5295150eb86d065d57b4515a472ecbf008f cifs: add deprecation warning to sockopt=TCP_NODELAY option
> 43829731dd372d04d6706c51052b9dabab9ca356 workqueue: deprecate flush[_delayed]_work_sync()
> 09983b2fab80fa037b1dcf9a11de5a70df59ef7f cifs: add deprecation warnings to strictcache and forcedirectio
> 4d61cd6ec764368689fab3bd19e78d76c1e6b176 cifs: add a deprecation warning to CIFS_IOC_CHECKUMOUNT ioctl
> f70486055ee351158bd6999f3965ad378b52c694 ext4: try to deprecate noacl and noxattr_user mount options
> 87f26807e91102f2526d59c0232bf020479c8d0c ext4: remove deprecation warnings for minix_df and grpid
> 789b4588da40cf572ef982bdc5d590ec1b0386fe cifs: warn about impending deprecation of legacy MultiuserMount code
> 93b8a5854f247138e401471a9c3b82ccb62ff608 xfs: remove the deprecated nodelaylog option
> 8bc4392a1e50f346e97f8777aaefd9cfc3d45c9f cifs: warn about deprecation of /proc/fs/cifs/OplockEnabled interface
> 4113c4caa4f355b8ff8b7ff0510c29c9d00d30b3 ext4: remove deprecated oldalloc
> 242d621964dd8641df53f7d51d4c6ead655cc5a6 xfs: deprecate the nodelaylog mount option
> fbc854027c91fa2813ae7f9de43cc0b5c1119f41 ext3: remove deprecated oldalloc
> be8f684d73d8d916847e996bf69cef14352872c6 oom: make deprecated use of oom_adj more verbose
> 49b28684fdba2c84a3b8e54aaa0faa9ce2e4f140 nfsd: Remove deprecated nfsctl system call and related code.
> 52ca0e84b05595cf74f1ff772b3f9807256b1b27 hugetlbfs: lessen the impact of a deprecation warning
> e52eec13cd6b7f30ab19081b387813e03e592ae5 SYSFS: Allow boot time switching between deprecated and modern sysfs layout
> 1e1405673e4e40a94ed7620553eb440a21040402 nfsd: allow deprecated interface to be compiled out.
> c67874f942e30039442d925b03793e0a46ddcddd nfsd: formally deprecate legacy nfsd syscall interface
> 51b1bd2ace1595b72956224deda349efa880b693 oom: deprecate oom_adj tunable
> a4432345352c2be157ed844603147ac2c82f209c NFSv41: Deprecate nfs_client->cl_minorversion
> 437ca0fda3b442dff9e591581b5e1ffdfec24660 ext4: deprecate obsoleted mount options
> ed58f8027945f1cf415bfe3805e1fa3fe8ed9edf fs/smbfs/inode.c: fix warning message deprecating smbfs
> c773633916c66f8362ca01983d97bd33e35b743f deprecate smbfs in favour of cifs
> 8e9073ed027771bcdee4033eb900a3c09ac90a19 Deprecate a.out ELF interpreters
> e10a4437cb37c85f2df95432025b392d98aac2aa [PATCH] Remove final references to deprecated "MAP_ANON" page protection flag
> f1ddcaf3393b7a3871809b97fae90fac841a1f39 [CRYPTO] api: Remove deprecated interface
> 1ec320afdc9552c92191d5f89fcd1ebe588334ca [PATCH] add process_session() helper routine: deprecate old field

RIP smbfs.

I'm cool with the deprecation of mount options.  What if we were to completely
remove the deprecated option in such a way that mount returns EINVAL when
someone uses it?  If we remove a sysctl that some application expects to be
there, haven't we broken the application?  That seems like it could be an
important issue, but maybe that's not Dave's intent here.

I took a peek at a few of the above.  The ones I saw seem to print a message
'deprecated mount option is being ignored'...  That seems to be the way to go.

> Marking as deprecated, and putting back into play if people REALLY care?
> That's been done too:
> 
> 87f26807e91102f2526d59c0232bf020479c8d0c ext4: remove deprecation warnings for minix_df and grpid
> 
> > I'm thinking of the 3.3 glusterfs and 3.8 pulseaudio reakeage.  And I would
> > really like to have a nice holiday weekend. ;)
> 
> Is "reakage" a freudian slip?  :)

Yeah, there was supposed to be a 'b' on the front. ;)

> I dunno about PA,

See:
https://lkml.org/lkml/2012/12/23/75

That story made /.
       	
> but the gluster thing wasn't related to deprecation.

There is more than one way to break userspace, right?  My concern is that the
removal of deprecated options and sysctls could be one of them.  Anyway, I'll
check out the doc Dave pointed to and come back to this.

Thanks,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/15] xfs: update mount options documentation
  2013-06-28 19:39           ` Ben Myers
@ 2013-06-29  2:38             ` Dave Chinner
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-06-29  2:38 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Fri, Jun 28, 2013 at 02:39:59PM -0500, Ben Myers wrote:
> Hey Dave,
> 
> On Fri, Jun 28, 2013 at 12:32:04PM +1000, Dave Chinner wrote:
> > On Fri, Jun 28, 2013 at 12:09:12PM +1000, Dave Chinner wrote:
> > > On Thu, Jun 27, 2013 at 02:08:31PM -0500, Ben Myers wrote:
> > > > Hey Dave,
> > > > 
> > > > On Thu, Jun 27, 2013 at 09:48:14AM -0500, Ben Myers wrote:
> > > > > On Thu, Jun 27, 2013 at 04:04:45PM +1000, Dave Chinner wrote:
> > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > 
> > > > > > Because it's horribly out of date.
> > > > > > 
> > > > > > And mark various deprecated options as deprecated and give them a
> > > > > > removal date.
> > > > > > 
> > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > > Reviewed-by: Mark Tinguely <tinguely@sgi.com>
> > > > > 
> > > > > Regarding removal of these mount options and sysctls:  initially these all look
> > > > > pretty reasonable but we need to be very careful here.  I've read some
> > > > > discussions on lkml that seem to suggest that such interfaces which have been
> > > > > exported to userspace shouldn't be removed at all.  Not that I want to keep
> > > > > around a bunch of worn out interfaces...
> > > > > 
> > > > > Applied.
> > > > 
> > > > On second thought... Not pushed.
> > > > 
> > > > I'm going to hold off on pushing this one to oss for now because I'm just not
> > > > comfortable with it yet.  I can pull this in sans the removal notices if you
> > > > want.  Lets discuss whether the removal of deprecated mount options and sysctls
> > > > is acceptable before announcing an intention to remove them.  I'm trending no,
> > > > but I can be flexible if this really is ok.
> > > 
> > > Mount options are perfectly fine to be removed - they've been given
> > > deprecated warnings for quite some time now (the most recent is the
> > > delaylog which has been doing that since 3.1 IIRC). So they are all
> > > fine to actually remove - 12 months warning is usually considered
> > > sufficient.
> > > 
> > > As to the sysctls - they haven't had any effect since 3.5 when the
> > > xfsbufd was removed, so it's time to mark them deprecated so we can
> > > remove them in a year's time. That gives anyone using them
> > > (including distros) plenty of time to fix whatever is using them
> > > before they get removed.
> > > 
> > > > I'm thinking of the 3.3 glusterfs and 3.8 pulseaudio reakeage.  And I would
> > > > really like to have a nice holiday weekend. ;)
> > > 
> > > I think you're being overly paranoid here - I'm simply following the
> > > normal deprecation protocol here....
> > 
> > Documenation/ABI/README:
> > 
> > We have four different levels of ABI stability, as shown by the four
> > different subdirectories in this location.  Interfaces may chang levels
> > of stability according to the rules described below.
> > ....
> >  obsolete/
> >          This directory documents interfaces that are still remaining in
> > 	 the kernel, but are marked to be removed at some later point in
> > 	 time.  The description of the interface will document the reason
> > 	 why it is obsolete and when it can be expected to be removed.
> > 
> > I think you'll find that what I done follows this policy.
> 
> Thanks.  That's exactly the sort of doc I am looking for.  I'll check it out.
> I really just want to make sure that we're not going to be breaking userspace
> by removing these...
> 
> > If you really want, I'll move them to Documenation/ABI/obsolete.  And, of
> > course, if removing them proves to be a problem, as Eric said we can always
> > reinstate them or remove the deprecation notices.
> 
> I forgot to mention that noatime seems to be missing now.  Was that intentional?

Yes - it's not an XFS mount option - it's handled by the VFS and
documented as a common mount option for all filesystems just like
nodiratime, relatime, strictatime, sync, dirsync, nosuid, noexec,
etc. IOWs, there's no need to document it as an XFS specific mount
option....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] xfs: dquot log reservations are too small
  2013-06-28 17:18   ` Chandra Seetharaman
@ 2013-06-29  2:42     ` Dave Chinner
  2013-07-09 19:31       ` Ben Myers
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2013-06-29  2:42 UTC (permalink / raw)
  To: Chandra Seetharaman; +Cc: xfs

On Fri, Jun 28, 2013 at 12:18:22PM -0500, Chandra Seetharaman wrote:
> 
> some nits below on commit log and comments.
> 
> otherwise consider
> 
> Reviewed-by: Chandra Seethraman <sekharan@us.ibm.com>
> 
> On Thu, 2013-06-27 at 16:04 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > During review of the separate project quota inode patches, it bcame
> became(typo)
> 
> > obvious that the dquot log reservation calculation underestimated
> > the number dquots that can be modified in a transaction. This has
> > it's roots way back in the Irix quota implementation.
> > 
> > That is, when quotas were first implemented in XFS, it only
> > supported user and project quotas as Irix did not have group quotas.
> > Hence the worst case operation involving dquot modification was
> > calculated to involve 2 user dquots and 1 project dquot or 1 user
> > dequot and 2 project dquots. i.e. 3 dquots. This was determined back
> > in 1996, and has remained unchanged ever since.
> 
> How about the missing reservation for the log format header ? It is a
> day one problem, right ?

Yup.

> > However, back in 2001, the Linux XFS port dropped all support for
> > project quota and implmented group quotas over the top. This was
> 
> implemented(typo)
> 
> > effectively done with a search-and-replace of project with group,
> > and as such the log reservation was not changed. However, with the
> > advent of group quotas, chmod and rename now could modify more than
> > 3 dquots in a single transaction - both could modify 4 dquots. Hence
> > this log reservation has been wrong for a long time.
> > 
> > In 2005, project quotas were reintroduced into Linux, but they were
> 
> introduced ? (it was mentioned above that 2001 Linux port removed
> project quota, so it never existed in Linux ?!)

I'm talking from a code perspective, not whether they worked or not
from a user persepctive.....

> > implemented to be mutually exclusive to group quotas, and so this
> > didn't add any new changes to the dquot log reservation. hence when
> > project quotas were in use, everything was still fine, just like
> > in the Irix days.
> 
> you can say that... but, when the group quota is used the problem
> mentioned above exists.

Yes, it does. All this points out was that group quotas were broken,
not project quotas....

I'll fix up the various typos and clarify the comments as you've
suggested.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] xfs: dquot log reservations are too small
  2013-06-29  2:42     ` Dave Chinner
@ 2013-07-09 19:31       ` Ben Myers
  2013-07-09 20:39         ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Ben Myers @ 2013-07-09 19:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandra Seetharaman, xfs

Hey Dave,

On Sat, Jun 29, 2013 at 12:42:05PM +1000, Dave Chinner wrote:
> I'll fix up the various typos and clarify the comments as you've
> suggested.

Are you still planning on reposting this one for 3.11?

Thanks,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] xfs: dquot log reservations are too small
  2013-07-09 19:31       ` Ben Myers
@ 2013-07-09 20:39         ` Dave Chinner
  2013-07-09 20:42           ` Ben Myers
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2013-07-09 20:39 UTC (permalink / raw)
  To: Ben Myers; +Cc: Chandra Seetharaman, xfs

On Tue, Jul 09, 2013 at 02:31:28PM -0500, Ben Myers wrote:
> Hey Dave,
> 
> On Sat, Jun 29, 2013 at 12:42:05PM +1000, Dave Chinner wrote:
> > I'll fix up the various typos and clarify the comments as you've
> > suggested.
> 
> Are you still planning on reposting this one for 3.11?

Yes. I was waiting for you to push out an updated tree so I could
rebase everything I have before reposting....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] xfs: dquot log reservations are too small
  2013-07-09 20:39         ` Dave Chinner
@ 2013-07-09 20:42           ` Ben Myers
  0 siblings, 0 replies; 38+ messages in thread
From: Ben Myers @ 2013-07-09 20:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandra Seetharaman, xfs

On Wed, Jul 10, 2013 at 06:39:31AM +1000, Dave Chinner wrote:
> On Tue, Jul 09, 2013 at 02:31:28PM -0500, Ben Myers wrote:
> > Hey Dave,
> > 
> > On Sat, Jun 29, 2013 at 12:42:05PM +1000, Dave Chinner wrote:
> > > I'll fix up the various typos and clarify the comments as you've
> > > suggested.
> > 
> > Are you still planning on reposting this one for 3.11?
> 
> Yes. I was waiting for you to push out an updated tree so I could
> rebase everything I have before reposting....

Oh.  Well it's out there.  ;)

-Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2013-07-09 20:42 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-27  6:04 [PATCH 00/15] xfs: patchset for 3.11 Dave Chinner
2013-06-27  6:04 ` [PATCH 01/15] xfs: update mount options documentation Dave Chinner
2013-06-27 14:48   ` Ben Myers
2013-06-27 19:08     ` Ben Myers
2013-06-28  2:09       ` Dave Chinner
2013-06-28  2:32         ` Dave Chinner
2013-06-28 15:39           ` Geoffrey Wehrman
2013-06-28 16:49             ` Eric Sandeen
2013-06-28 19:58               ` Geoffrey Wehrman
2013-06-28 17:27             ` Ric Wheeler
2013-06-28 19:39           ` Ben Myers
2013-06-29  2:38             ` Dave Chinner
2013-06-28  2:18       ` Eric Sandeen
2013-06-28 20:46         ` Ben Myers
2013-06-27  6:04 ` [PATCH 02/15] xfs: add pluging for bulkstat readahead Dave Chinner
2013-06-27  6:04 ` [PATCH 03/15] xfs: plug directory buffer readahead Dave Chinner
2013-06-27  6:04 ` [PATCH 04/15] xfs: don't use speculative prealloc for small files Dave Chinner
2013-06-27  6:04 ` [PATCH 05/15] xfs: don't do IO when creating an new inode Dave Chinner
2013-06-27  6:04 ` [PATCH 06/15] xfs: xfs_ifree doesn't need to modify the inode buffer Dave Chinner
2013-06-27  6:04 ` [PATCH 07/15] xfs: Introduce ordered log vector support Dave Chinner
2013-06-27  6:04 ` [PATCH 08/15] xfs: Introduce an ordered buffer item Dave Chinner
2013-06-27  6:04 ` [PATCH 09/15] xfs: Inode create log items Dave Chinner
2013-06-27  6:04 ` [PATCH 10/15] xfs: Inode create transaction reservations Dave Chinner
2013-06-27  6:04 ` [PATCH 11/15] xfs: Inode create item recovery Dave Chinner
2013-06-27  6:04 ` [PATCH 12/15] xfs: Use inode create transaction Dave Chinner
2013-06-27  6:04 ` [PATCH 13/15] xfs: remove local fork format handling from xfs_bmapi_write() Dave Chinner
2013-06-27  6:04 ` [PATCH 14/15] xfs: dquot log reservations are too small Dave Chinner
2013-06-27 14:38   ` Mark Tinguely
2013-06-28 17:18   ` Chandra Seetharaman
2013-06-29  2:42     ` Dave Chinner
2013-07-09 19:31       ` Ben Myers
2013-07-09 20:39         ` Dave Chinner
2013-07-09 20:42           ` Ben Myers
2013-06-27  6:04 ` [PATCH 15/15] xfs: implement inode change count Dave Chinner
2013-06-27 15:06   ` Mark Tinguely
2013-06-28 16:07   ` Chandra Seetharaman
2013-06-28 18:00   ` Ben Myers
2013-06-27 19:48 ` [PATCH 00/15] xfs: patchset for 3.11 Ben Myers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.