* [PATCH 0/6 V2] xfs: upfront block zeroing for DAX
@ 2015-10-19  3:27 Dave Chinner
  2015-10-19  3:27 ` [PATCH 1/6] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 47+ messages in thread
From: Dave Chinner @ 2015-10-19  3:27 UTC (permalink / raw)
  To: xfs; +Cc: ross.zwisler, jack

Hi folks,

This is an updated patch set that was first posted here:

http://oss.sgi.com/archives/xfs/2015-10/msg00006.html

I've dropped the DAX locking revert patch from it; that's on its
way to Linus via other channels and is essentially independent of
this set of XFS changes.

The only real change in the XFS code between the two versions is the
addition of XFS_TRANS_RESERVE in the DAX path in
xfs_iomap_write_direct() to allow it to dip into the reserve block
pool for unwritten extent conversion rather than reporting ENOSPC.
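
For reference, the relevant hunk in xfs_iomap_write_direct() (quoted
from patch 3 below) now reads:

	if (IS_DAX(VFS_I(ip))) {
		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
		tp->t_flags |= XFS_TRANS_RESERVE;
	}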

Patches are against 4.3-rc5 + XFS for-next branch.

-Dave.


* [PATCH 1/6] xfs: fix inode size update overflow in xfs_map_direct()
  2015-10-19  3:27 [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Dave Chinner
@ 2015-10-19  3:27 ` Dave Chinner
  2015-10-29 14:27   ` Brian Foster
  2015-10-19  3:27 ` [PATCH 2/6] xfs: introduce BMAPI_ZERO for allocating zeroed extents Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2015-10-19  3:27 UTC (permalink / raw)
  To: xfs; +Cc: ross.zwisler, jack

From: Dave Chinner <dchinner@redhat.com>

Both direct IO and DAX pass an offset and count into get_blocks that
will overflow a s64 variable when an IO goes into the last supported
block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen
from the tracing:

xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096
xfs_gbmap_direct:     [...] offset 0x7ffffffffffff000 count 4096
xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096

0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that
overflows the s64 offset and we fail to detect the need for a
filesize update and an ioend is not allocated.
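
In other words, the addition wraps to a negative value, so the
existing "offset + size > i_size_read(inode)" test evaluates false.
The check added below also treats a negative result as a
size-extending IO:

	} else if (type == XFS_IO_UNWRITTEN ||
		   offset + size > i_size_read(inode) ||
		   offset + size < 0) {
		ioend = xfs_alloc_ioend(inode, type);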

This is *mostly* avoided for direct IO because such extending IOs
occur with full block allocation, and so the "IS_UNWRITTEN()" check
still evaluates as true and we get an ioend that way. However, doing
single sector extending IOs to this last block will expose the fact
that file size updates will not occur after the first allocating
direct IO as the overflow will then be exposed.

There is one further complexity: the DAX page fault path also
exposes the same issue in block allocation. However, page faults
cannot extend the file size, so in this case we want to allocate the
block but do not want to allocate an ioend to enable file size
update at IO completion. Hence we now need to distinguish between
the direct IO path allocation and the DAX fault path allocation to
avoid leaking ioend structures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_aops.h |  2 ++
 fs/xfs/xfs_file.c |  6 +++---
 3 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index e4fff58..366e41eb 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1259,13 +1259,28 @@ xfs_vm_releasepage(
  * the DIO. There is only going to be one reference to the ioend and its life
  * cycle is constrained by the DIO completion code. hence we don't need
  * reference counting here.
+ *
+ * Note that for DIO, an IO to the highest supported file block offset (i.e.
+ * 2^63 - 1FSB bytes) will result in the offset + count overflowing a signed 64
+ * bit variable. Hence if we see this overflow, we have to assume that the IO is
+ * extending the file size. We won't know for sure until IO completion is run
+ * and the actual max write offset is communicated to the IO completion
+ * routine.
+ *
+ * For DAX page faults, we are preparing to never see unwritten extents here,
+ * nor should we ever extend the inode size. Hence we will soon have nothing to
+ * do here for this case, ensuring we don't have to provide an IO completion
+ * callback to free an ioend that we don't actually need for a fault into the
+ * page at offset (2^63 - 1FSB) bytes.
  */
+
 static void
 xfs_map_direct(
 	struct inode		*inode,
 	struct buffer_head	*bh_result,
 	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset)
+	xfs_off_t		offset,
+	bool			dax_fault)
 {
 	struct xfs_ioend	*ioend;
 	xfs_off_t		size = bh_result->b_size;
@@ -1278,6 +1293,16 @@ xfs_map_direct(
 
 	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
 
+	/* XXX: preparation for removing unwritten extents in DAX */
+#if 0
+	if (dax_fault) {
+		ASSERT(type == XFS_IO_OVERWRITE);
+		trace_xfs_gbmap_direct_none(XFS_I(inode), offset, size, type,
+					    imap);
+		return;
+	}
+#endif
+
 	if (bh_result->b_private) {
 		ioend = bh_result->b_private;
 		ASSERT(ioend->io_size > 0);
@@ -1292,7 +1317,8 @@ xfs_map_direct(
 					      ioend->io_size, ioend->io_type,
 					      imap);
 	} else if (type == XFS_IO_UNWRITTEN ||
-		   offset + size > i_size_read(inode)) {
+		   offset + size > i_size_read(inode) ||
+		   offset + size < 0) {
 		ioend = xfs_alloc_ioend(inode, type);
 		ioend->io_offset = offset;
 		ioend->io_size = size;
@@ -1354,7 +1380,8 @@ __xfs_get_blocks(
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	bool			direct)
+	bool			direct,
+	bool			dax_fault)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1467,7 +1494,8 @@ __xfs_get_blocks(
 			set_buffer_unwritten(bh_result);
 		/* direct IO needs special help */
 		if (create && direct)
-			xfs_map_direct(inode, bh_result, &imap, offset);
+			xfs_map_direct(inode, bh_result, &imap, offset,
+				       dax_fault);
 	}
 
 	/*
@@ -1514,7 +1542,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
 }
 
 int
@@ -1524,7 +1552,17 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
+}
+
+int
+xfs_get_blocks_dax_fault(
+	struct inode		*inode,
+	sector_t		iblock,
+	struct buffer_head	*bh_result,
+	int			create)
+{
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
 }
 
 static void
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 86afd1a..d39ba25 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -58,6 +58,8 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
 		       struct buffer_head *map_bh, int create);
 int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
 			      struct buffer_head *map_bh, int create);
+int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
+			         struct buffer_head *map_bh, int create);
 void	xfs_end_io_dax_write(struct buffer_head *bh, int uptodate);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2f7b6bd..7f873bc 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1508,7 +1508,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_direct,
+		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault,
 				    xfs_end_io_dax_write);
 	} else {
 		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
@@ -1543,7 +1543,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = __dax_fault(vma, vmf, xfs_get_blocks_direct, NULL);
+		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
@@ -1570,7 +1570,7 @@ xfs_filemap_pmd_fault(
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_direct,
+	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
 				    xfs_end_io_dax_write);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
-- 
2.5.0


* [PATCH 2/6] xfs: introduce BMAPI_ZERO for allocating zeroed extents
  2015-10-19  3:27 [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Dave Chinner
  2015-10-19  3:27 ` [PATCH 1/6] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
@ 2015-10-19  3:27 ` Dave Chinner
  2015-10-29 14:27   ` Brian Foster
  2015-10-19  3:27 ` [PATCH 3/6] xfs: Don't use unwritten extents for DAX Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2015-10-19  3:27 UTC (permalink / raw)
  To: xfs; +Cc: ross.zwisler, jack

From: Dave Chinner <dchinner@redhat.com>

To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. requested range spans allocated and
hole regions, allocation into the hole was contiguous), we cannot
zero the extent returned from xfs_bmapi_write() as that can
overwrite existing data with zeros.

Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate this need down to
the xfs_alloc_vextent() and issue the block zeroing at this point.
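
Concretely, the zeroing is issued immediately after a successful
allocation in xfs_alloc_vextent() (see the hunk below):

	/* Zero the extent if we were asked to do so */
	if (args->userdata & XFS_ALLOC_USERDATA_ZERO) {
		error = xfs_zero_extent(args->ip, args->fsbno, args->len);
		if (error)
			goto error0;
	}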

While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can pre-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices so doesn't tend to get used. We can,
however, hook hardware zeroing optimisations via sb_issue_zeroout()
to this operation, so it may be useful in future and hence the
"allocate zeroed blocks" API needs to be implementation neutral.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 10 +++++++++-
 fs/xfs/libxfs/xfs_alloc.h |  8 +++++---
 fs/xfs/libxfs/xfs_bmap.c  | 25 +++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_bmap.h  | 13 +++++++++++--
 fs/xfs/xfs_bmap_util.c    | 36 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h        |  3 +++
 6 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index e926197..3479294 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2509,7 +2509,7 @@ xfs_alloc_vextent(
 		 * Try near allocation first, then anywhere-in-ag after
 		 * the first a.g. fails.
 		 */
-		if ((args->userdata  == XFS_ALLOC_INITIAL_USER_DATA) &&
+		if ((args->userdata & XFS_ALLOC_INITIAL_USER_DATA) &&
 		    (mp->m_flags & XFS_MOUNT_32BITINODES)) {
 			args->fsbno = XFS_AGB_TO_FSB(mp,
 					((mp->m_agfrotor / rotorstep) %
@@ -2640,6 +2640,14 @@ xfs_alloc_vextent(
 		XFS_AG_CHECK_DADDR(mp, XFS_FSB_TO_DADDR(mp, args->fsbno),
 			args->len);
 #endif
+
+		/* Zero the extent if we were asked to do so */
+		if (args->userdata & XFS_ALLOC_USERDATA_ZERO) {
+			error = xfs_zero_extent(args->ip, args->fsbno, args->len);
+			if (error)
+				goto error0;
+		}
+
 	}
 	xfs_perag_put(args->pag);
 	return 0;
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index ca1c816..0ecde4d 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -101,6 +101,7 @@ typedef struct xfs_alloc_arg {
 	struct xfs_mount *mp;		/* file system mount point */
 	struct xfs_buf	*agbp;		/* buffer for a.g. freelist header */
 	struct xfs_perag *pag;		/* per-ag struct for this agno */
+	struct xfs_inode *ip;		/* for userdata zeroing method */
 	xfs_fsblock_t	fsbno;		/* file system block number */
 	xfs_agnumber_t	agno;		/* allocation group number */
 	xfs_agblock_t	agbno;		/* allocation group-relative block # */
@@ -120,15 +121,16 @@ typedef struct xfs_alloc_arg {
 	char		wasdel;		/* set if allocation was prev delayed */
 	char		wasfromfl;	/* set if allocation is from freelist */
 	char		isfl;		/* set if is freelist blocks - !acctg */
-	char		userdata;	/* set if this is user data */
+	char		userdata;	/* mask defining userdata treatment */
 	xfs_fsblock_t	firstblock;	/* io first block allocated */
 } xfs_alloc_arg_t;
 
 /*
  * Defines for userdata
  */
-#define XFS_ALLOC_USERDATA		1	/* allocation is for user data*/
-#define XFS_ALLOC_INITIAL_USER_DATA	2	/* special case start of file */
+#define XFS_ALLOC_USERDATA		(1 << 0)/* allocation is for user data*/
+#define XFS_ALLOC_INITIAL_USER_DATA	(1 << 1)/* special case start of file */
+#define XFS_ALLOC_USERDATA_ZERO		(1 << 2)/* zero extent on allocation */
 
 xfs_extlen_t xfs_alloc_longest_free_extent(struct xfs_mount *mp,
 		struct xfs_perag *pag, xfs_extlen_t need);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ab92d10..590cbec 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3802,8 +3802,13 @@ xfs_bmap_btalloc(
 	args.wasdel = ap->wasdel;
 	args.isfl = 0;
 	args.userdata = ap->userdata;
-	if ((error = xfs_alloc_vextent(&args)))
+	if (ap->userdata & XFS_ALLOC_USERDATA_ZERO)
+		args.ip = ap->ip;
+
+	error = xfs_alloc_vextent(&args);
+	if (error)
 		return error;
+
 	if (tryagain && args.fsbno == NULLFSBLOCK) {
 		/*
 		 * Exact allocation failed. Now try with alignment
@@ -4302,11 +4307,14 @@ xfs_bmapi_allocate(
 
 	/*
 	 * Indicate if this is the first user data in the file, or just any
-	 * user data.
+	 * user data. And if it is userdata, indicate whether it needs to
+	 * be initialised to zero during allocation.
 	 */
 	if (!(bma->flags & XFS_BMAPI_METADATA)) {
 		bma->userdata = (bma->offset == 0) ?
 			XFS_ALLOC_INITIAL_USER_DATA : XFS_ALLOC_USERDATA;
+		if (bma->flags & XFS_BMAPI_ZERO)
+			bma->userdata |= XFS_ALLOC_USERDATA_ZERO;
 	}
 
 	bma->minlen = (bma->flags & XFS_BMAPI_CONTIG) ? bma->length : 1;
@@ -4421,6 +4429,17 @@ xfs_bmapi_convert_unwritten(
 	mval->br_state = (mval->br_state == XFS_EXT_UNWRITTEN)
 				? XFS_EXT_NORM : XFS_EXT_UNWRITTEN;
 
+	/*
+	 * Before insertion into the bmbt, zero the range being converted
+	 * if required.
+	 */
+	if (flags & XFS_BMAPI_ZERO) {
+		error = xfs_zero_extent(bma->ip, mval->br_startblock,
+					mval->br_blockcount);
+		if (error)
+			return error;
+	}
+
 	error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
 			&bma->cur, mval, bma->firstblock, bma->flist,
 			&tmp_logflags);
@@ -4513,6 +4532,8 @@ xfs_bmapi_write(
 	ASSERT(len > 0);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
+			(XFS_BMAPI_METADATA | XFS_BMAPI_ZERO));
 
 	if (unlikely(XFS_TEST_ERROR(
 	    (XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 6aaa0c1..a160f8a 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -52,9 +52,9 @@ struct xfs_bmalloca {
 	xfs_extlen_t		minleft; /* amount must be left after alloc */
 	bool			eof;	/* set if allocating past last extent */
 	bool			wasdel;	/* replacing a delayed allocation */
-	bool			userdata;/* set if is user data */
 	bool			aeof;	/* allocated space at eof */
 	bool			conv;	/* overwriting unwritten extents */
+	char			userdata;/* userdata mask */
 	int			flags;
 };
 
@@ -109,6 +109,14 @@ typedef	struct xfs_bmap_free
  */
 #define XFS_BMAPI_CONVERT	0x040
 
+/*
+ * allocate zeroed extents - this requires all newly allocated user data extents
+ * to be initialised to zero. It will be ignored if XFS_BMAPI_METADATA is set.
+ * Use in conjunction with XFS_BMAPI_CONVERT to convert unwritten extents found
+ * during the allocation range to zeroed written extents.
+ */
+#define XFS_BMAPI_ZERO		0x080
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -116,7 +124,8 @@ typedef	struct xfs_bmap_free
 	{ XFS_BMAPI_PREALLOC,	"PREALLOC" }, \
 	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
 	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
-	{ XFS_BMAPI_CONVERT,	"CONVERT" }
+	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
+	{ XFS_BMAPI_ZERO,	"ZERO" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index eca325e..dbae649 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -57,6 +57,35 @@ xfs_fsb_to_db(struct xfs_inode *ip, xfs_fsblock_t fsb)
 }
 
 /*
+ * Routine to zero an extent on disk allocated to the specific inode.
+ *
+ * The VFS functions take a linearised filesystem block offset, so we have to
+ * convert the sparse xfs fsb to the right format first.
+ * VFS types are real funky, too.
+ */
+int
+xfs_zero_extent(
+	struct xfs_inode *ip,
+	xfs_fsblock_t	start_fsb,
+	xfs_off_t	count_fsb)
+{
+	struct xfs_mount *mp = ip->i_mount;
+	xfs_daddr_t	sector = xfs_fsb_to_db(ip, start_fsb);
+	sector_t	block = XFS_BB_TO_FSBT(mp, sector);
+	ssize_t		size = XFS_FSB_TO_B(mp, count_fsb);
+
+	if (IS_DAX(VFS_I(ip)))
+		return dax_clear_blocks(VFS_I(ip), block, size);
+
+	/*
+	 * let the block layer decide on the fastest method of
+	 * implementing the zeroing.
+	 */
+	return sb_issue_zeroout(mp->m_super, block, count_fsb, GFP_NOFS);
+
+}
+
+/*
  * Routine to be called at transaction's end by xfs_bmapi, xfs_bunmapi
  * caller.  Frees all the extents that need freeing, which must be done
  * last due to locking considerations.  We never free any extents in
@@ -229,6 +258,13 @@ xfs_bmap_rtalloc(
 		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
 			ap->wasdel ? XFS_TRANS_DQ_DELRTBCOUNT :
 					XFS_TRANS_DQ_RTBCOUNT, (long) ralen);
+
+		/* Zero the extent if we were asked to do so */
+		if (ap->userdata & XFS_ALLOC_USERDATA_ZERO) {
+			error = xfs_zero_extent(ap->ip, ap->blkno, ap->length);
+			if (error)
+				return error;
+		}
 	} else {
 		ap->length = 0;
 	}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8795272..f20e5de 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -337,4 +337,7 @@ extern int	xfs_dev_is_read_only(struct xfs_mount *, char *);
 
 extern void	xfs_set_low_space_thresholds(struct xfs_mount *);
 
+int	xfs_zero_extent(struct xfs_inode *ip, xfs_fsblock_t start_fsb,
+			xfs_off_t count_fsb);
+
 #endif	/* __XFS_MOUNT_H__ */
-- 
2.5.0


* [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-10-19  3:27 [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Dave Chinner
  2015-10-19  3:27 ` [PATCH 1/6] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
  2015-10-19  3:27 ` [PATCH 2/6] xfs: introduce BMAPI_ZERO for allocating zeroed extents Dave Chinner
@ 2015-10-19  3:27 ` Dave Chinner
  2015-10-29 14:29   ` Brian Foster
  2015-10-19  3:27 ` [PATCH 4/6] xfs: DAX does not use IO completion callbacks Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2015-10-19  3:27 UTC (permalink / raw)
  To: xfs; +Cc: ross.zwisler, jack

From: Dave Chinner <dchinner@redhat.com>

DAX has a page fault serialisation problem with block allocation.
Because it allows concurrent page faults and does not have a page
lock to serialise faults to the same page, it can get two concurrent
faults to the page that race.

When two read faults race, this isn't a huge problem as the data
underlying the page is not changing and so "detect and drop" works
just fine. The issues are to do with write faults.

When two write faults occur, we serialise block allocation in
get_blocks() so only one fault will allocate the extent. It will,
however, be marked as an unwritten extent, and that is where the
problem lies - the DAX fault code cannot differentiate between a
block that was just allocated and a block that was preallocated and
needs zeroing. The result is that both write faults end up zeroing
the block and attempting to convert it back to written.

The problem is that the first fault can zero and convert before the
second fault starts zeroing, resulting in the zeroing for the second
fault overwriting the data that the first fault wrote with zeros.
The second fault then attempts to convert the unwritten extent,
which is then a no-op because it's already written. Data loss occurs
as a result of this race.

Because there is no sane locking construct in the page fault code
that we can use for serialisation across the page faults, we need to
ensure block allocation and zeroing occurs atomically in the
filesystem. This means we can still take concurrent page faults and
the only time they will serialise is in the filesystem
mapping/allocation callback. The page fault code will always see
written, initialised extents, so we will be able to remove the
unwritten extent handling from the DAX code when all filesystems are
converted.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c           |  5 +++++
 fs/xfs/xfs_aops.c  | 13 +++++++++----
 fs/xfs/xfs_iomap.c | 21 ++++++++++++++++++++-
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index bcfb14b..571c631 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -29,6 +29,11 @@
 #include <linux/uio.h>
 #include <linux/vmstat.h>
 
+/*
+ * dax_clear_blocks() is called from within transaction context from XFS,
+ * and hence this means the stack from this point must follow GFP_NOFS
+ * semantics for all operations.
+ */
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 366e41eb..7b4f849 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1293,15 +1293,12 @@ xfs_map_direct(
 
 	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
 
-	/* XXX: preparation for removing unwritten extents in DAX */
-#if 0
 	if (dax_fault) {
 		ASSERT(type == XFS_IO_OVERWRITE);
 		trace_xfs_gbmap_direct_none(XFS_I(inode), offset, size, type,
 					    imap);
 		return;
 	}
-#endif
 
 	if (bh_result->b_private) {
 		ioend = bh_result->b_private;
@@ -1429,10 +1426,12 @@ __xfs_get_blocks(
 	if (error)
 		goto out_unlock;
 
+	/* for DAX, we convert unwritten extents directly */
 	if (create &&
 	    (!nimaps ||
 	     (imap.br_startblock == HOLESTARTBLOCK ||
-	      imap.br_startblock == DELAYSTARTBLOCK))) {
+	      imap.br_startblock == DELAYSTARTBLOCK) ||
+	     (IS_DAX(inode) && ISUNWRITTEN(&imap)))) {
 		if (direct || xfs_get_extsz_hint(ip)) {
 			/*
 			 * xfs_iomap_write_direct() expects the shared lock. It
@@ -1477,6 +1476,12 @@ __xfs_get_blocks(
 		goto out_unlock;
 	}
 
+	if (IS_DAX(inode) && create) {
+		ASSERT(!ISUNWRITTEN(&imap));
+		/* zeroing is not needed at a higher layer */
+		new = 0;
+	}
+
 	/* trim mapping down to size requested */
 	if (direct || size > (1 << inode->i_blkbits))
 		xfs_map_trim_size(inode, iblock, bh_result,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index c3cb5a5..f4f5b43 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -132,6 +132,7 @@ xfs_iomap_write_direct(
 	int		committed;
 	int		error;
 	int		lockmode;
+	int		bmapi_flags = XFS_BMAPI_PREALLOC;
 
 	rt = XFS_IS_REALTIME_INODE(ip);
 	extsz = xfs_get_extsz_hint(ip);
@@ -195,6 +196,23 @@ xfs_iomap_write_direct(
 	 * Allocate and setup the transaction
 	 */
 	tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
+
+	/*
+	 * For DAX, we do not allocate unwritten extents, but instead we zero
+	 * the block before we commit the transaction.  Ideally we'd like to do
+	 * this outside the transaction context, but if we commit and then crash
+	 * we may not have zeroed the blocks and this will be exposed on
+	 * recovery of the allocation. Hence we must zero before commit.
+	 * Further, if we are mapping unwritten extents here, we need to zero
+	 * and convert them to written so that we don't need an unwritten extent
+	 * callback for DAX. This also means that we need to be able to dip into
+	 * the reserve block pool if there is no space left but we need to do
+	 * unwritten extent conversion.
+	 */
+	if (IS_DAX(VFS_I(ip))) {
+		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
+		tp->t_flags |= XFS_TRANS_RESERVE;
+	}
 	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write,
 				  resblks, resrtextents);
 	/*
@@ -221,7 +239,7 @@ xfs_iomap_write_direct(
 	xfs_bmap_init(&free_list, &firstfsb);
 	nimaps = 1;
 	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
-				XFS_BMAPI_PREALLOC, &firstfsb, resblks, imap,
+				bmapi_flags, &firstfsb, resblks, imap,
 				&nimaps, &free_list);
 	if (error)
 		goto out_bmap_cancel;
@@ -232,6 +250,7 @@ xfs_iomap_write_direct(
 	error = xfs_bmap_finish(&tp, &free_list, &committed);
 	if (error)
 		goto out_bmap_cancel;
+
 	error = xfs_trans_commit(tp);
 	if (error)
 		goto out_unlock;
-- 
2.5.0


* [PATCH 4/6] xfs: DAX does not use IO completion callbacks
  2015-10-19  3:27 [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Dave Chinner
                   ` (2 preceding siblings ...)
  2015-10-19  3:27 ` [PATCH 3/6] xfs: Don't use unwritten extents for DAX Dave Chinner
@ 2015-10-19  3:27 ` Dave Chinner
  2015-10-29 14:29   ` Brian Foster
  2015-10-19  3:27 ` [PATCH 5/6] xfs: add ->pfn_mkwrite support for DAX Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2015-10-19  3:27 UTC (permalink / raw)
  To: xfs; +Cc: ross.zwisler, jack

From: Dave Chinner <dchinner@redhat.com>

For DAX, we are now doing block zeroing and updating the file size
during allocation. This means we no longer need an IO completion
callback to do these things, so remove
the completion callbacks from the __dax_fault and __dax_mkwrite
calls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 39 ---------------------------------------
 fs/xfs/xfs_aops.h |  1 -
 fs/xfs/xfs_file.c |  5 ++---
 3 files changed, 2 insertions(+), 43 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 7b4f849..29e7e5d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1666,45 +1666,6 @@ xfs_end_io_direct_write(
 	__xfs_end_io_direct_write(inode, ioend, offset, size);
 }
 
-/*
- * For DAX we need a mapping buffer callback for unwritten extent conversion
- * when page faults allocate blocks and then zero them. Note that in this
- * case the mapping indicated by the ioend may extend beyond EOF. We most
- * definitely do not want to extend EOF here, so we trim back the ioend size to
- * EOF.
- */
-#ifdef CONFIG_FS_DAX
-void
-xfs_end_io_dax_write(
-	struct buffer_head	*bh,
-	int			uptodate)
-{
-	struct xfs_ioend	*ioend = bh->b_private;
-	struct inode		*inode = ioend->io_inode;
-	ssize_t			size = ioend->io_size;
-
-	ASSERT(IS_DAX(ioend->io_inode));
-
-	/* if there was an error zeroing, then don't convert it */
-	if (!uptodate)
-		ioend->io_error = -EIO;
-
-	/*
-	 * Trim update to EOF, so we don't extend EOF during unwritten extent
-	 * conversion of partial EOF blocks.
-	 */
-	spin_lock(&XFS_I(inode)->i_flags_lock);
-	if (ioend->io_offset + size > i_size_read(inode))
-		size = i_size_read(inode) - ioend->io_offset;
-	spin_unlock(&XFS_I(inode)->i_flags_lock);
-
-	__xfs_end_io_direct_write(inode, ioend, ioend->io_offset, size);
-
-}
-#else
-void xfs_end_io_dax_write(struct buffer_head *bh, int uptodate) { }
-#endif
-
 static inline ssize_t
 xfs_vm_do_dio(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index d39ba25..f6ffc9a 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -60,7 +60,6 @@ int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
 			      struct buffer_head *map_bh, int create);
 int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
 			         struct buffer_head *map_bh, int create);
-void	xfs_end_io_dax_write(struct buffer_head *bh, int uptodate);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7f873bc..403151a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1508,8 +1508,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault,
-				    xfs_end_io_dax_write);
+		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
 	} else {
 		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
 		ret = block_page_mkwrite_return(ret);
@@ -1571,7 +1570,7 @@ xfs_filemap_pmd_fault(
 	file_update_time(vma->vm_file);
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
-				    xfs_end_io_dax_write);
+			      NULL);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
 
-- 
2.5.0


* [PATCH 5/6] xfs: add ->pfn_mkwrite support for DAX
  2015-10-19  3:27 [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Dave Chinner
                   ` (3 preceding siblings ...)
  2015-10-19  3:27 ` [PATCH 4/6] xfs: DAX does not use IO completion callbacks Dave Chinner
@ 2015-10-19  3:27 ` Dave Chinner
  2015-10-29 14:30   ` Brian Foster
  2015-10-19  3:27 ` [PATCH 6/6] xfs: xfs_filemap_pmd_fault treats read faults as write faults Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2015-10-19  3:27 UTC (permalink / raw)
  To: xfs; +Cc: ross.zwisler, jack

From: Dave Chinner <dchinner@redhat.com>

->pfn_mkwrite support is needed so that when a page with allocated
backing store takes a write fault we can check that the fault has
not raced with a truncate and is pointing to a region beyond the
current end of file.

This also allows us to update the timestamp on the inode, too, which
fixes a generic/080 failure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c  | 35 +++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trace.h |  1 +
 2 files changed, 36 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 403151a..e7cf9ec 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1577,11 +1577,46 @@ xfs_filemap_pmd_fault(
 	return ret;
 }
 
+/*
+ * pfn_mkwrite was originally intended to ensure we capture time stamp
+ * updates on write faults. In reality, it's needed to serialise against
+ * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite()
+ * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault
+ * barrier in place.
+ */
+static int
+xfs_filemap_pfn_mkwrite(
+	struct vm_area_struct	*vma,
+	struct vm_fault		*vmf)
+{
+
+	struct inode		*inode = file_inode(vma->vm_file);
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			ret = VM_FAULT_NOPAGE;
+	loff_t			size;
+
+	trace_xfs_filemap_pfn_mkwrite(ip);
+
+	sb_start_pagefault(inode->i_sb);
+	file_update_time(vma->vm_file);
+
+	/* check if the faulting page hasn't raced with truncate */
+	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		ret = VM_FAULT_SIGBUS;
+	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+	sb_end_pagefault(inode->i_sb);
+	return ret;
+
+}
+
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= xfs_filemap_fault,
 	.pmd_fault	= xfs_filemap_pmd_fault,
 	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= xfs_filemap_page_mkwrite,
+	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
 };
 
 STATIC int
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 957f5cc..877079eb 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -689,6 +689,7 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 DEFINE_INODE_EVENT(xfs_filemap_fault);
 DEFINE_INODE_EVENT(xfs_filemap_pmd_fault);
 DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite);
+DEFINE_INODE_EVENT(xfs_filemap_pfn_mkwrite);
 
 DECLARE_EVENT_CLASS(xfs_iref_class,
 	TP_PROTO(struct xfs_inode *ip, unsigned long caller_ip),
-- 
2.5.0


* [PATCH 6/6] xfs: xfs_filemap_pmd_fault treats read faults as write faults
  2015-10-19  3:27 [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Dave Chinner
                   ` (4 preceding siblings ...)
  2015-10-19  3:27 ` [PATCH 5/6] xfs: add ->pfn_mkwrite support for DAX Dave Chinner
@ 2015-10-19  3:27 ` Dave Chinner
  2015-10-29 14:30   ` Brian Foster
  2015-11-05 23:48 ` [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Ross Zwisler
  2015-11-06 18:12 ` Boylston, Brian
  7 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2015-10-19  3:27 UTC (permalink / raw)
  To: xfs; +Cc: ross.zwisler, jack

From: Dave Chinner <dchinner@redhat.com>

The code initially committed didn't have the same checks for write
faults as the dax_pmd_fault code and hence treats all faults as
write faults. We can get read faults through this path because they
is no pmd_mkwrite path for write faults similar to the normal page
fault path. Hence we need to ensure that we only do c/mtime updates
on write faults, and freeze protection is unnecessary for read
faults.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e7cf9ec..0045b0a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1482,7 +1482,7 @@ xfs_file_llseek(
  *
  * mmap_sem (MM)
  *   sb_start_pagefault(vfs, freeze)
- *     i_mmap_lock (XFS - truncate serialisation)
+ *     i_mmaplock (XFS - truncate serialisation)
  *       page_lock (MM)
  *         i_lock (XFS - extent map serialisation)
  */
@@ -1550,6 +1550,13 @@ xfs_filemap_fault(
 	return ret;
 }
 
+/*
+ * Similar to xfs_filemap_fault(), the DAX fault path can call into here on
+ * both read and write faults. Hence we need to handle both cases. There is no
+ * ->pmd_mkwrite callout for huge pages, so we have a single function here to
+ * handle both cases. @flags carries the information on the type of fault
+ * occurring.
+ */
 STATIC int
 xfs_filemap_pmd_fault(
 	struct vm_area_struct	*vma,
@@ -1566,13 +1573,18 @@ xfs_filemap_pmd_fault(
 
 	trace_xfs_filemap_pmd_fault(ip);
 
-	sb_start_pagefault(inode->i_sb);
-	file_update_time(vma->vm_file);
+	if (flags & FAULT_FLAG_WRITE) {
+		sb_start_pagefault(inode->i_sb);
+		file_update_time(vma->vm_file);
+	}
+
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
 			      NULL);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	sb_end_pagefault(inode->i_sb);
+
+	if (flags & FAULT_FLAG_WRITE)
+		sb_end_pagefault(inode->i_sb);
 
 	return ret;
 }
-- 
2.5.0


* Re: [PATCH 1/6] xfs: fix inode size update overflow in xfs_map_direct()
  2015-10-19  3:27 ` [PATCH 1/6] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
@ 2015-10-29 14:27   ` Brian Foster
  0 siblings, 0 replies; 47+ messages in thread
From: Brian Foster @ 2015-10-29 14:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ross.zwisler, jack, xfs

On Mon, Oct 19, 2015 at 02:27:13PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Both direct IO and DAX pass an offset and count into get_blocks that
> will overflow a s64 variable when an IO goes into the last supported
> block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen
> from the tracing:
> 
> xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096
> xfs_gbmap_direct:     [...] offset 0x7ffffffffffff000 count 4096
> xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096
> 
> 0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that
> overflows the s64 offset and we fail to detect the need for a
> filesize update and an ioend is not allocated.
> 
> This is *mostly* avoided for direct IO because such extending IOs
> occur with full block allocation, and so the "IS_UNWRITTEN()" check
> still evaluates as true and we get an ioend that way. However, doing
> single sector extending IOs to this last block will expose the fact
> that file size updates will not occur after the first allocating
> direct IO as the overflow will then be exposed.
> 
> There is one further complexity: the DAX page fault path also
> exposes the same issue in block allocation. However, page faults
> cannot extend the file size, so in this case we want to allocate the
> block but do not want to allocate an ioend to enable file size
> update at IO completion. Hence we now need to distinguish between
> the direct IO path allocation and the DAX fault path allocation to
> avoid leaking ioend structures.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_aops.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
>  fs/xfs/xfs_aops.h |  2 ++
>  fs/xfs/xfs_file.c |  6 +++---
>  3 files changed, 49 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e4fff58..366e41eb 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1259,13 +1259,28 @@ xfs_vm_releasepage(
>   * the DIO. There is only going to be one reference to the ioend and its life
>   * cycle is constrained by the DIO completion code. hence we don't need
>   * reference counting here.
> + *
> + * Note that for DIO, an IO to the highest supported file block offset (i.e.
> + * 2^63 - 1FSB bytes) will result in the offset + count overflowing a signed 64
> + * bit variable. Hence if we see this overflow, we have to assume that the IO is
> + * extending the file size. We won't know for sure until IO completion is run
> + * and the actual max write offset is communicated to the IO completion
> + * routine.
> + *
> + * For DAX page faults, we are preparing to never see unwritten extents here,
> + * nor should we ever extend the inode size. Hence we will soon have nothing to
> + * do here for this case, ensuring we don't have to provide an IO completion
> + * callback to free an ioend that we don't actually need for a fault into the
> + * page at offset (2^63 - 1FSB) bytes.
>   */
> +
>  static void
>  xfs_map_direct(
>  	struct inode		*inode,
>  	struct buffer_head	*bh_result,
>  	struct xfs_bmbt_irec	*imap,
> -	xfs_off_t		offset)
> +	xfs_off_t		offset,
> +	bool			dax_fault)
>  {
>  	struct xfs_ioend	*ioend;
>  	xfs_off_t		size = bh_result->b_size;
> @@ -1278,6 +1293,16 @@ xfs_map_direct(
>  
>  	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
>  
> +	/* XXX: preparation for removing unwritten extents in DAX */
> +#if 0
> +	if (dax_fault) {
> +		ASSERT(type == XFS_IO_OVERWRITE);
> +		trace_xfs_gbmap_direct_none(XFS_I(inode), offset, size, type,
> +					    imap);
> +		return;
> +	}
> +#endif
> +
>  	if (bh_result->b_private) {
>  		ioend = bh_result->b_private;
>  		ASSERT(ioend->io_size > 0);
> @@ -1292,7 +1317,8 @@ xfs_map_direct(
>  					      ioend->io_size, ioend->io_type,
>  					      imap);
>  	} else if (type == XFS_IO_UNWRITTEN ||
> -		   offset + size > i_size_read(inode)) {
> +		   offset + size > i_size_read(inode) ||
> +		   offset + size < 0) {
>  		ioend = xfs_alloc_ioend(inode, type);
>  		ioend->io_offset = offset;
>  		ioend->io_size = size;
> @@ -1354,7 +1380,8 @@ __xfs_get_blocks(
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	bool			direct)
> +	bool			direct,
> +	bool			dax_fault)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1467,7 +1494,8 @@ __xfs_get_blocks(
>  			set_buffer_unwritten(bh_result);
>  		/* direct IO needs special help */
>  		if (create && direct)
> -			xfs_map_direct(inode, bh_result, &imap, offset);
> +			xfs_map_direct(inode, bh_result, &imap, offset,
> +				       dax_fault);
>  	}
>  
>  	/*
> @@ -1514,7 +1542,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
>  }
>  
>  int
> @@ -1524,7 +1552,17 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> +}
> +
> +int
> +xfs_get_blocks_dax_fault(
> +	struct inode		*inode,
> +	sector_t		iblock,
> +	struct buffer_head	*bh_result,
> +	int			create)
> +{
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
>  }
>  
>  static void
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index 86afd1a..d39ba25 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -58,6 +58,8 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
>  		       struct buffer_head *map_bh, int create);
>  int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
>  			      struct buffer_head *map_bh, int create);
> +int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
> +			         struct buffer_head *map_bh, int create);
>  void	xfs_end_io_dax_write(struct buffer_head *bh, int uptodate);
>  
>  extern void xfs_count_page_state(struct page *, int *, int *);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 2f7b6bd..7f873bc 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1508,7 +1508,7 @@ xfs_filemap_page_mkwrite(
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (IS_DAX(inode)) {
> -		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_direct,
> +		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault,
>  				    xfs_end_io_dax_write);
>  	} else {
>  		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> @@ -1543,7 +1543,7 @@ xfs_filemap_fault(
>  		 * changes to xfs_get_blocks_direct() to map unwritten extent
>  		 * ioend for conversion on read-only mappings.
>  		 */
> -		ret = __dax_fault(vma, vmf, xfs_get_blocks_direct, NULL);
> +		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
>  	} else
>  		ret = filemap_fault(vma, vmf);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> @@ -1570,7 +1570,7 @@ xfs_filemap_pmd_fault(
>  	sb_start_pagefault(inode->i_sb);
>  	file_update_time(vma->vm_file);
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_direct,
> +	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
>  				    xfs_end_io_dax_write);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  	sb_end_pagefault(inode->i_sb);
> -- 
> 2.5.0
> 

* Re: [PATCH 2/6] xfs: introduce BMAPI_ZERO for allocating zeroed extents
  2015-10-19  3:27 ` [PATCH 2/6] xfs: introduce BMAPI_ZERO for allocating zeroed extents Dave Chinner
@ 2015-10-29 14:27   ` Brian Foster
  2015-10-29 23:35     ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2015-10-29 14:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ross.zwisler, jack, xfs

On Mon, Oct 19, 2015 at 02:27:14PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To enable DAX to do atomic allocation of zeroed extents, we need to
> drive the block zeroing deep into the allocator. Because
> xfs_bmapi_write() can return merged extents on allocation that were
> only partially allocated (i.e. requested range spans allocated and
> hole regions, allocation into the hole was contiguous), we cannot
> zero the extent returned from xfs_bmapi_write() as that can
> overwrite existing data with zeros.
> 
> Hence we have to drive the extent zeroing into the allocation code,
> prior to where we merge the extents into the BMBT and return the
> resultant map. This means we need to propagate this need down to
> the xfs_alloc_vextent() and issue the block zeroing at this point.
> 
> While this functionality is being introduced for DAX, there is no
> reason why it is specific to DAX - we can pre-zero blocks during the
> allocation transaction on any type of device. It's just slow (and
> usually slower than unwritten allocation and conversion) on
> traditional block devices so doesn't tend to get used. We can,
> however, hook hardware zeroing optimisations via sb_issue_zeroout()
> to this operation, so it may be useful in future and hence the
> "allocate zeroed blocks" API needs to be implementation neutral.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 10 +++++++++-
>  fs/xfs/libxfs/xfs_alloc.h |  8 +++++---
>  fs/xfs/libxfs/xfs_bmap.c  | 25 +++++++++++++++++++++++--
>  fs/xfs/libxfs/xfs_bmap.h  | 13 +++++++++++--
>  fs/xfs/xfs_bmap_util.c    | 36 ++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_mount.h        |  3 +++
>  6 files changed, 87 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index e926197..3479294 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2509,7 +2509,7 @@ xfs_alloc_vextent(
>  		 * Try near allocation first, then anywhere-in-ag after
>  		 * the first a.g. fails.
>  		 */
> -		if ((args->userdata  == XFS_ALLOC_INITIAL_USER_DATA) &&
> +		if ((args->userdata & XFS_ALLOC_INITIAL_USER_DATA) &&
>  		    (mp->m_flags & XFS_MOUNT_32BITINODES)) {
>  			args->fsbno = XFS_AGB_TO_FSB(mp,
>  					((mp->m_agfrotor / rotorstep) %
> @@ -2640,6 +2640,14 @@ xfs_alloc_vextent(
>  		XFS_AG_CHECK_DADDR(mp, XFS_FSB_TO_DADDR(mp, args->fsbno),
>  			args->len);
>  #endif
> +
> +		/* Zero the extent if we were asked to do so */
> +		if (args->userdata & XFS_ALLOC_USERDATA_ZERO) {
> +			error = xfs_zero_extent(args->ip, args->fsbno, args->len);
> +			if (error)
> +				goto error0;
> +		}
> +

Ok, so we wire up zeroing to the actual block allocation here...

>  	}
>  	xfs_perag_put(args->pag);
>  	return 0;
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index ca1c816..0ecde4d 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -101,6 +101,7 @@ typedef struct xfs_alloc_arg {
>  	struct xfs_mount *mp;		/* file system mount point */
>  	struct xfs_buf	*agbp;		/* buffer for a.g. freelist header */
>  	struct xfs_perag *pag;		/* per-ag struct for this agno */
> +	struct xfs_inode *ip;		/* for userdata zeroing method */
>  	xfs_fsblock_t	fsbno;		/* file system block number */
>  	xfs_agnumber_t	agno;		/* allocation group number */
>  	xfs_agblock_t	agbno;		/* allocation group-relative block # */
> @@ -120,15 +121,16 @@ typedef struct xfs_alloc_arg {
>  	char		wasdel;		/* set if allocation was prev delayed */
>  	char		wasfromfl;	/* set if allocation is from freelist */
>  	char		isfl;		/* set if is freelist blocks - !acctg */
> -	char		userdata;	/* set if this is user data */
> +	char		userdata;	/* mask defining userdata treatment */
>  	xfs_fsblock_t	firstblock;	/* io first block allocated */
>  } xfs_alloc_arg_t;
>  
>  /*
>   * Defines for userdata
>   */
> -#define XFS_ALLOC_USERDATA		1	/* allocation is for user data*/
> -#define XFS_ALLOC_INITIAL_USER_DATA	2	/* special case start of file */
> +#define XFS_ALLOC_USERDATA		(1 << 0)/* allocation is for user data*/
> +#define XFS_ALLOC_INITIAL_USER_DATA	(1 << 1)/* special case start of file */
> +#define XFS_ALLOC_USERDATA_ZERO		(1 << 2)/* zero extent on allocation */
>  
>  xfs_extlen_t xfs_alloc_longest_free_extent(struct xfs_mount *mp,
>  		struct xfs_perag *pag, xfs_extlen_t need);
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index ab92d10..590cbec 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3802,8 +3802,13 @@ xfs_bmap_btalloc(
>  	args.wasdel = ap->wasdel;
>  	args.isfl = 0;
>  	args.userdata = ap->userdata;
> -	if ((error = xfs_alloc_vextent(&args)))
> +	if (ap->userdata & XFS_ALLOC_USERDATA_ZERO)
> +		args.ip = ap->ip;
> +
> +	error = xfs_alloc_vextent(&args);
> +	if (error)
>  		return error;
> +
>  	if (tryagain && args.fsbno == NULLFSBLOCK) {
>  		/*
>  		 * Exact allocation failed. Now try with alignment
> @@ -4302,11 +4307,14 @@ xfs_bmapi_allocate(
>  
>  	/*
>  	 * Indicate if this is the first user data in the file, or just any
> -	 * user data.
> +	 * user data. And if it is userdata, indicate whether it needs to
> +	 * be initialised to zero during allocation.
>  	 */
>  	if (!(bma->flags & XFS_BMAPI_METADATA)) {
>  		bma->userdata = (bma->offset == 0) ?
>  			XFS_ALLOC_INITIAL_USER_DATA : XFS_ALLOC_USERDATA;
> +		if (bma->flags & XFS_BMAPI_ZERO)
> +			bma->userdata |= XFS_ALLOC_USERDATA_ZERO;
>  	}
>  
>  	bma->minlen = (bma->flags & XFS_BMAPI_CONTIG) ? bma->length : 1;
> @@ -4421,6 +4429,17 @@ xfs_bmapi_convert_unwritten(
>  	mval->br_state = (mval->br_state == XFS_EXT_UNWRITTEN)
>  				? XFS_EXT_NORM : XFS_EXT_UNWRITTEN;
>  
> +	/*
> +	 * Before insertion into the bmbt, zero the range being converted
> +	 * if required.
> +	 */
> +	if (flags & XFS_BMAPI_ZERO) {
> +		error = xfs_zero_extent(bma->ip, mval->br_startblock,
> +					mval->br_blockcount);
> +		if (error)
> +			return error;
> +	}
> +

... and also zero the extent on unwritten conversion, if necessary.

I suspect there's no use case to pass XFS_BMAPI_ZERO for new unwritten
allocations, but if the flag is passed, doesn't this cause duplicate
block zeroing? Perhaps we should drop the zero flag from 'flags' after
allocation in xfs_bmapi_write() just to ensure this executes in one
place or the other..?
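
An (untested) sketch of what I mean, in the xfs_bmapi_write() loop
just after the xfs_bmapi_allocate() call:

	error = xfs_bmapi_allocate(&bma);
	if (error)
		goto error0;
	/* allocation has already zeroed these blocks, so don't zero
	 * them again when converting unwritten extents below */
	flags &= ~XFS_BMAPI_ZERO;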

>  	error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
>  			&bma->cur, mval, bma->firstblock, bma->flist,
>  			&tmp_logflags);
> @@ -4513,6 +4532,8 @@ xfs_bmapi_write(
>  	ASSERT(len > 0);
>  	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
> +			(XFS_BMAPI_METADATA | XFS_BMAPI_ZERO));
>  
>  	if (unlikely(XFS_TEST_ERROR(
>  	    (XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 6aaa0c1..a160f8a 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -52,9 +52,9 @@ struct xfs_bmalloca {
>  	xfs_extlen_t		minleft; /* amount must be left after alloc */
>  	bool			eof;	/* set if allocating past last extent */
>  	bool			wasdel;	/* replacing a delayed allocation */
> -	bool			userdata;/* set if is user data */
>  	bool			aeof;	/* allocated space at eof */
>  	bool			conv;	/* overwriting unwritten extents */
> +	char			userdata;/* userdata mask */
>  	int			flags;
>  };
>  
> @@ -109,6 +109,14 @@ typedef	struct xfs_bmap_free
>   */
>  #define XFS_BMAPI_CONVERT	0x040
>  
> +/*
> + * allocate zeroed extents - this requires all newly allocated user data extents
> + * to be initialised to zero. It will be ignored if XFS_BMAPI_METADATA is set.
> + * Use in conjunction with XFS_BMAPI_CONVERT to convert unwritten extents found
> + * during the allocation range to zeroed written extents.
> + */
> +#define XFS_BMAPI_ZERO		0x080
> +
>  #define XFS_BMAPI_FLAGS \
>  	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
>  	{ XFS_BMAPI_METADATA,	"METADATA" }, \
> @@ -116,7 +124,8 @@ typedef	struct xfs_bmap_free
>  	{ XFS_BMAPI_PREALLOC,	"PREALLOC" }, \
>  	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
>  	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
> -	{ XFS_BMAPI_CONVERT,	"CONVERT" }
> +	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
> +	{ XFS_BMAPI_ZERO,	"ZERO" }
>  
>  
>  static inline int xfs_bmapi_aflag(int w)
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index eca325e..dbae649 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -57,6 +57,35 @@ xfs_fsb_to_db(struct xfs_inode *ip, xfs_fsblock_t fsb)
>  }
>  
>  /*
> + * Routine to zero an extent on disk allocated to the specific inode.
> + *
> + * The VFS functions take a linearised filesystem block offset, so we have to
> + * convert the sparse xfs fsb to the right format first.
> + * VFS types are real funky, too.
> + */
> +int
> +xfs_zero_extent(
> +	struct xfs_inode *ip,
> +	xfs_fsblock_t	start_fsb,
> +	xfs_off_t	count_fsb)
> +{
> +	struct xfs_mount *mp = ip->i_mount;
> +	xfs_daddr_t	sector = xfs_fsb_to_db(ip, start_fsb);
> +	sector_t	block = XFS_BB_TO_FSBT(mp, sector);
> +	ssize_t		size = XFS_FSB_TO_B(mp, count_fsb);
> +
> +	if (IS_DAX(VFS_I(ip)))
> +		return dax_clear_blocks(VFS_I(ip), block, size);
> +
> +	/*
> +	 * let the block layer decide on the fastest method of
> +	 * implementing the zeroing.
> +	 */
> +	return sb_issue_zeroout(mp->m_super, block, count_fsb, GFP_NOFS);

The count param to sb_issue_zeroout() is a sector_t and we're passing an
FSB.

Brian

> +
> +}
> +
> +/*
>   * Routine to be called at transaction's end by xfs_bmapi, xfs_bunmapi
>   * caller.  Frees all the extents that need freeing, which must be done
>   * last due to locking considerations.  We never free any extents in
> @@ -229,6 +258,13 @@ xfs_bmap_rtalloc(
>  		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
>  			ap->wasdel ? XFS_TRANS_DQ_DELRTBCOUNT :
>  					XFS_TRANS_DQ_RTBCOUNT, (long) ralen);
> +
> +		/* Zero the extent if we were asked to do so */
> +		if (ap->userdata & XFS_ALLOC_USERDATA_ZERO) {
> +			error = xfs_zero_extent(ap->ip, ap->blkno, ap->length);
> +			if (error)
> +				return error;
> +		}
>  	} else {
>  		ap->length = 0;
>  	}
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 8795272..f20e5de 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -337,4 +337,7 @@ extern int	xfs_dev_is_read_only(struct xfs_mount *, char *);
>  
>  extern void	xfs_set_low_space_thresholds(struct xfs_mount *);
>  
> +int	xfs_zero_extent(struct xfs_inode *ip, xfs_fsblock_t start_fsb,
> +			xfs_off_t count_fsb);
> +
>  #endif	/* __XFS_MOUNT_H__ */
> -- 
> 2.5.0
> 

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-10-19  3:27 ` [PATCH 3/6] xfs: Don't use unwritten extents for DAX Dave Chinner
@ 2015-10-29 14:29   ` Brian Foster
  2015-10-29 23:37     ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2015-10-29 14:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ross.zwisler, jack, xfs

On Mon, Oct 19, 2015 at 02:27:15PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> DAX has a page fault serialisation problem with block allocation.
> Because it allows concurrent page faults and does not have a page
> lock to serialise faults to the same page, it can get two concurrent
> faults to the page that race.
> 
> When two read faults race, this isn't a huge problem as the data
> underlying the page is not changing and so "detect and drop" works
> just fine. The issues are to do with write faults.
> 
> When two write faults occur, we serialise block allocation in
> get_blocks() so only one fault will allocate the extent. It will,
> however, be marked as an unwritten extent, and that is where the
> problem lies - the DAX fault code cannot differentiate between a
> block that was just allocated and a block that was preallocated and
> needs zeroing. The result is that both write faults end up zeroing
> the block and attempting to convert it back to written.
> 
> The problem is that the first fault can zero and convert before the
> second fault starts zeroing, resulting in the zeroing for the second
> fault overwriting the data that the first fault wrote with zeros.
> The second fault then attempts to convert the unwritten extent,
> which is then a no-op because it's already written. Data loss occurs
> as a result of this race.
> 
> Because there is no sane locking construct in the page fault code
> that we can use for serialisation across the page faults, we need to
> ensure block allocation and zeroing occurs atomically in the
> filesystem. This means we can still take concurrent page faults and
> the only time they will serialise is in the filesystem
> mapping/allocation callback. The page fault code will always see
> written, initialised extents, so we will be able to remove the
> unwritten extent handling from the DAX code when all filesystems are
> converted.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/dax.c           |  5 +++++
>  fs/xfs/xfs_aops.c  | 13 +++++++++----
>  fs/xfs/xfs_iomap.c | 21 ++++++++++++++++++++-
>  3 files changed, 34 insertions(+), 5 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index c3cb5a5..f4f5b43 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
...
>  	tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
> +
> +	/*
> +	 * For DAX, we do not allocate unwritten extents, but instead we zero
> +	 * the block before we commit the transaction.  Ideally we'd like to do
> +	 * this outside the transaction context, but if we commit and then crash
> +	 * we may not have zeroed the blocks and this will be exposed on
> +	 * recovery of the allocation. Hence we must zero before commit.
> +	 * Further, if we are mapping unwritten extents here, we need to zero
> +	 * and convert them to written so that we don't need an unwritten extent
> +	 * callback for DAX. This also means that we need to be able to dip into
> +	 * the reserve block pool if there is no space left but we need to do
> +	 * unwritten extent conversion.
> +	 */
> +	if (IS_DAX(VFS_I(ip))) {
> +		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
> +		tp->t_flags |= XFS_TRANS_RESERVE;
> +	}

Am I following the commit log description correctly in that block
zeroing is only required for DAX faults? Do we zero blocks for DAX DIO
as well to be consistent, or is that also required (because it looks
like we still have end_io completion for dio writes anyways)?

Brian

>  	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write,
>  				  resblks, resrtextents);
>  	/*
> @@ -221,7 +239,7 @@ xfs_iomap_write_direct(
>  	xfs_bmap_init(&free_list, &firstfsb);
>  	nimaps = 1;
>  	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
> -				XFS_BMAPI_PREALLOC, &firstfsb, resblks, imap,
> +				bmapi_flags, &firstfsb, resblks, imap,
>  				&nimaps, &free_list);
>  	if (error)
>  		goto out_bmap_cancel;
> @@ -232,6 +250,7 @@ xfs_iomap_write_direct(
>  	error = xfs_bmap_finish(&tp, &free_list, &committed);
>  	if (error)
>  		goto out_bmap_cancel;
> +
>  	error = xfs_trans_commit(tp);
>  	if (error)
>  		goto out_unlock;
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] xfs: DAX does not use IO completion callbacks
  2015-10-19  3:27 ` [PATCH 4/6] xfs: DAX does not use IO completion callbacks Dave Chinner
@ 2015-10-29 14:29   ` Brian Foster
  2015-10-29 23:39     ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2015-10-29 14:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ross.zwisler, jack, xfs

On Mon, Oct 19, 2015 at 02:27:16PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For DAX, we are now doing block zeroing and
> we are updating the file size during allocation. This means we no
> longer need an IO completion callback to do these things, so remove
> the completion callbacks from the __dax_fault and __dax_mkwrite
> calls.
> 

Where do we "update the file size during allocation?"

Brian

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 39 ---------------------------------------
>  fs/xfs/xfs_aops.h |  1 -
>  fs/xfs/xfs_file.c |  5 ++---
>  3 files changed, 2 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 7b4f849..29e7e5d 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1666,45 +1666,6 @@ xfs_end_io_direct_write(
>  	__xfs_end_io_direct_write(inode, ioend, offset, size);
>  }
>  
> -/*
> - * For DAX we need a mapping buffer callback for unwritten extent conversion
> - * when page faults allocate blocks and then zero them. Note that in this
> - * case the mapping indicated by the ioend may extend beyond EOF. We most
> - * definitely do not want to extend EOF here, so we trim back the ioend size to
> - * EOF.
> - */
> -#ifdef CONFIG_FS_DAX
> -void
> -xfs_end_io_dax_write(
> -	struct buffer_head	*bh,
> -	int			uptodate)
> -{
> -	struct xfs_ioend	*ioend = bh->b_private;
> -	struct inode		*inode = ioend->io_inode;
> -	ssize_t			size = ioend->io_size;
> -
> -	ASSERT(IS_DAX(ioend->io_inode));
> -
> -	/* if there was an error zeroing, then don't convert it */
> -	if (!uptodate)
> -		ioend->io_error = -EIO;
> -
> -	/*
> -	 * Trim update to EOF, so we don't extend EOF during unwritten extent
> -	 * conversion of partial EOF blocks.
> -	 */
> -	spin_lock(&XFS_I(inode)->i_flags_lock);
> -	if (ioend->io_offset + size > i_size_read(inode))
> -		size = i_size_read(inode) - ioend->io_offset;
> -	spin_unlock(&XFS_I(inode)->i_flags_lock);
> -
> -	__xfs_end_io_direct_write(inode, ioend, ioend->io_offset, size);
> -
> -}
> -#else
> -void xfs_end_io_dax_write(struct buffer_head *bh, int uptodate) { }
> -#endif
> -
>  static inline ssize_t
>  xfs_vm_do_dio(
>  	struct inode		*inode,
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index d39ba25..f6ffc9a 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -60,7 +60,6 @@ int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
>  			      struct buffer_head *map_bh, int create);
>  int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
>  			         struct buffer_head *map_bh, int create);
> -void	xfs_end_io_dax_write(struct buffer_head *bh, int uptodate);
>  
>  extern void xfs_count_page_state(struct page *, int *, int *);
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7f873bc..403151a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1508,8 +1508,7 @@ xfs_filemap_page_mkwrite(
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (IS_DAX(inode)) {
> -		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault,
> -				    xfs_end_io_dax_write);
> +		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
>  	} else {
>  		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
>  		ret = block_page_mkwrite_return(ret);
> @@ -1571,7 +1570,7 @@ xfs_filemap_pmd_fault(
>  	file_update_time(vma->vm_file);
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
> -				    xfs_end_io_dax_write);
> +			      NULL);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  	sb_end_pagefault(inode->i_sb);
>  
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] xfs: add ->pfn_mkwrite support for DAX
  2015-10-19  3:27 ` [PATCH 5/6] xfs: add ->pfn_mkwrite support for DAX Dave Chinner
@ 2015-10-29 14:30   ` Brian Foster
  0 siblings, 0 replies; 47+ messages in thread
From: Brian Foster @ 2015-10-29 14:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ross.zwisler, jack, xfs

On Mon, Oct 19, 2015 at 02:27:17PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> ->pfn_mkwrite support is needed so that when a page with allocated
> backing store takes a write fault we can check that the fault has
> not raced with a truncate and is pointing to a region beyond the
> current end of file.
> 
> This also allows us to update the timestamp on the inode, too, which
> fixes a generic/080 failure.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c  | 35 +++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trace.h |  1 +
>  2 files changed, 36 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 403151a..e7cf9ec 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1577,11 +1577,46 @@ xfs_filemap_pmd_fault(
>  	return ret;
>  }
>  
> +/*
> + * pfn_mkwrite was originally intended to ensure we capture time stamp
> + * updates on write faults. In reality, it's needed to serialise against
> + * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite()
> + * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault
> + * barrier in place.
> + */
> +static int
> +xfs_filemap_pfn_mkwrite(
> +	struct vm_area_struct	*vma,
> +	struct vm_fault		*vmf)
> +{
> +
> +	struct inode		*inode = file_inode(vma->vm_file);
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	int			ret = VM_FAULT_NOPAGE;
> +	loff_t			size;
> +
> +	trace_xfs_filemap_pfn_mkwrite(ip);
> +
> +	sb_start_pagefault(inode->i_sb);
> +	file_update_time(vma->vm_file);
> +
> +	/* check if the faulting page hasn't raced with truncate */
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (vmf->pgoff >= size)
> +		ret = VM_FAULT_SIGBUS;
> +	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +	sb_end_pagefault(inode->i_sb);
> +	return ret;
> +
> +}
> +
>  static const struct vm_operations_struct xfs_file_vm_ops = {
>  	.fault		= xfs_filemap_fault,
>  	.pmd_fault	= xfs_filemap_pmd_fault,
>  	.map_pages	= filemap_map_pages,
>  	.page_mkwrite	= xfs_filemap_page_mkwrite,
> +	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
>  };
>  
>  STATIC int
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 957f5cc..877079eb 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -689,6 +689,7 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
>  DEFINE_INODE_EVENT(xfs_filemap_fault);
>  DEFINE_INODE_EVENT(xfs_filemap_pmd_fault);
>  DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite);
> +DEFINE_INODE_EVENT(xfs_filemap_pfn_mkwrite);
>  
>  DECLARE_EVENT_CLASS(xfs_iref_class,
>  	TP_PROTO(struct xfs_inode *ip, unsigned long caller_ip),
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/6] xfs: xfs_filemap_pmd_fault treats read faults as write faults
  2015-10-19  3:27 ` [PATCH 6/6] xfs: xfs_filemap_pmd_fault treats read faults as write faults Dave Chinner
@ 2015-10-29 14:30   ` Brian Foster
  0 siblings, 0 replies; 47+ messages in thread
From: Brian Foster @ 2015-10-29 14:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ross.zwisler, jack, xfs

On Mon, Oct 19, 2015 at 02:27:18PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The code initially committed didn't have the same checks for write
> faults as the dax_pmd_fault code and hence treats all faults as
> write faults. We can get read faults through this path because there
> is no pmd_mkwrite path for write faults similar to the normal page
> fault path. Hence we need to ensure that we only do c/mtime updates
> on write faults, and freeze protection is unnecessary for read
> faults.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e7cf9ec..0045b0a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1482,7 +1482,7 @@ xfs_file_llseek(
>   *
>   * mmap_sem (MM)
>   *   sb_start_pagefault(vfs, freeze)
> - *     i_mmap_lock (XFS - truncate serialisation)
> + *     i_mmaplock (XFS - truncate serialisation)
>   *       page_lock (MM)
>   *         i_lock (XFS - extent map serialisation)
>   */
> @@ -1550,6 +1550,13 @@ xfs_filemap_fault(
>  	return ret;
>  }
>  
> +/*
> + * Similar to xfs_filemap_fault(), the DAX fault path can call into here on
> + * both read and write faults. Hence we need to handle both cases. There is no
> + * ->pmd_mkwrite callout for huge pages, so we have a single function here to
> + * handle both cases here. @flags carries the information on the type of fault
> + * occurring.
> + */
>  STATIC int
>  xfs_filemap_pmd_fault(
>  	struct vm_area_struct	*vma,
> @@ -1566,13 +1573,18 @@ xfs_filemap_pmd_fault(
>  
>  	trace_xfs_filemap_pmd_fault(ip);
>  
> -	sb_start_pagefault(inode->i_sb);
> -	file_update_time(vma->vm_file);
> +	if (flags & FAULT_FLAG_WRITE) {
> +		sb_start_pagefault(inode->i_sb);
> +		file_update_time(vma->vm_file);
> +	}
> +
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
>  			      NULL);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -	sb_end_pagefault(inode->i_sb);
> +
> +	if (flags & FAULT_FLAG_WRITE)
> +		sb_end_pagefault(inode->i_sb);
>  
>  	return ret;
>  }
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/6] xfs: introduce BMAPI_ZERO for allocating zeroed extents
  2015-10-29 14:27   ` Brian Foster
@ 2015-10-29 23:35     ` Dave Chinner
  2015-10-30 12:36       ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2015-10-29 23:35 UTC (permalink / raw)
  To: Brian Foster; +Cc: ross.zwisler, jack, xfs

On Thu, Oct 29, 2015 at 10:27:58AM -0400, Brian Foster wrote:
> On Mon, Oct 19, 2015 at 02:27:14PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To enable DAX to do atomic allocation of zeroed extents, we need to
> > drive the block zeroing deep into the allocator. Because
> > xfs_bmapi_write() can return merged extents on allocation that were
> > only partially allocated (i.e. requested range spans allocated and
> > hole regions, allocation into the hole was contiguous), we cannot
> > zero the extent returned from xfs_bmapi_write() as that can
> > overwrite existing data with zeros.
> > 
> > Hence we have to drive the extent zeroing into the allocation code,
> > prior to where we merge the extents into the BMBT and return the
> > resultant map. This means we need to propagate this need down to
> > the xfs_alloc_vextent() and issue the block zeroing at this point.
> > 
> > While this functionality is being introduced for DAX, there is no
> > reason why it is specific to DAX - we can pre-zero blocks during the
> > allocation transaction on any type of device. It's just slow (and
> > usually slower than unwritten allocation and conversion) on
> > traditional block devices so doesn't tend to get used. We can,
> > however, hook hardware zeroing optimisations via sb_issue_zeroout()
> > to this operation, so it may be useful in future and hence the
> > "allocate zeroed blocks" API needs to be implementation neutral.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
....
> >  #endif
> > +
> > +		/* Zero the extent if we were asked to do so */
> > +		if (args->userdata & XFS_ALLOC_USERDATA_ZERO) {
> > +			error = xfs_zero_extent(args->ip, args->fsbno, args->len);
> > +			if (error)
> > +				goto error0;
> > +		}
> > +
> 
> Ok, so we wire up zeroing to the actual block allocation here...

Yes.

> > @@ -4421,6 +4429,17 @@ xfs_bmapi_convert_unwritten(
> >  	mval->br_state = (mval->br_state == XFS_EXT_UNWRITTEN)
> >  				? XFS_EXT_NORM : XFS_EXT_UNWRITTEN;
> >  
> > +	/*
> > +	 * Before insertion into the bmbt, zero the range being converted
> > +	 * if required.
> > +	 */
> > +	if (flags & XFS_BMAPI_ZERO) {
> > +		error = xfs_zero_extent(bma->ip, mval->br_startblock,
> > +					mval->br_blockcount);
> > +		if (error)
> > +			return error;
> > +	}
> > +
> 
> ... and also zero the extent on unwritten conversion, if necessary.

Correct.

> I suspect there's no use case to pass XFS_BMAPI_ZERO for new unwritten
> allocations, but if the flag is passed, doesn't this cause duplicate
> block zeroing?

It probably would, but....

> Perhaps we should drop the zero flag from 'flags' after
> allocation in xfs_bmapi_write() just to ensure this executes in one
> place or the other..?

I think that if we hit this, we're doing something else wrong - why
would we allocate unwritten extents and still need to initialise
them to zero?

> > +	xfs_daddr_t	sector = xfs_fsb_to_db(ip, start_fsb);
> > +	sector_t	block = XFS_BB_TO_FSBT(mp, sector);
> > +	ssize_t		size = XFS_FSB_TO_B(mp, count_fsb);
> > +
> > +	if (IS_DAX(VFS_I(ip)))
> > +		return dax_clear_blocks(VFS_I(ip), block, size);
> > +
> > +	/*
> > +	 * let the block layer decide on the fastest method of
> > +	 * implementing the zeroing.
> > +	 */
> > +	return sb_issue_zeroout(mp->m_super, block, count_fsb, GFP_NOFS);
> 
> The count param to sb_issue_zeroout() is a sector_t and we're passing an
> FSB.

"sector_t" does not mean the function takes parameters in units of
sectors. It's the only variable that you can guarantee will be sized
correctly for the kernel's configured block device capacity
support. Indeed:

static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
                sector_t nr_blocks, gfp_t gfp_mask)
{
        return blkdev_issue_zeroout(sb->s_bdev,
                                    block << (sb->s_blocksize_bits - 9),
                                    nr_blocks << (sb->s_blocksize_bits - 9),
                                    gfp_mask, true);
}

sb_issue_zeroout() takes a block device offset and length in
filesystem block size units and converts them back to sectors to
pass it to the block layer.
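
(As a worked example, assuming a 4k filesystem block size: s_blocksize_bits
is 12, so a zeroout request for block 100 / nr_blocks 8 is shifted up by
12 - 9 = 3 bits and reaches the block layer as sector 800 / nr_sects 64.)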

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-10-29 14:29   ` Brian Foster
@ 2015-10-29 23:37     ` Dave Chinner
  2015-10-30 12:36       ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2015-10-29 23:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: ross.zwisler, jack, xfs

On Thu, Oct 29, 2015 at 10:29:50AM -0400, Brian Foster wrote:
> On Mon, Oct 19, 2015 at 02:27:15PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > DAX has a page fault serialisation problem with block allocation.
> > Because it allows concurrent page faults and does not have a page
> > lock to serialise faults to the same page, it can get two concurrent
> > faults to the page that race.
> > 
> > When two read faults race, this isn't a huge problem as the data
> > underlying the page is not changing and so "detect and drop" works
> > just fine. The issues are to do with write faults.
> > 
> > When two write faults occur, we serialise block allocation in
> > get_blocks() so only one fault will allocate the extent. It will,
> > however, be marked as an unwritten extent, and that is where the
> > problem lies - the DAX fault code cannot differentiate between a
> > block that was just allocated and a block that was preallocated and
> > needs zeroing. The result is that both write faults end up zeroing
> > the block and attempting to convert it back to written.
> > 
> > The problem is that the first fault can zero and convert before the
> > second fault starts zeroing, resulting in the zeroing for the second
> > fault overwriting the data that the first fault wrote with zeros.
> > The second fault then attempts to convert the unwritten extent,
> > which is then a no-op because it's already written. Data loss occurs
> > as a result of this race.
> > 
> > Because there is no sane locking construct in the page fault code
> > that we can use for serialisation across the page faults, we need to
> > ensure block allocation and zeroing occurs atomically in the
> > filesystem. This means we can still take concurrent page faults and
> > the only time they will serialise is in the filesystem
> > mapping/allocation callback. The page fault code will always see
> > written, initialised extents, so we will be able to remove the
> > unwritten extent handling from the DAX code when all filesystems are
> > converted.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/dax.c           |  5 +++++
> >  fs/xfs/xfs_aops.c  | 13 +++++++++----
> >  fs/xfs/xfs_iomap.c | 21 ++++++++++++++++++++-
> >  3 files changed, 34 insertions(+), 5 deletions(-)
> > 
> ...
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index c3cb5a5..f4f5b43 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> ...
> >  	tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
> > +
> > +	/*
> > +	 * For DAX, we do not allocate unwritten extents, but instead we zero
> > +	 * the block before we commit the transaction.  Ideally we'd like to do
> > +	 * this outside the transaction context, but if we commit and then crash
> > +	 * we may not have zeroed the blocks and this will be exposed on
> > +	 * recovery of the allocation. Hence we must zero before commit.
> > +	 * Further, if we are mapping unwritten extents here, we need to zero
> > +	 * and convert them to written so that we don't need an unwritten extent
> > +	 * callback for DAX. This also means that we need to be able to dip into
> > +	 * the reserve block pool if there is no space left but we need to do
> > +	 * unwritten extent conversion.
> > +	 */
> > +	if (IS_DAX(VFS_I(ip))) {
> > +		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
> > +		tp->t_flags |= XFS_TRANS_RESERVE;
> > +	}
> 
> Am I following the commit log description correctly in that block
> zeroing is only required for DAX faults? Do we zero blocks for DAX DIO
> as well to be consistent, or is that also required (because it looks
> like we still have end_io completion for dio writes anyways)?

DAX DIO will do the zeroing rather than using unwritten extents,
too. But we still have DIO IO completion as that needs to do file
size updates.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] xfs: DAX does not use IO completion callbacks
  2015-10-29 14:29   ` Brian Foster
@ 2015-10-29 23:39     ` Dave Chinner
  2015-10-30 12:37       ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2015-10-29 23:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: ross.zwisler, jack, xfs

On Thu, Oct 29, 2015 at 10:29:57AM -0400, Brian Foster wrote:
> On Mon, Oct 19, 2015 at 02:27:16PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > For DAX, we are now doing block zeroing and
> > we are updating the file size during allocation. This means we no
> > longer need an IO completion callback to do these things, so remove
> > the completion callbacks from the __dax_fault and __dax_mkwrite
> > calls.
> > 
> 
> Where do we "update the file size during allocation?"

Stale comment. For page faults, we'll never update the file size
(segv if fault is beyond EOF), and DIO still does IO completion
based file size updates.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/6] xfs: introduce BMAPI_ZERO for allocating zeroed extents
  2015-10-29 23:35     ` Dave Chinner
@ 2015-10-30 12:36       ` Brian Foster
  2015-11-02  1:21         ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2015-10-30 12:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ross.zwisler, jack, xfs

On Fri, Oct 30, 2015 at 10:35:48AM +1100, Dave Chinner wrote:
> On Thu, Oct 29, 2015 at 10:27:58AM -0400, Brian Foster wrote:
> > On Mon, Oct 19, 2015 at 02:27:14PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
...
> 
> > I suspect there's no use case to pass XFS_BMAPI_ZERO for new unwritten
> > allocations, but if the flag is passed, doesn't this cause duplicate
> > block zeroing?
> 
> It probably would, but....
> 
> > Perhaps we should drop the zero flag from 'flags' after
> > allocation in xfs_bmapi_write() just to ensure this executes in one
> > place or the other..?
> 
> I think that if we hit this, we're doing something else wrong - why
> would we allocate unwritten extents and still need to initialise
> them to zero?
> 

No idea, really (as noted above). ;) It just looked like it could be
invoked twice per bmapi call, nothing else I saw prevented it, and it
looks easily avoidable. Maybe somebody down the road decides to turn on
block zeroing unconditionally in the block allocator due to hardware
support or some such. Or maybe we'll never hit the problem. The point is
that this code will inevitably be modified/enhanced down the road and
nobody is going to remember that the zeroing is invoked twice in a
particular prealloc codepath.

If we don't want to mess with the flags, how about an assert somewhere
so it's explicit the bmapi implementation doesn't expect this
combination of flags?

> > > +	xfs_daddr_t	sector = xfs_fsb_to_db(ip, start_fsb);
> > > +	sector_t	block = XFS_BB_TO_FSBT(mp, sector);
> > > +	ssize_t		size = XFS_FSB_TO_B(mp, count_fsb);
> > > +
> > > +	if (IS_DAX(VFS_I(ip)))
> > > +		return dax_clear_blocks(VFS_I(ip), block, size);
> > > +
> > > +	/*
> > > +	 * let the block layer decide on the fastest method of
> > > +	 * implementing the zeroing.
> > > +	 */
> > > +	return sb_issue_zeroout(mp->m_super, block, count_fsb, GFP_NOFS);
> > 
> > The count param to sb_issue_zeroout() is a sector_t and we're passing an
> > FSB.
> 
> "sector_t" does not mean the function takes parameters in units of
> sectors. It's the only variable that you can guarantee will be sized
> correctly to the kernel's configured block device capacity
> support. Indeed:
> 
> static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
>                 sector_t nr_blocks, gfp_t gfp_mask)
> {
>         return blkdev_issue_zeroout(sb->s_bdev,
>                                     block << (sb->s_blocksize_bits - 9),
>                                     nr_blocks << (sb->s_blocksize_bits - 9),
>                                     gfp_mask, true);
> }
> 
> sb_issue_zeroout() takes a block device offset and length in
> filesystem block size units and converts them back to sectors to
> pass it to the block layer.
> 

Ah, I see. I missed the conversion that was being done there via the sb
helper.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-10-29 23:37     ` Dave Chinner
@ 2015-10-30 12:36       ` Brian Foster
  2015-11-02  1:14         ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2015-10-30 12:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ross.zwisler, jack, xfs

On Fri, Oct 30, 2015 at 10:37:56AM +1100, Dave Chinner wrote:
> On Thu, Oct 29, 2015 at 10:29:50AM -0400, Brian Foster wrote:
> > On Mon, Oct 19, 2015 at 02:27:15PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
...
> > > +	/*
> > > +	 * For DAX, we do not allocate unwritten extents, but instead we zero
> > > +	 * the block before we commit the transaction.  Ideally we'd like to do
> > > +	 * this outside the transaction context, but if we commit and then crash
> > > +	 * we may not have zeroed the blocks and this will be exposed on
> > > +	 * recovery of the allocation. Hence we must zero before commit.
> > > +	 * Further, if we are mapping unwritten extents here, we need to zero
> > > +	 * and convert them to written so that we don't need an unwritten extent
> > > +	 * callback for DAX. This also means that we need to be able to dip into
> > > +	 * the reserve block pool if there is no space left but we need to do
> > > +	 * unwritten extent conversion.
> > > +	 */
> > > +	if (IS_DAX(VFS_I(ip))) {
> > > +		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
> > > +		tp->t_flags |= XFS_TRANS_RESERVE;
> > > +	}
> > 
> > Am I following the commit log description correctly in that block
> > zeroing is only required for DAX faults? Do we zero blocks for DAX DIO
> > as well to be consistent, or is that also required (because it looks
> > like we still have end_io completion for dio writes anyways)?
> 
> DAX DIO will do the zeroing rather than using unwritten extents,
> too. But we still have DIO IO completion as that needs to do file
> size updates.
> 

Right, my question is: is the DAX DIO zeroing required to avoid the
races described as the purpose for this patch, or is this just here as a
simplification? In other words, why not do block zeroing only for DAX
faults and not DAX/DIO?

I ask because my understanding is the purpose of this patch is a special
atomic zeroed allocation requirement just for mmap. Unless there is some
special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically
causes a clear_pmem() over every page sized chunk of the target I/O
range for which we already have the data. Perhaps that is fine (for now)
from a performance perspective, but seems unnecessary. Further, we still
have write completion in place which means we can still handle unwritten
conversion just as easily for DAX/DIO as normal DIO.

Thoughts?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] xfs: DAX does not use IO completion callbacks
  2015-10-29 23:39     ` Dave Chinner
@ 2015-10-30 12:37       ` Brian Foster
  0 siblings, 0 replies; 47+ messages in thread
From: Brian Foster @ 2015-10-30 12:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ross.zwisler, jack, xfs

On Fri, Oct 30, 2015 at 10:39:11AM +1100, Dave Chinner wrote:
> On Thu, Oct 29, 2015 at 10:29:57AM -0400, Brian Foster wrote:
> > On Mon, Oct 19, 2015 at 02:27:16PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > For DAX, we are now doing block zeroing and
> > > we are updating the file size during allocation. This means we no
> > > longer need an IO completion callback to do these things, so remove
> > > the completion callbacks from the __dax_fault and __dax_mkwrite
> > > calls.
> > > 
> > 
> > Where do we "update the file size during allocation?"
> 
> Stale comment. For page faults, we'll never update the file size
> (segv if fault is beyond EOF), and DIO still does IO completion
> based file size updates.
> 

Ok. With that fixed up:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-10-30 12:36       ` Brian Foster
@ 2015-11-02  1:14         ` Dave Chinner
  2015-11-02 14:15           ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2015-11-02  1:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: ross.zwisler, jack, xfs

On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote:
> On Fri, Oct 30, 2015 at 10:37:56AM +1100, Dave Chinner wrote:
> > On Thu, Oct 29, 2015 at 10:29:50AM -0400, Brian Foster wrote:
> > > On Mon, Oct 19, 2015 at 02:27:15PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> ...
> > > > +	/*
> > > > +	 * For DAX, we do not allocate unwritten extents, but instead we zero
> > > > +	 * the block before we commit the transaction.  Ideally we'd like to do
> > > > +	 * this outside the transaction context, but if we commit and then crash
> > > > +	 * we may not have zeroed the blocks and this will be exposed on
> > > > +	 * recovery of the allocation. Hence we must zero before commit.
> > > > +	 * Further, if we are mapping unwritten extents here, we need to zero
> > > > +	 * and convert them to written so that we don't need an unwritten extent
> > > > +	 * callback for DAX. This also means that we need to be able to dip into
> > > > +	 * the reserve block pool if there is no space left but we need to do
> > > > +	 * unwritten extent conversion.
> > > > +	 */
> > > > +	if (IS_DAX(VFS_I(ip))) {
> > > > +		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
> > > > +		tp->t_flags |= XFS_TRANS_RESERVE;
> > > > +	}
> > > 
> > > Am I following the commit log description correctly in that block
> > > zeroing is only required for DAX faults? Do we zero blocks for DAX DIO
> > > as well to be consistent, or is that also required (because it looks
> > > like we still have end_io completion for dio writes anyways)?
> > 
> > DAX DIO will do the zeroing rather than using unwritten extents,
> > too. But we still have DIO IO completion as that needs to do file
> > size updates.
> > 
> 
> Right, my question is: is the DAX DIO zeroing required to avoid the
> races described as the purpose for this patch, or is this just here as a
> simplification? In other words, why not do block zeroing only for DAX
> faults and not DAX/DIO?

Because the only reason the DIO code does 'allocate unwritten;
convert unwritten on IO completion' is so that if we have:

	allocate
	trans_commit
	....				log force
					journal IO submit
	....				journal IO completion
	submit data io
	crash

We don't expose allocated blocks containing stale data to userspace
via recovery. The allocation uses unwritten extents to ensure that
if the allocation is recovered without the corresponding completion,
it reads as zeros rather than whatever was previously on disk in that
location.

For DAX, we can zero the blocks inside the allocation transaction
for direct IO, and hence even if we have the above happen, we'll
only ever expose zeros. Hence we don't need unwritten extents in the
DIO path to avoid stale data exposure, and so we can simply avoid
all that extra overhead of unwritten extent conversion on
completion...

> I ask because my understanding is the purpose of this patch is a special
> atomic zeroed allocation requirement just for mmap.

The requirement is set by DAX+mmap; the implementation is a generic
"allocate zeroed blocks" mechanism that can be applied to any
allocation that uses unwritten extents to allocate zeroed blocks if
zeroing is more efficient than using unwritten extents....
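
As a minimal sketch of what the opt-in looks like from a caller's side
(condensed from the xfs_iomap_write_direct() hunk quoted above, so not a
complete function, just the shape of it):

	int bmapi_flags = XFS_BMAPI_PREALLOC;

	if (IS_DAX(VFS_I(ip))) {
		/* zero during allocation instead of allocating unwritten */
		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
		tp->t_flags |= XFS_TRANS_RESERVE;
	}

	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb, bmapi_flags,
				&firstfsb, resblks, imap, &nimaps, &free_list);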

> Unless there is some
> special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically
> causes a clear_pmem() over every page sized chunk of the target I/O
> range for which we already have the data.

I don't follow - this only zeros blocks when we do allocation of new
blocks or overwrite unwritten extents, not on blocks which we
already have written data extents allocated for...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/6] xfs: introduce BMAPI_ZERO for allocating zeroed extents
  2015-10-30 12:36       ` Brian Foster
@ 2015-11-02  1:21         ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2015-11-02  1:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: ross.zwisler, jack, xfs

On Fri, Oct 30, 2015 at 08:36:08AM -0400, Brian Foster wrote:
> On Fri, Oct 30, 2015 at 10:35:48AM +1100, Dave Chinner wrote:
> > On Thu, Oct 29, 2015 at 10:27:58AM -0400, Brian Foster wrote:
> > > On Mon, Oct 19, 2015 at 02:27:14PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> ...
> > 
> > > I suspect there's no use case to pass XFS_BMAPI_ZERO for new unwritten
> > > allocations, but if the flag is passed, doesn't this cause duplicate
> > > block zeroing?
> > 
> > It probably would, but....
> > 
> > > Perhaps we should drop the zero flag from 'flags' after
> > > allocation in xfs_bmapi_write() just to ensure this executes in one
> > > place or the other..?
> > 
> > I think that if we hit this, we're doing something else wrong - why
> > would we allocate unwritten extents and still need to initialise
> > them to zero?
> > 
> 
> No idea, really (as noted above). ;) It just looked like it could be
> invoked twice per bmapi call, nothing else I saw prevented it, and it
> looks easily avoidable. Maybe somebody down the road decides to turn on
> block zeroing unconditionally in the block allocator due to hardware
> support or some such. Or maybe we'll never hit the problem. The point is
> that this code will inevitably be modified/enhanced down the road and
> nobody is going to remember that the zeroing is invoked twice in a
> particular prealloc codepath.
> 
> If we don't want to mess with the flags, how about an assert somewhere
> so it's explicit the bmapi implementation doesn't expect this
> combination of flags?

Easy enough. I'll add this to the initial asserts in xfs_bmapi_write():

+	/*
+	 * we can allocate unwritten extents or pre-zero allocated blocks,
+	 * but it makes no sense to do both at once. This would result in
+	 * zeroing the unwritten extent twice, but it still being an
+	 * unwritten extent....
+	 */
+	ASSERT((flags & (XFS_BMAPI_PREALLOC|XFS_BMAPI_ZERO)) !=
+			(XFS_BMAPI_PREALLOC|XFS_BMAPI_ZERO));

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-11-02  1:14         ` Dave Chinner
@ 2015-11-02 14:15           ` Brian Foster
  2015-11-02 21:44               ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2015-11-02 14:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ross.zwisler, jack, xfs

On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote:
> On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote:
> > On Fri, Oct 30, 2015 at 10:37:56AM +1100, Dave Chinner wrote:
> > > On Thu, Oct 29, 2015 at 10:29:50AM -0400, Brian Foster wrote:
> > > > On Mon, Oct 19, 2015 at 02:27:15PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > ...
> > > > > +	/*
> > > > > +	 * For DAX, we do not allocate unwritten extents, but instead we zero
> > > > > +	 * the block before we commit the transaction.  Ideally we'd like to do
> > > > > +	 * this outside the transaction context, but if we commit and then crash
> > > > > +	 * we may not have zeroed the blocks and this will be exposed on
> > > > > +	 * recovery of the allocation. Hence we must zero before commit.
> > > > > +	 * Further, if we are mapping unwritten extents here, we need to zero
> > > > > +	 * and convert them to written so that we don't need an unwritten extent
> > > > > +	 * callback for DAX. This also means that we need to be able to dip into
> > > > > +	 * the reserve block pool if there is no space left but we need to do
> > > > > +	 * unwritten extent conversion.
> > > > > +	 */
> > > > > +	if (IS_DAX(VFS_I(ip))) {
> > > > > +		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
> > > > > +		tp->t_flags |= XFS_TRANS_RESERVE;
> > > > > +	}
> > > > 
> > > > Am I following the commit log description correctly in that block
> > > > zeroing is only required for DAX faults? Do we zero blocks for DAX DIO
> > > > as well to be consistent, or is that also required (because it looks
> > > > like we still have end_io completion for dio writes anyways)?
> > > 
> > > DAX DIO will do the zeroing rather than using unwritten extents,
> > > too. But we still have DIO IO completion as that needs to do file
> > > size updates.
> > > 
> > 
> > Right, my question is: is the DAX DIO zeroing required to avoid the
> > races described as the purpose for this patch, or is this just here as a
> > simplification? In other words, why not do block zeroing only for DAX
> > faults and not DAX/DIO?
> 
> Because the only reason the DIO code does 'allocate unwritten;
> convert unwritten on IO completion' is so that if we have:
> 
> 	allocate
> 	trans_commit
> 	....				log force
> 					journal IO submit
> 	....				journal IO completion
> 	submit data io
> 	crash
> 
> We don't expose allocated blocks containing stale data to userspace
> via recovery. The allocation uses unwritten extents to ensure that
> if the allocation is recovered without the corresponding completion,
> it reads as zeros rather than whatever was previously on disk in that
> location.
> 
> For DAX, we can zero the blocks inside the allocation transaction
> for direct IO, and hence even if we have the above happen, we'll
> only ever expose zeros. Hence we don't need unwritten extents in the
> DIO path to avoid stale data exposure, and so we can simply avoid
> all that extra overhead of unwritten extent conversion on
> completion...
> 

Yeah, I get that bit. In fact, I was hoping to get to doing something
similar for delalloc extents to deal with the similar issue there if the
extent conversion makes it to the log and we crash before I/O (as I
believe we've discussed in the past). That is for something unrelated,
however...

> > I ask because my understanding is the purpose of this patch is a special
> > atomic zeroed allocation requirement just for mmap.
> 
> The requirement is set by DAX+mmap; the implementation is a generic
> "allocate zeroed blocks" mechanism that can be applied to any
> allocation that uses unwritten extents to allocate zeroed blocks if
> zeroing is more efficient than using unwritten extents....
> 

Ok, I take that to mean that there is no race or corruption vector so
long as DAX/mmap uses the atomic zero allocation. The implementation
mostly looks pretty good to me, but I suspect I'm not being clear with
my question, which is...

> > Unless there is some
> > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically
> > causes a clear_pmem() over every page sized chunk of the target I/O
> > range for which we already have the data.
> 
> I don't follow - this only zeros blocks when we do allocation of new
> blocks or overwrite unwritten extents, not on blocks which we
> already have written data extents allocated for...
> 

Why are we assuming that block zeroing is more efficient than unwritten
extents for DAX/dio? I haven't played with pmem enough to know for sure
one way or another (or if hw support is imminent), but I'd expect the
latter to be more efficient in general without any kind of hardware
support.

Just as an example, here's an 8GB pwrite test, large buffer size, to XFS
on a ramdisk mounted with '-o dax:'

- Before this series:

# xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
wrote 8589934592/8589934592 bytes at offset 0
8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec)

- After this series:

# xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
wrote 8589934592/8589934592 bytes at offset 0
8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec)

The impact is less with a smaller buffer size so the above is just meant
to illustrate the point. FWIW, I'm also fine with getting this in as a
matter of "correctness before performance" since this stuff is clearly
still under development, but as far as I can see so far we should
probably ultimately prefer unwritten extents for DAX/DIO (or at least
plan to run some similar tests on real pmem hw). Thoughts?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-11-02 14:15           ` Brian Foster
  2015-11-02 21:44               ` Dave Chinner
@ 2015-11-02 21:44               ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2015-11-02 21:44 UTC (permalink / raw)
  To: Brian Foster
  Cc: ross.zwisler, jack, xfs, linux-fsdevel, dan.j.williams, linux-nvdimm

[add people to the cc list]

On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote:
> On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote:
> > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote:
> > > Unless there is some
> > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically
> > > causes a clear_pmem() over every page sized chunk of the target I/O
> > > range for which we already have the data.
> > 
> > I don't follow - this only zeros blocks when we do allocation of new
> > blocks or overwrite unwritten extents, not on blocks which we
> > already have written data extents allocated for...
> > 
> 
> Why are we assuming that block zeroing is more efficient than unwritten
> extents for DAX/dio? I haven't played with pmem enough to know for sure
> one way or another (or if hw support is imminent), but I'd expect the
> latter to be more efficient in general without any kind of hardware
> support.
> 
> Just as an example, here's an 8GB pwrite test, large buffer size, to XFS
> on a ramdisk mounted with '-o dax:'
> 
> - Before this series:
> 
> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
> wrote 8589934592/8589934592 bytes at offset 0
> 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec)
> 
> - After this series:
> 
> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
> wrote 8589934592/8589934592 bytes at offset 0
> 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec)

That looks wrong. Much, much slower than it should be just zeroing
pages and then writing to them again while cache hot.

Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this
loop spending most of the CPU time:

       │     ↓ jbe    ea
       │ de:   clflush %ds:(%rax)
 84.67 │       add    %rsi,%rax
       │       cmp    %rax,%rdx
       │     ↑ ja     de
       │ ea:   add    %r13,-0x38(%rbp)
       │       sub    %r12,%r14
       │       sub    %r12,-0x40(%rbp)

That is the overhead of __arch_wb_cache_pmem() i.e. issuing CPU
cache flushes after each memset.
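
For reference, a minimal sketch of that pattern (just an illustration, not
the kernel implementation, and assuming 64-byte cache lines):

	static void wb_cache_range(void *addr, size_t len)
	{
		const unsigned long clsize = 64;	/* assumed cache line size */
		char *p = (char *)((unsigned long)addr & ~(clsize - 1));
		char *end = (char *)addr + len;

		/* flush every cache line covering [addr, addr + len) */
		for (; p < end; p += clsize)
			asm volatile("clflush %0" : "+m" (*p));
	}

Issuing one of those clflushes per cache line of every newly allocated
block is what turns a cache-hot memset into the profile above.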

None of these pmem memory operations are optimised yet - the
implementations are correct, but performance still needs work. The
conversion to non-temporal stores should get rid of this cache flush
overhead (somewhat), but I was still expecting this code to be much
closer to memset speed and not reduce performance to bugger all...

> The impact is less with a smaller buffer size so the above is just meant
> to illustrate the point. FWIW, I'm also fine with getting this in as a
> matter of "correctness before performance" since this stuff is clearly
> still under development, but as far as I can see so far we should
> probably ultimately prefer unwritten extents for DAX/DIO (or at least
> plan to run some similar tests on real pmem hw). Thoughts?

We're going to have these problems initially, but from the XFS
perspective those problems don't matter because we have a different
problem to solve.  That is, we need to move towards an architecture
that is highly efficient for low latency, high IOPS storage
subsystems.  The first step towards that is to be able to offload
block zeroing to the hardware so we can avoid using unwritten
extents.

In the long run, we don't want to use unwritten extents on DAX if we
can avoid it - the CPU overhead of unwritten extent conversion isn't
constant (i.e. it grows with the size of the BMBT) and it isn't
deterministic (e.g. split/merge take much more CPU than a simple
record write to clear an unwritten flag). We don't notice this
overhead much with normal IO because of the fact that the CPU time
for conversion is much less than the CPU time for the IO to
complete, hence it's a win.

But for PMEM, directly zeroing a 4k chunk of data should take *much
less CPU* than synchronously starting a transaction, reserving
space, looking up the extent in the BMBT, loading the buffers from
cache, modifying the buffers, logging the changes and committing the
transaction (which in itself will typically copy more than a single
page of metadata into the CIL buffers).

Realistically, dax_clear_blocks() should probably be implemented at
the pmem driver layer through blkdev_issue_zeroout() because all it
does is directly map the sector/len to pfn via bdev_direct_access()
and then zero it - it's a sector based, block device operation. We
don't actually need a special case path for DAX here. Optimisation
of this operation has little to do with the filesystem.
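
i.e. if blkdev_issue_zeroout() did the right thing for pmem, the helper
from patch 2 quoted earlier could lose its DAX branch entirely. A rough
sketch (not a proposed patch):

	int
	xfs_zero_extent(
		struct xfs_inode	*ip,
		xfs_fsblock_t		start_fsb,
		xfs_off_t		count_fsb)
	{
		struct xfs_mount	*mp = ip->i_mount;
		xfs_daddr_t		sector = xfs_fsb_to_db(ip, start_fsb);
		sector_t		block = XFS_BB_TO_FSBT(mp, sector);

		/* DAX or not, let the block device pick the zeroing method */
		return sb_issue_zeroout(mp->m_super, block, count_fsb, GFP_NOFS);
	}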

This comes back to the comments I made w.r.t. the pmem driver
implementation doing synchronous IO by immediately forcing CPU cache
flushes and barriers. It's obviously correct, but it looks like
there's going to be a major performance penalty associated with it.
This is why I recently suggested that a pmem driver that doesn't do
CPU cache writeback during IO but does it on REQ_FLUSH is an
architecture we'll likely have to support.

In this case, the block device won't need to flush CPU cache lines
for the zeroed blocks until the allocation transaction is actually
committed to the journal. In that case, there's a good chance that
we'd end up also committing the new data as well, hence avoiding two
synchronous memory writes. i.e. the "big hammer" REQ_FLUSH
implementation may well be *much* faster than fine-grained
synchronous "stable on completion" writes for persistent memory.

This, however, is not really a problem for the filesystem - it's
a pmem driver architecture problem. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
@ 2015-11-02 21:44               ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2015-11-02 21:44 UTC (permalink / raw)
  To: Brian Foster
  Cc: ross.zwisler, jack, xfs, linux-fsdevel, dan.j.williams, linux-nvdimm

[add people to the cc list]

On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote:
> On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote:
> > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote:
> > > Unless there is some
> > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically
> > > causes a clear_pmem() over every page sized chunk of the target I/O
> > > range for which we already have the data.
> > 
> > I don't follow - this only zeros blocks when we do allocation of new
> > blocks or overwrite unwritten extents, not on blocks which we
> > already have written data extents allocated for...
> > 
> 
> Why are we assuming that block zeroing is more efficient than unwritten
> extents for DAX/dio? I haven't played with pmem enough to know for sure
> one way or another (or if hw support is imminent), but I'd expect the
> latter to be more efficient in general without any kind of hardware
> support.
> 
> Just as an example, here's an 8GB pwrite test, large buffer size, to XFS
> on a ramdisk mounted with '-o dax:'
> 
> - Before this series:
> 
> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
> wrote 8589934592/8589934592 bytes at offset 0
> 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec)
> 
> - After this series:
> 
> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
> wrote 8589934592/8589934592 bytes at offset 0
> 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec)

That looks wrong. Much, much slower than it should be just zeroing
pages and then writing to them again while cache hot.

Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this
loop spending most of the CPU time:

       ¿     ¿ jbe    ea
       ¿ de:   clflus %ds:(%rax)
 84.67 ¿       add    %rsi,%rax
       ¿       cmp    %rax,%rdx
       ¿     ¿ ja     de
       ¿ ea:   add    %r13,-0x38(%rbp)
       ¿       sub    %r12,%r14
       ¿       sub    %r12,-0x40(%rbp)

That is the overhead of __arch_wb_cache_pmem() i.e. issuing CPU
cache flushes after each memset.

None of these pmem memory operations are optimised yet - the
implementations are correct, but performance still needs work. The
conversion to non-temporal stores should get rid of this cache flush
overhead (somewhat), but I was still expecting this code to be much
closer to memset speed and not reduce performance to bugger all...

> The impact is less with a smaller buffer size so the above is just meant
> to illustrate the point. FWIW, I'm also fine with getting this in as a
> matter of "correctness before performance" since this stuff is clearly
> still under development, but as far as I can see so far we should
> probably ultimately prefer unwritten extents for DAX/DIO (or at least
> plan to run some similar tests on real pmem hw). Thoughts?

We're going to have these problems initially, but from the XFS
persepective those problems don't matter because we have a different
problem to solve.  That is, we need to move towards an architecture
that is highly efficient for low latency, high IOPS storage
subsystems.  The first step towards that is to be able to offload
block zeroing to the hardware so we can avoid using unwritten
extents.

In the long run, we don't want to use unwritten extents on DAX if we
can avoid it - the CPU overhead of unwritten extent conversion isn't
constant (i.e. it grows with the size of the BMBT) and it isn't
deterministic (e.g. split/merge take much more CPU than a simple
record write to clear an unwritten flag). We don't notice this
overhead much with normal IO because of the fact that the CPU time
for conversion is much less than the CPU time for the IO to
complete, hence it's a win.

But for PMEM, directly zeroing a 4k chunk of data should take *much
less CPU* than synchronously starting a transaction, reserving
space, looking up the extent in the BMBT, loading the buffers from
cache, modifying the buffers, logging the changes and committing the
transaction (which in itself will typically copy more than a single
page of metadata into the CIL buffers).

Realistically, dax_clear_blocks() should probably be implemented at
the pmem driver layer through blkdev_issue_zeroout() because all it
does is directly map the sector/len to pfn via bdev_direct_access()
and then zero it - it's a sector based, block device operation. We
don't actually need a special case path for DAX here. Optimisation
of this operation has little to do with the filesystem.
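
A sketch of what that pushed-down zeroing could look like -
pmem_zeroout() is a made-up name and the bdev_direct_access()
prototype is approximate, but clear_pmem() and wmb_pmem() are the
existing pmem helpers:

#include <linux/blkdev.h>
#include <linux/pmem.h>
#include <linux/kernel.h>

/* sketch: zero a sector range by mapping it directly and clearing it */
static int pmem_zeroout(struct block_device *bdev, sector_t sector,
			sector_t nr_sects)
{
	while (nr_sects) {
		void *addr;
		unsigned long pfn;
		long avail;

		avail = bdev_direct_access(bdev, sector, &addr, &pfn,
					   nr_sects << 9);
		if (avail < 0)
			return avail;
		avail = min_t(long, avail, nr_sects << 9);

		clear_pmem(addr, avail);	/* zero straight into pmem */

		sector += avail >> 9;
		nr_sects -= avail >> 9;
	}
	wmb_pmem();				/* final persistence barrier */
	return 0;
}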

This comes back to the comments I made w.r.t. the pmem driver
implementation doing synchronous IO by immediately forcing CPU cache
flushes and barriers. It's obviously correct, but it looks like
there's going to be a major performance penalty associated with it.
This is why I recently suggested that a pmem driver that doesn't do
CPU cache writeback during IO but does it on REQ_FLUSH is an
architecture we'll likely have to support.

In this case, the block device won't need to flush CPU cache lines
for the zeroed blocks until the allocation transaction is actually
committed to the journal. In that case, there's a good chance that
we'd end up also committing the new data as well, hence avoiding two
synchronous memory writes. i.e. the "big hammer" REQ_FLUSH
implementation may well be *much* faster than fine-grained
synchronous "stable on completion" writes for persistent memory.

This, however, is not really a problem for the filesystem - it's
a pmem driver architecture problem. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-11-02 21:44               ` Dave Chinner
@ 2015-11-03  3:53                 ` Dan Williams
  -1 siblings, 0 replies; 47+ messages in thread
From: Dan Williams @ 2015-11-03  3:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Brian Foster, Ross Zwisler, Jan Kara, xfs, linux-fsdevel, linux-nvdimm

On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> [add people to the cc list]
>
> On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote:
>> On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote:
>> > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote:
>> > > Unless there is some
>> > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically
>> > > causes a clear_pmem() over every page sized chunk of the target I/O
>> > > range for which we already have the data.
>> >
>> > I don't follow - this only zeros blocks when we do allocation of new
>> > blocks or overwrite unwritten extents, not on blocks which we
>> > already have written data extents allocated for...
>> >
>>
>> Why are we assuming that block zeroing is more efficient than unwritten
>> extents for DAX/dio? I haven't played with pmem enough to know for sure
>> one way or another (or if hw support is imminent), but I'd expect the
>> latter to be more efficient in general without any kind of hardware
>> support.
>>
>> Just as an example, here's an 8GB pwrite test, large buffer size, to XFS
>> on a ramdisk mounted with '-o dax:'
>>
>> - Before this series:
>>
>> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
>> wrote 8589934592/8589934592 bytes at offset 0
>> 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec)
>>
>> - After this series:
>>
>> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
>> wrote 8589934592/8589934592 bytes at offset 0
>> 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec)
>
> That looks wrong. Much, much slower than it should be just zeroing
> pages and then writing to them again while cache hot.
>
> Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this
> loop spending most of the CPU time:
>
>        │     │ jbe    ea
>        │ de:   clflus %ds:(%rax)
>  84.67 │       add    %rsi,%rax
>        │       cmp    %rax,%rdx
>        │     │ ja     de
>        │ ea:   add    %r13,-0x38(%rbp)
>        │       sub    %r12,%r14
>        │       sub    %r12,-0x40(%rbp)
>
> That is the overhead of __arch_wb_cache_pmem() i.e. issuing CPU
> cache flushes after each memset.

Ideally this would be non-temporal and skip the second flush loop
altogether.  Outside of that another problem is that this cpu does not
support the clwb instruction and is instead using the serializing and
invalidating clflush instruction.


> None of these pmem memory operations are optimised yet - the
> implementations are correct, but performance still needs work. The
> conversion to non-temporal stores should get rid of this cache flush
> overhead (somewhat), but I was still expecting this code to be much
> closer to memset speed and not reduce performance to bugger all...
>
>> The impact is less with a smaller buffer size so the above is just meant
>> to illustrate the point. FWIW, I'm also fine with getting this in as a
>> matter of "correctness before performance" since this stuff is clearly
>> still under development, but as far as I can see so far we should
>> probably ultimately prefer unwritten extents for DAX/DIO (or at least
>> plan to run some similar tests on real pmem hw). Thoughts?
>
> We're going to have these problems initially, but from the XFS
> persepective those problems don't matter because we have a different
> problem to solve.  That is, we need to move towards an architecture
> that is highly efficient for low latency, high IOPS storage
> subsystems.  The first step towards that is to be able to offload
> block zeroing to the hardware so we can avoid using unwritten
> extents.
>
> In the long run, we don't want to use unwritten extents on DAX if we
> can avoid it - the CPU overhead of unwritten extent conversion isn't
> constant (i.e. it grows with the size of the BMBT) and it isn't
> deterministic (e.g. split/merge take much more CPU than a simple
> record write to clear an unwritten flag). We don't notice this
> overhead much with normal IO because of the fact that the CPU time
> for conversion is much less than the CPU time for the IO to
> complete, hence it's a win.
>
> But for PMEM, directly zeroing a 4k chunk of data should take *much
> less CPU* than synchronously starting a transaction, reserving
> space, looking up the extent in the BMBT, loading the buffers from
> cache, modifying the buffers, logging the changes and committing the
> transaction (which in itself will typically copy more than a single
> page of metadata into the CIL buffers).
>
> Realistically, dax_clear_blocks() should probably be implemented at
> the pmem driver layer through blkdev_issue_zeroout() because all it
> does is directly map the sector/len to pfn via bdev_direct_access()
> and then zero it - it's a sector based, block device operation. We
> don't actually need a special case path for DAX here. Optimisation
> of this operation has little to do with the filesystem.
>
> This comes back to the comments I made w.r.t. the pmem driver
> implementation doing synchronous IO by immediately forcing CPU cache
> flushes and barriers. it's obviously correct, but it looks like
> there's going to be a major performance penalty associated with it.
> This is why I recently suggested that a pmem driver that doesn't do
> CPU cache writeback during IO but does it on REQ_FLUSH is an
> architecture we'll likely have to support.
>

The only thing we can realistically delay is wmb_pmem() i.e. the final
sync waiting for data that has *left* the cpu cache.  Unless/until we
get an architecturally guaranteed method to write-back the entire
cache, or flush the cache by physical-cache-way we're stuck with
either non-temporal cycles or looping on potentially huge virtual
address ranges.
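
For illustration, a non-temporal zeroing loop looks something like
the following - a userspace sketch with SSE2 intrinsics, assuming 16
byte alignment, and on real pmem the sfence would be paired with
whatever wmb_pmem() maps to:

#include <immintrin.h>
#include <stddef.h>

/* zero a range with streaming stores that bypass the CPU cache */
static void clear_nt(void *addr, size_t len)
{
	__m128i zero = _mm_setzero_si128();
	char *p = addr, *end = p + len;

	for (; p < end; p += 16)
		_mm_stream_si128((__m128i *)p, zero);	/* movntdq */
	_mm_sfence();		/* make the streaming stores visible */
}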

> In this case, the block device won't need to flush CPU cache lines
> for the zeroed blocks until the allocation transaction is actually
> committed to the journal. In that case, there's a good chance that
> we'd end up also committing the new data as well, hence avoiding two
> synchronous memory writes. i.e. the "big hammer" REQ_FLUSH
> implementation may well be *much* faster than fine-grained
> synchronous "stable on completion" writes for persistent memory.

I can only see it being faster in the case where the flush is cheap to
initiate.   That's not the case yet so we're stuck doing it
synchronously.

>
> This, however, is not really a problem for the filesystem - it's
> a pmem driver architecture problem. ;)
>

It's a platform problem.  Let's see how this looks when not using
clflush instructions.

Also, another benefit of pushing zeroing down into the driver is that
for brd, as used in this example, it will rightly be a nop because
there's no persistence to guarantee there.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-11-03  3:53                 ` Dan Williams
@ 2015-11-03  5:04                   ` Dave Chinner
  -1 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2015-11-03  5:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Brian Foster, Ross Zwisler, Jan Kara, xfs, linux-fsdevel, linux-nvdimm

On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> > [add people to the cc list]
> >
> > On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote:
> >> On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote:
> >> > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote:
> >> > > Unless there is some
> >> > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically
> >> > > causes a clear_pmem() over every page sized chunk of the target I/O
> >> > > range for which we already have the data.
> >> >
> >> > I don't follow - this only zeros blocks when we do allocation of new
> >> > blocks or overwrite unwritten extents, not on blocks which we
> >> > already have written data extents allocated for...
> >> >
> >>
> >> Why are we assuming that block zeroing is more efficient than unwritten
> >> extents for DAX/dio? I haven't played with pmem enough to know for sure
> >> one way or another (or if hw support is imminent), but I'd expect the
> >> latter to be more efficient in general without any kind of hardware
> >> support.
> >>
> >> Just as an example, here's an 8GB pwrite test, large buffer size, to XFS
> >> on a ramdisk mounted with '-o dax:'
> >>
> >> - Before this series:
> >>
> >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
> >> wrote 8589934592/8589934592 bytes at offset 0
> >> 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec)
> >>
> >> - After this series:
> >>
> >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
> >> wrote 8589934592/8589934592 bytes at offset 0
> >> 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec)
> >
> > That looks wrong. Much, much slower than it should be just zeroing
> > pages and then writing to them again while cache hot.
> >
> > Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this
> > loop spending most of the CPU time:
> >
> >        │     │ jbe    ea
> >        │ de:   clflus %ds:(%rax)
> >  84.67 │       add    %rsi,%rax
> >        │       cmp    %rax,%rdx
> >        │     │ ja     de
> >        │ ea:   add    %r13,-0x38(%rbp)
> >        │       sub    %r12,%r14
> >        │       sub    %r12,-0x40(%rbp)
> >
> > That is the overhead of __arch_wb_cache_pmem() i.e. issuing CPU
> > cache flushes after each memset.
> 
> Ideally this would be non-temporal and skip the second flush loop
> altogether.  Outside of that another problem is that this cpu does not
> support the clwb instruction and is instead using the serializing and
> invalidating clflush instruction.

Sure, it can be optimised to improve behaviour, and other hardware
will be more performant, but we'll still be doing synchronous cache
flushing operations here and so the fundamental problem still
exists.

> > This comes back to the comments I made w.r.t. the pmem driver
> > implementation doing synchronous IO by immediately forcing CPU cache
> > flushes and barriers. it's obviously correct, but it looks like
> > there's going to be a major performance penalty associated with it.
> > This is why I recently suggested that a pmem driver that doesn't do
> > CPU cache writeback during IO but does it on REQ_FLUSH is an
> > architecture we'll likely have to support.
> >
> 
> The only thing we can realistically delay is wmb_pmem() i.e. the final
> sync waiting for data that has *left* the cpu cache.  Unless/until we
> get a architecturally guaranteed method to write-back the entire
> cache, or flush the cache by physical-cache-way we're stuck with
> either non-temporal cycles or looping on potentially huge virtual
> address ranges.

I'm missing something: why won't flushing the address range returned
by bdev_direct_access() during a fsync operation work? i.e. we're
working with exactly the same address as dax_clear_blocks() and
dax_do_io() use, so why can't we look up that address and flush it
from fsync?
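
i.e. something of this shape at fsync time - a sketch only, with an
approximate bdev_direct_access() prototype, dax_flush_sectors() as a
made-up name, and wb_cache_pmem() standing in for the
__arch_wb_cache_pmem() style write-back:

#include <linux/blkdev.h>
#include <linux/pmem.h>
#include <linux/kernel.h>

/* fsync path sketch: map file blocks back to pmem and write them back */
static int dax_flush_sectors(struct block_device *bdev, sector_t sector,
			     size_t len)
{
	void *addr;
	unsigned long pfn;
	long avail;

	avail = bdev_direct_access(bdev, sector, &addr, &pfn, len);
	if (avail < 0)
		return avail;

	wb_cache_pmem(addr, min_t(long, avail, len));	/* flush CPU cache */
	wmb_pmem();			/* wait for the data to be stable */
	return 0;
}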

> > This, however, is not really a problem for the filesystem - it's
> > a pmem driver architecture problem. ;)
> >
> 
> It's a platform problem.  Let's see how this looks when not using
> clflush instructions.

Yes, clflush vs clwb - that's a platform problem.

However, the architectural problem I've referred to is that we've
designed the DAX infrastructure around a CPU implementation that
requires synchronous memory flushes for persistence, and so driven
that synchronous flush requirement into the DAX implementation
itself....

> Also, another benefit of pushing zeroing down into the driver is that
> for brd, as used in this example, it will rightly be a nop because
> there's no persistence to guarantee there.

Precisely my point - different drivers and hardware have different
semantics and optimisations, and only by separating the data copying
from the persistence model do we end up with infrastructure that is
flexible enough to work with different hardware/driver pmem models...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-11-02 21:44               ` Dave Chinner
@ 2015-11-03  9:16                 ` Jan Kara
  -1 siblings, 0 replies; 47+ messages in thread
From: Jan Kara @ 2015-11-03  9:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Brian Foster, ross.zwisler, jack, xfs, linux-fsdevel,
	dan.j.williams, linux-nvdimm

On Tue 03-11-15 08:44:24, Dave Chinner wrote:
> Realistically, dax_clear_blocks() should probably be implemented at
> the pmem driver layer through blkdev_issue_zeroout() because all it
> does is directly map the sector/len to pfn via bdev_direct_access()
> and then zero it - it's a sector based, block device operation. We
> don't actually need a special case path for DAX here. Optimisation
> of this operation has little to do with the filesystem.

Yep. This is actually what I did in ext4 - the block zeroing is using the
ext4 block zeroout path which ends up calling blkdev_issue_zeroout(). I
didn't want to special-case DAX and figured out that if we want
performance, we should implement blkdev_issue_zeroout() efficiently for
pmem. After all ext4 uses blkdev_issue_zeroout() in other extent conversion
cases where zeroing out is needed.
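
In other words the filesystem side stays as simple as something like
this - zero_fs_blocks() is illustrative, the sector conversion assumes
a block size of 1 << blkbits, and the bool argument matches the
4.3-era blkdev_issue_zeroout() prototype:

#include <linux/blkdev.h>

/* hand zeroing of a filesystem block range down to the block layer */
static int zero_fs_blocks(struct block_device *bdev, sector_t fs_block,
			  sector_t nr_blocks, unsigned int blkbits)
{
	sector_t sector = fs_block << (blkbits - 9);
	sector_t nr_sects = nr_blocks << (blkbits - 9);

	return blkdev_issue_zeroout(bdev, sector, nr_sects, GFP_NOFS, false);
}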

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-11-03  5:04                   ` Dave Chinner
@ 2015-11-04  0:50                     ` Ross Zwisler
  -1 siblings, 0 replies; 47+ messages in thread
From: Ross Zwisler @ 2015-11-04  0:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, Brian Foster, Ross Zwisler, Jan Kara, xfs,
	linux-fsdevel, linux-nvdimm

On Tue, Nov 03, 2015 at 04:04:13PM +1100, Dave Chinner wrote:
> On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote:
> > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote:
<>
> > > This comes back to the comments I made w.r.t. the pmem driver
> > > implementation doing synchronous IO by immediately forcing CPU cache
> > > flushes and barriers. it's obviously correct, but it looks like
> > > there's going to be a major performance penalty associated with it.
> > > This is why I recently suggested that a pmem driver that doesn't do
> > > CPU cache writeback during IO but does it on REQ_FLUSH is an
> > > architecture we'll likely have to support.
> > >
> > 
> > The only thing we can realistically delay is wmb_pmem() i.e. the final
> > sync waiting for data that has *left* the cpu cache.  Unless/until we
> > get a architecturally guaranteed method to write-back the entire
> > cache, or flush the cache by physical-cache-way we're stuck with
> > either non-temporal cycles or looping on potentially huge virtual
> > address ranges.
> 
> I'm missing something: why won't flushing the address range returned
> by bdev_direct_access() during a fsync operation work? i.e. we're
> working with exactly the same address as dax_clear_blocks() and
> dax_do_io() use, so why can't we look up that address and flush it
> from fsync?

I could be wrong, but I don't see a reason why DAX can't use the strategy of
writing data and marking it dirty in one step and then flushing later in
response to fsync/msync.  I think this could be used everywhere we write or
zero data - dax_clear_blocks(), dax_io() etc.  (I believe that lots of the
block zeroing code will go away once we have the XFS and ext4 patches in that
guarantee we will only get written and zeroed extents from the filesystem in
response to get_block().)  I think the PMEM driver, lacking the ability to
mark things as dirty in the radix tree, etc, will need to keep doing things
synchronously.
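
The write-now, flush-later idea is roughly this - an illustrative
sketch only, not the 4.3 DAX code: the function names are made up, it
assumes the radix tree already has an entry for the page, and
wb_cache_pmem() stands for the cacheline write-back helper:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/radix-tree.h>
#include <linux/string.h>

/* write path: copy the data and tag it dirty instead of flushing now */
static void dax_write_and_tag(struct address_space *mapping, pgoff_t index,
			      void *kaddr, const void *buf, size_t len)
{
	memcpy(kaddr, buf, len);	/* data may still sit in the CPU cache */

	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
}

/* fsync/msync path: write back one page that was tagged dirty above */
static void dax_flush_one(struct address_space *mapping, pgoff_t index,
			  void *kaddr)
{
	wb_cache_pmem(kaddr, PAGE_SIZE);	/* make the data durable */

	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_clear(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
}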

Hmm...if we go this path, though, is that an argument against moving the
zeroing from DAX down into the driver?  True, with BRD it makes things nice
and efficient because you can zero and never flush, and the driver knows
there's nothing else to do.

For PMEM, though, you lose the ability to zero the data and then queue the
flushing for later, as you would be able to do if you left the zeroing code in
DAX.  The benefit of this is that if you are going to immediately re-write the
newly zeroed data (which seems common), PMEM will end up doing an extra cache
flush of the zeroes, only to have them overwritten and marked as dirty by DAX.
If we leave the zeroing to DAX we can mark it dirty once, zero it once, write
it once, and flush it once.

This would make us lose the ability to do hardware-assisted flushing in the
future that requires driver specific knowledge, though I don't think that
exists yet.  Perhaps we should leave the zeroing in DAX for now to take
advantage of the single flush, and then move it down if a driver can improve
performance with hardware assisted PMEM zeroing?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-11-04  0:50                     ` Ross Zwisler
@ 2015-11-04  1:02                       ` Dan Williams
  -1 siblings, 0 replies; 47+ messages in thread
From: Dan Williams @ 2015-11-04  1:02 UTC (permalink / raw)
  To: Ross Zwisler, Dave Chinner, Dan Williams, Brian Foster, Jan Kara,
	xfs, linux-fsdevel, linux-nvdimm

On Tue, Nov 3, 2015 at 4:50 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Nov 03, 2015 at 04:04:13PM +1100, Dave Chinner wrote:
>> On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote:
>> > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> <>
>> > > This comes back to the comments I made w.r.t. the pmem driver
>> > > implementation doing synchronous IO by immediately forcing CPU cache
>> > > flushes and barriers. it's obviously correct, but it looks like
>> > > there's going to be a major performance penalty associated with it.
>> > > This is why I recently suggested that a pmem driver that doesn't do
>> > > CPU cache writeback during IO but does it on REQ_FLUSH is an
>> > > architecture we'll likely have to support.
>> > >
>> >
>> > The only thing we can realistically delay is wmb_pmem() i.e. the final
>> > sync waiting for data that has *left* the cpu cache.  Unless/until we
>> > get a architecturally guaranteed method to write-back the entire
>> > cache, or flush the cache by physical-cache-way we're stuck with
>> > either non-temporal cycles or looping on potentially huge virtual
>> > address ranges.
>>
>> I'm missing something: why won't flushing the address range returned
>> by bdev_direct_access() during a fsync operation work? i.e. we're
>> working with exactly the same address as dax_clear_blocks() and
>> dax_do_io() use, so why can't we look up that address and flush it
>> from fsync?
>
> I could be wrong, but I don't see a reason why DAX can't use the strategy of
> writing data and marking it dirty in one step and then flushing later in
> response to fsync/msync.  I think this could be used everywhere we write or
> zero data - dax_clear_blocks(), dax_io() etc.  (I believe that lots of the
> block zeroing code will go away once we have the XFS and ext4 patches in that
> guarantee we will only get written and zeroed extents from the filesystem in
> response to get_block().)  I think the PMEM driver, lacking the ability to
> mark things as dirty in the radix tree, etc, will need to keep doing things
> synchronously.

Not without numbers showing the relative performance of dirtying cache
followed by flushing vs non-temporal + pcommit.

> Hmm...if we go this path, though, is that an argument against moving the
> zeroing from DAX down into the driver?  True, with BRD it makes things nice
> and efficient because you can zero and never flush, and the driver knows
> there's nothing else to do.
>
> For PMEM, though, you lose the ability to zero the data and then queue the
> flushing for later, as you would be able to do if you left the zeroing code in
> DAX.  The benefit of this is that if you are going to immediately re-write the
> newly zeroed data (which seems common), PMEM will end up doing an extra cache
> flush of the zeroes, only to have them overwritten and marked as dirty by DAX.
> If we leave the zeroing to DAX we can mark it dirty once, zero it once, write
> it once, and flush it once.

Why do we lose the ability to flush later if the driver supports
blkdev_issue_zeroout?

> This would make us lose the ability to do hardware-assisted flushing in the
> future that requires driver specific knowledge, though I don't think that
> exists yet.

ioatdma has supported memset() for a while now, but I would prioritize
a non-temporal SIMD implementation first.

> Perhaps we should leave the zeroing in DAX for now to take
> advantage of the single flush, and then move it down if a driver can improve
> performance with hardware assisted PMEM zeroing?

Not convinced.  I think we should implement the driver zeroing
solution and take a look at performance.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-11-04  1:02                       ` Dan Williams
@ 2015-11-04  4:46                         ` Ross Zwisler
  -1 siblings, 0 replies; 47+ messages in thread
From: Ross Zwisler @ 2015-11-04  4:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, Dave Chinner, Brian Foster, Jan Kara, xfs,
	linux-fsdevel, linux-nvdimm

On Tue, Nov 03, 2015 at 05:02:34PM -0800, Dan Williams wrote:
> On Tue, Nov 3, 2015 at 4:50 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Tue, Nov 03, 2015 at 04:04:13PM +1100, Dave Chinner wrote:
> >> On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote:
> >> > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> > <>
> >> > > This comes back to the comments I made w.r.t. the pmem driver
> >> > > implementation doing synchronous IO by immediately forcing CPU cache
> >> > > flushes and barriers. it's obviously correct, but it looks like
> >> > > there's going to be a major performance penalty associated with it.
> >> > > This is why I recently suggested that a pmem driver that doesn't do
> >> > > CPU cache writeback during IO but does it on REQ_FLUSH is an
> >> > > architecture we'll likely have to support.
> >> > >
> >> >
> >> > The only thing we can realistically delay is wmb_pmem() i.e. the final
> >> > sync waiting for data that has *left* the cpu cache.  Unless/until we
> >> > get a architecturally guaranteed method to write-back the entire
> >> > cache, or flush the cache by physical-cache-way we're stuck with
> >> > either non-temporal cycles or looping on potentially huge virtual
> >> > address ranges.
> >>
> >> I'm missing something: why won't flushing the address range returned
> >> by bdev_direct_access() during a fsync operation work? i.e. we're
> >> working with exactly the same address as dax_clear_blocks() and
> >> dax_do_io() use, so why can't we look up that address and flush it
> >> from fsync?
> >
> > I could be wrong, but I don't see a reason why DAX can't use the strategy of
> > writing data and marking it dirty in one step and then flushing later in
> > response to fsync/msync.  I think this could be used everywhere we write or
> > zero data - dax_clear_blocks(), dax_io() etc.  (I believe that lots of the
> > block zeroing code will go away once we have the XFS and ext4 patches in that
> > guarantee we will only get written and zeroed extents from the filesystem in
> > response to get_block().)  I think the PMEM driver, lacking the ability to
> > mark things as dirty in the radix tree, etc, will need to keep doing things
> > synchronously.
> 
> Not without numbers showing the relative performance of dirtying cache
> followed by flushing vs non-temporal + pcommit.

Sorry - do you mean that you want to make sure that we get a performance
benefit from the "dirty and flush later" path vs the "write and flush now"
path?  Sure, that seems reasonable.

> > Hmm...if we go this path, though, is that an argument against moving the
> > zeroing from DAX down into the driver?  True, with BRD it makes things nice
> > and efficient because you can zero and never flush, and the driver knows
> > there's nothing else to do.
> >
> > For PMEM, though, you lose the ability to zero the data and then queue the
> > flushing for later, as you would be able to do if you left the zeroing code in
> > DAX.  The benefit of this is that if you are going to immediately re-write the
> > newly zeroed data (which seems common), PMEM will end up doing an extra cache
> > flush of the zeroes, only to have them overwritten and marked as dirty by DAX.
> > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write
> > it once, and flush it once.
> 
> Why do we lose the ability to flush later if the driver supports
> blkdev_issue_zeroout?

I think that if you implement the zeroing in the driver you'd also need to
flush in the driver, because you wouldn't have access to the radix tree to
mark entries as dirty for flushing later.

As I think about this more, though, I'm not sure that deferring the flush of
the zeroing would work.  I'm guessing that the filesystem must require a sync
point between the zeroing and the subsequent follow-up writes so that you can
sync metadata for the block allocation.  Otherwise you could end up in a
situation where you've got your metadata pointing at newly allocated blocks
but the new zeroes are still in the processor cache - if you lose power
you've just created an information leak.  Dave, Jan, does this make sense?

> > This would make us lose the ability to do hardware-assisted flushing in the
> > future that requires driver specific knowledge, though I don't think that
> > exists yet.
> 
> ioatdma has supported memset() for a while now, but I would prioritize
> a non-temporal SIMD implementation first.

Sweet, didn't know about that, obviously.  :)  Thanks for the pointer.

> > Perhaps we should leave the zeroing in DAX for now to take
> > advantage of the single flush, and then move it down if a driver can improve
> > performance with hardware assisted PMEM zeroing?
> 
> Not convinced.  I think we should implement the driver zeroing
> solution and take a look at performance.

I agree, this should all be driven by performance measurements.  Thanks for
the feedback.

^ permalink raw reply	[flat|nested] 47+ messages in thread


* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-11-04  4:46                         ` Ross Zwisler
@ 2015-11-04  9:06                           ` Jan Kara
  -1 siblings, 0 replies; 47+ messages in thread
From: Jan Kara @ 2015-11-04  9:06 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Dan Williams, Dave Chinner, Brian Foster, Jan Kara, xfs,
	linux-fsdevel, linux-nvdimm

On Tue 03-11-15 21:46:13, Ross Zwisler wrote:
> On Tue, Nov 03, 2015 at 05:02:34PM -0800, Dan Williams wrote:
> > > Hmm...if we go this path, though, is that an argument against moving the
> > > zeroing from DAX down into the driver?  True, with BRD it makes things nice
> > > and efficient because you can zero and never flush, and the driver knows
> > > there's nothing else to do.
> > >
> > > For PMEM, though, you lose the ability to zero the data and then queue the
> > > flushing for later, as you would be able to do if you left the zeroing code in
> > > DAX.  The benefit of this is that if you are going to immediately re-write the
> > > newly zeroed data (which seems common), PMEM will end up doing an extra cache
> > > flush of the zeroes, only to have them overwritten and marked as dirty by DAX.
> > > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write
> > > it once, and flush it once.
> > 
> > Why do we lose the ability to flush later if the driver supports
> > blkdev_issue_zeroout?
> 
> I think that if you implement zeroing in the driver you'd need to also
> flush in the driver because you wouldn't have access to the radix tree to
> be able to mark entries as dirty so you can flush them later.
> 
> As I think about this more, though, I'm not sure that having the zeroing
> flush later could work.  I'm guessing that the filesystem must require a
> sync point between the zeroing and the subsequent follow-up writes so
> that you can sync metadata for the block allocation.  Otherwise you could
> end up in a situation where you've got your metadata pointing at newly
> allocated blocks but the new zeros are still in the processor cache - if
> you lose power you've just created an information leak.   Dave, Jan, does
> this make sense?  

So the problem you describe does not exist. The thing to keep in mind is
that filesystems are designed to work reliably with a 'non-persistent'
cache in the disk, which is common these days. That's why we bother with
all that REQ_FLUSH | REQ_FUA and blkdev_issue_flush() stuff in the first
place. The processor cache is exactly that kind of cache attached to the
PMEM storage, and Dave and I are trying to steer you towards a solution
that treats it the same way in DAX filesystems as well :).

Now for how the problem is currently solved: when we allocate blocks, we
just record that information in a transaction in the journal. For the DAX
case we also submit the IO zeroing those blocks and wait for it. If we
crash before the transaction gets committed, the blocks won't be seen in
the inode after journal recovery, and thus no data exposure can happen. As
part of the transaction commit, we call blkdev_issue_flush() (or submit a
REQ_FLUSH request). We expect that to force all the IO sitting in volatile
caches out to persistent storage. So this will also force the zeroing into
persistent storage for normal disks, and AFAIU if you do the zeroing with
non-temporal writes in the pmem driver and then do wmb_pmem() in response
to a flush request, we get the same persistence guarantee in the pmem case
as well. So after a transaction commit we are guaranteed to see zeroes in
those allocated blocks.

So the transaction commit, and the corresponding flush request in
particular, is the sync point you speak about above. The good thing is
that in most cases this happens after the real data gets written into
those blocks, so we save the unnecessary flush.
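
For illustration, a minimal sketch of the ordering described above. The
journal_*() helpers are hypothetical placeholders standing in for the real
transaction machinery; dax_clear_blocks() and blkdev_issue_flush() are
actual 4.3-era interfaces, and the real ext4/XFS call sites look quite
different:

#include <linux/fs.h>
#include <linux/blkdev.h>

/* Illustrative sketch only, not the real ext4/XFS allocation path. */
static int alloc_and_zero_dax_blocks(struct inode *inode, sector_t block,
				     long size)
{
	void *txn = journal_start(inode);	/* hypothetical helper */
	int error;

	/* Log the block allocation in the not-yet-committed transaction. */
	journal_record_alloc(txn, block);	/* hypothetical helper */

	/*
	 * Zero the new blocks synchronously.  On pmem this is done with
	 * non-temporal stores, but the zeroes may still sit in a volatile
	 * cache at this point.
	 */
	error = dax_clear_blocks(inode, block, size);
	if (error)
		return error;

	/*
	 * Committing the transaction issues a flush (REQ_FLUSH /
	 * blkdev_issue_flush()), which a pmem driver answers with
	 * wmb_pmem().  Journal recovery can only expose the new blocks
	 * after this commit, and by then the zeroes are durable, so there
	 * is no window for a stale-data leak.
	 */
	return journal_commit(txn);		/* hypothetical helper */
}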

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 47+ messages in thread


* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-11-04  9:06                           ` Jan Kara
@ 2015-11-04 15:35                             ` Ross Zwisler
  -1 siblings, 0 replies; 47+ messages in thread
From: Ross Zwisler @ 2015-11-04 15:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Dan Williams, Dave Chinner, Brian Foster, xfs,
	linux-fsdevel, linux-nvdimm

On Wed, Nov 04, 2015 at 10:06:12AM +0100, Jan Kara wrote:
> On Tue 03-11-15 21:46:13, Ross Zwisler wrote:
> > On Tue, Nov 03, 2015 at 05:02:34PM -0800, Dan Williams wrote:
> > > > Hmm...if we go this path, though, is that an argument against moving the
> > > > zeroing from DAX down into the driver?  True, with BRD it makes things nice
> > > > and efficient because you can zero and never flush, and the driver knows
> > > > there's nothing else to do.
> > > >
> > > > For PMEM, though, you lose the ability to zero the data and then queue the
> > > > flushing for later, as you would be able to do if you left the zeroing code in
> > > > DAX.  The benefit of this is that if you are going to immediately re-write the
> > > > newly zeroed data (which seems common), PMEM will end up doing an extra cache
> > > > flush of the zeroes, only to have them overwritten and marked as dirty by DAX.
> > > > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write
> > > > it once, and flush it once.
> > > 
> > > Why do we lose the ability to flush later if the driver supports
> > > blkdev_issue_zeroout?
> > 
> > I think that if you implement zeroing in the driver you'd need to also
> > flush in the driver because you wouldn't have access to the radix tree to
> > be able to mark entries as dirty so you can flush them later.
> > 
> > As I think about this more, though, I'm not sure that having the zeroing
> > flush later could work.  I'm guessing that the filesystem must require a
> > sync point between the zeroing and the subsequent follow-up writes so
> > that you can sync metadata for the block allocation.  Otherwise you could
> > end up in a situation where you've got your metadata pointing at newly
> > allocated blocks but the new zeros are still in the processor cache - if
> > you lose power you've just created an information leak.   Dave, Jan, does
> > this make sense?  
> 
> So the problem you describe does not exist. Thing to keep in mind is that
> filesystem are designed to work reliably with 'non-persistent' cache in the
> disk which is common these days. That's why we bother with all that
> REQ_FLUSH | REQ_FUA and blkdev_issue_flush() stuff after all. Processor
> cache is exactly that kind of the cache attached to the PMEM storage. And
> Dave and I try to steer you to a solution that would also treat it equally
> in DAX filesystems as well :).

And I'm always grateful for the guidance. :)

> Now how the problem is currently solved: When we allocate blocks, we just
> record that information in a transaction in the journal. For DAX case we
> also submit the IO zeroing those blocks and wait for it. Now if we crash
> before the transaction gets committed, blocks won't be seen in the inode
> after a journal recovery and thus no data exposure can happen. As a part of
> transaction commit, we call blkdev_issue_flush() (or submit REQ_FLUSH
> request). We expect that to force out all the IO in volatile caches into
> the persistent storage. So this will also force the zeroing into persistent
> storage for normal disks and AFAIU if you do zeroing with non-temporal
> writes in pmem driver and then do wmb_pmem() in response to a flush request
> we get the same persistency guarantee in pmem case as well. So after a
> transaction commit we are guaranteed to see zeros in those allocated
> blocks. 
> 
> So the transaction commit and the corresponding flush request in particular
> is the sync point you speak about above but the good thing is that in most
> cases this will happen after real data gets written into those blocks so we
> save the unnecessary flush.

Cool, thank you for the explanation, that makes sense to me.

When dealing with normal SSDs and the page cache, does the filesystem keep
the zeroes in the page cache, or does it issue the zeroing directly to the
driver via sb_issue_zeroout()/blkdev_issue_zeroout()?  If we keep it in the
page cache so that the follow-up writes just update the dirty pages and we
end up writing to media once, this seems like it would flow nicely into the
idea of zeroing new blocks at the DAX level without flushing and just
marking them as dirty in the radix tree.  If the zeroing happens via
sb_issue_zeroout() then this probably doesn't make sense, because the
existing flow won't include an fsync/msync-type step for the newly zeroed
data in the page cache.

^ permalink raw reply	[flat|nested] 47+ messages in thread


* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
  2015-11-04 15:35                             ` Ross Zwisler
@ 2015-11-04 17:21                               ` Jan Kara
  -1 siblings, 0 replies; 47+ messages in thread
From: Jan Kara @ 2015-11-04 17:21 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Dan Williams, Dave Chinner, Brian Foster, xfs,
	linux-fsdevel, linux-nvdimm

On Wed 04-11-15 08:35:51, Ross Zwisler wrote:
> On Wed, Nov 04, 2015 at 10:06:12AM +0100, Jan Kara wrote:
> > Now how the problem is currently solved: When we allocate blocks, we just
> > record that information in a transaction in the journal. For DAX case we
> > also submit the IO zeroing those blocks and wait for it. Now if we crash
> > before the transaction gets committed, blocks won't be seen in the inode
> > after a journal recovery and thus no data exposure can happen. As a part of
> > transaction commit, we call blkdev_issue_flush() (or submit REQ_FLUSH
> > request). We expect that to force out all the IO in volatile caches into
> > the persistent storage. So this will also force the zeroing into persistent
> > storage for normal disks and AFAIU if you do zeroing with non-temporal
> > writes in pmem driver and then do wmb_pmem() in response to a flush request
> > we get the same persistency guarantee in pmem case as well. So after a
> > transaction commit we are guaranteed to see zeros in those allocated
> > blocks. 
> > 
> > So the transaction commit and the corresponding flush request in particular
> > is the sync point you speak about above but the good thing is that in most
> > cases this will happen after real data gets written into those blocks so we
> > save the unnecessary flush.
> 
> Cool, thank you for the explanation, that makes sense to me.
> 
> When dealing with normal SSDs and the page cache, does the filesystem keep the
> zeroes in the page cache, or does it issue it directly to the driver via
> sb_issue_zeroout()/blkdev_issue_zeroout()?  If we keep it in the page cache so

Currently we use sb_issue_zeroout() for zeroing out blocks in ext4 (for DAX
and in some other cases to avoid splitting extents too much).  Note that
zeroing blocks in the page cache won't work on its own - blkdev_issue_flush()
doesn't force out any dirty pages from the page cache, so the transaction
commit wouldn't ensure that stale block contents are not exposed.
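
As a rough sketch, the call during allocation looks something like the
following (simplified, with an illustrative wrapper name; sb_issue_zeroout()
is the real helper and just wraps blkdev_issue_zeroout()):

#include <linux/blkdev.h>

/*
 * Hypothetical wrapper: synchronously zero a newly allocated block range
 * on the backing device before the block mapping is published.  The
 * zeroes then become durable no later than the next transaction commit,
 * when the journal issues its flush (REQ_FLUSH / blkdev_issue_flush()).
 */
static int zero_new_blocks(struct super_block *sb, sector_t first_block,
			   sector_t nr_blocks)
{
	return sb_issue_zeroout(sb, first_block, nr_blocks, GFP_NOFS);
}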

However, we actually do have a mechanism in ext4 to force out data from the
page cache during transaction commit - that is what data=ordered mode is all
about - we can attach an inode to a transaction, and all dirty data of that
inode that has underlying blocks allocated is flushed as part of the
transaction commit. So this makes sure stale data cannot be seen after a
crash in data=ordered mode. XFS doesn't have this mechanism, and thus it
normally has to put allocated blocks in unwritten extents, submit IO to
write data to those blocks, and convert the extent to written once the IO
finishes.

> that the follow-up writes just update the dirty pages and we end up
> writing to media once, this seems like it would flow nicely into the idea
> of zeroing new blocks at the DAX level without flushing and just marking
> them as dirty in the radix tree.

So for ext4 you could make this work the same way data=ordered mode works -
just store zeroes into the allocated blocks, mark the page as dirty in the
radix tree, attach the inode to the running transaction, and do the flush on
transaction commit. However, note that for DAX page faults the zeroing needs
to happen under the fs lock serializing faults to the same page, which
effectively means it happens inside the filesystem's block mapping function,
and at that point we don't have a page to zero out. So it would require
considerable plumbing to make this work even for ext4, and I'm not even
speaking of XFS, where the flushing mechanism on transaction commit doesn't
currently exist AFAIK.

Frankly, since the savings would be realized only for allocating writes into
mmapped files, which aren't that common, I'm not convinced this would be
worth it.

> If the zeroing happens via sb_issue_zeroout() then this probably doesn't
> make sense because the existing flow won't include a fsync/msync type
> step of the newly zeroed data in the page cache.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 47+ messages in thread


* Re: [PATCH 0/6 V2] xfs: upfront block zeroing for DAX
  2015-10-19  3:27 [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Dave Chinner
                   ` (5 preceding siblings ...)
  2015-10-19  3:27 ` [PATCH 6/6] xfs: xfs_filemap_pmd_fault treats read faults as write faults Dave Chinner
@ 2015-11-05 23:48 ` Ross Zwisler
  2015-11-06 22:32   ` Dave Chinner
  2015-11-06 18:12 ` Boylston, Brian
  7 siblings, 1 reply; 47+ messages in thread
From: Ross Zwisler @ 2015-11-05 23:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ross.zwisler, jack, xfs

On Mon, Oct 19, 2015 at 02:27:12PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> This is an updated patch set that was first posted here:
> 
> http://oss.sgi.com/archives/xfs/2015-10/msg00006.html
> 
> I've dropped the DAX locking revert patch from it; that's on it's
> way to Linus via other channels and is essentially independent to
> this set of XFS changes.
> 
> The only real change in the XFS code between the two versions is the
> addition of XFS_TRANS_RESERVE in the DAX path in
> xfs_iomap_write_direct() to allow it to dip into the reserve block
> pool for unwritten extent conversion rather than reporting ENOSPC.
> 
> Patches are against 4.3-rc5 + XFS for-next branch.
> 
> -Dave.

Hey Dave,

I was going to start testing these, but I'm having trouble finding a baseline
where they apply cleanly.  It looks like xfs/for-next already contains v1 of
the series, and they don't seem to apply cleanly to the current
xfs/xfs-misc-fixes-for-4.4-2 nor to v4.3.  The xfs repo I'm looking at is
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git

What am I missing?

Thanks,
- Ross


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 0/6 V2] xfs: upfront block zeroing for DAX
  2015-10-19  3:27 [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Dave Chinner
                   ` (6 preceding siblings ...)
  2015-11-05 23:48 ` [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Ross Zwisler
@ 2015-11-06 18:12 ` Boylston, Brian
  7 siblings, 0 replies; 47+ messages in thread
From: Boylston, Brian @ 2015-11-06 18:12 UTC (permalink / raw)
  To: Dave Chinner, xfs; +Cc: ross.zwisler, jack

Hi,

I wrote a test tool that exercises page faults on hole-y portions of
an mmapped file.  The file is created, sized using various methods,
mmapped, and then two threads race to write a marker to different
offsets within each mapped page.  Once the threads have finished
marking each page, the pages are checked for the presence of the
markers.

The test easily exposes corruption on pmem-backed, DAX-mounted xfs
and ext4 file systems.

With 4.3 and this patch set, the data corruption is no longer seen.

The test code was posted to linux-ext4:
http://marc.info/?l=linux-ext4&m=144683276225574&w=2
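
In outline, the test does something like the following (a simplified sketch
for illustration only, not the actual tool posted above; the file name,
sizes, and marker values here are arbitrary):

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE	(64UL * 1024 * 1024)	/* arbitrary test file size */
#define PAGE_SZ		4096UL

static char *map;

/* Each thread writes its own 8-byte marker at a distinct offset in every page. */
static void *write_markers(void *arg)
{
	unsigned long off = (unsigned long)arg;
	const char *mark = off ? "BBBBBBBB" : "AAAAAAAA";
	unsigned long p;

	for (p = 0; p < FILE_SIZE; p += PAGE_SZ)
		memcpy(map + p + off, mark, 8);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;
	unsigned long p;
	int fd = open("testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);

	if (fd < 0 || ftruncate(fd, FILE_SIZE) < 0)
		return 1;
	/* Hole-y file: pages fault in as the racing markers are written. */
	map = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	pthread_create(&t1, NULL, write_markers, (void *)0UL);
	pthread_create(&t2, NULL, write_markers, (void *)8UL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* Both markers must survive in every page; a missing one means lost data. */
	for (p = 0; p < FILE_SIZE; p += PAGE_SZ)
		if (memcmp(map + p, "AAAAAAAA", 8) || memcmp(map + p + 8, "BBBBBBBB", 8))
			printf("corruption in page at offset %lu\n", p);
	return 0;
}

Built with -pthread and run against a file on the DAX mount, any missing
marker indicates a lost write on a racing fault.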


For the series:
Tested-by: Brian Boylston <brian.boylston@hpe.com>


-----Original Message-----
From: Dave Chinner [mailto:david@fromorbit.com] 
Sent: Sunday, October 18, 2015 11:27 PM
Subject: [PATCH 0/6 V2] xfs: upfront block zeroing for DAX

Hi folks,

This is an updated patch set that was first posted here:

http://oss.sgi.com/archives/xfs/2015-10/msg00006.html

I've dropped the DAX locking revert patch from it; that's on it's
way to Linus via other channels and is essentially independent to
this set of XFS changes.

The only real change in the XFS code between the two versions is the
addition of XFS_TRANS_RESERVE in the DAX path in
xfs_iomap_write_direct() to allow it to dip into the reserve block
pool for unwritten extent conversion rather than reporting ENOSPC.

Patches are against 4.3-rc5 + XFS for-next branch.

-Dave.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/6 V2] xfs: upfront block zeroing for DAX
  2015-11-05 23:48 ` [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Ross Zwisler
@ 2015-11-06 22:32   ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2015-11-06 22:32 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: jack, xfs

On Thu, Nov 05, 2015 at 04:48:27PM -0700, Ross Zwisler wrote:
> On Mon, Oct 19, 2015 at 02:27:12PM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > This is an updated patch set that was first posted here:
> > 
> > http://oss.sgi.com/archives/xfs/2015-10/msg00006.html
> > 
> > I've dropped the DAX locking revert patch from it; that's on it's
> > way to Linus via other channels and is essentially independent to
> > this set of XFS changes.
> > 
> > The only real change in the XFS code between the two versions is the
> > addition of XFS_TRANS_RESERVE in the DAX path in
> > xfs_iomap_write_direct() to allow it to dip into the reserve block
> > pool for unwritten extent conversion rather than reporting ENOSPC.
> > 
> > Patches are against 4.3-rc5 + XFS for-next branch.
> > 
> > -Dave.
> 
> Hey Dave,
> 
> I was going to start testing these, but I'm having trouble finding a baseline
> where they apply cleanly.  It looks like xfs/for-next already contains v1 of
> the series, and they don't seem to apply cleanly to the current
> xfs/xfs-misc-fixes-for-4.4-2 nor to v4.3.  The xfs repo I'm looking at is
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git

The xfs-dax-updates branch of that tree contains the v3 version of
the patches I posted a few days ago; i.e. just pull the for-next
branch of the XFS tree, because that already has those patches
integrated with the rest of the XFS for-next work...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2015-11-06 22:32 UTC | newest]

Thread overview: 47+ messages
2015-10-19  3:27 [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Dave Chinner
2015-10-19  3:27 ` [PATCH 1/6] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
2015-10-29 14:27   ` Brian Foster
2015-10-19  3:27 ` [PATCH 2/6] xfs: introduce BMAPI_ZERO for allocating zeroed extents Dave Chinner
2015-10-29 14:27   ` Brian Foster
2015-10-29 23:35     ` Dave Chinner
2015-10-30 12:36       ` Brian Foster
2015-11-02  1:21         ` Dave Chinner
2015-10-19  3:27 ` [PATCH 3/6] xfs: Don't use unwritten extents for DAX Dave Chinner
2015-10-29 14:29   ` Brian Foster
2015-10-29 23:37     ` Dave Chinner
2015-10-30 12:36       ` Brian Foster
2015-11-02  1:14         ` Dave Chinner
2015-11-02 14:15           ` Brian Foster
2015-11-02 21:44             ` Dave Chinner
2015-11-02 21:44               ` Dave Chinner
2015-11-02 21:44               ` Dave Chinner
2015-11-03  3:53               ` Dan Williams
2015-11-03  3:53                 ` Dan Williams
2015-11-03  3:53                 ` Dan Williams
2015-11-03  5:04                 ` Dave Chinner
2015-11-03  5:04                   ` Dave Chinner
2015-11-04  0:50                   ` Ross Zwisler
2015-11-04  0:50                     ` Ross Zwisler
2015-11-04  1:02                     ` Dan Williams
2015-11-04  1:02                       ` Dan Williams
2015-11-04  4:46                       ` Ross Zwisler
2015-11-04  4:46                         ` Ross Zwisler
2015-11-04  9:06                         ` Jan Kara
2015-11-04  9:06                           ` Jan Kara
2015-11-04 15:35                           ` Ross Zwisler
2015-11-04 15:35                             ` Ross Zwisler
2015-11-04 17:21                             ` Jan Kara
2015-11-04 17:21                               ` Jan Kara
2015-11-03  9:16               ` Jan Kara
2015-11-03  9:16                 ` Jan Kara
2015-10-19  3:27 ` [PATCH 4/6] xfs: DAX does not use IO completion callbacks Dave Chinner
2015-10-29 14:29   ` Brian Foster
2015-10-29 23:39     ` Dave Chinner
2015-10-30 12:37       ` Brian Foster
2015-10-19  3:27 ` [PATCH 5/6] xfs: add ->pfn_mkwrite support for DAX Dave Chinner
2015-10-29 14:30   ` Brian Foster
2015-10-19  3:27 ` [PATCH 6/6] xfs: xfs_filemap_pmd_fault treats read faults as write faults Dave Chinner
2015-10-29 14:30   ` Brian Foster
2015-11-05 23:48 ` [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Ross Zwisler
2015-11-06 22:32   ` Dave Chinner
2015-11-06 18:12 ` Boylston, Brian
