* LTP write03 writev07 xfs failures
@ 2017-02-27  4:22 Xiong Zhou
  2017-02-27 16:09 ` Brian Foster
  0 siblings, 1 reply; 8+ messages in thread
From: Xiong Zhou @ 2017-02-27  4:22 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-kernel, linux-fsdevel

Hi,

These 2 tests PASS on Linus tree commit:
  37c8596 Merge tag 'tty-4.11-rc1' of git://git.kernel.org/pub/scm/linux...
FAIL on commit:
  60e8d3e Merge tag 'pci-v4.11-changes' of git://git.kernel.org/pub/scm/...

LTP latest commit: c60d3ca move_pages12: include lapi/mmap.h

Steps:

sh-4.2# pwd
/root/ltp
sh-4.2# git log --oneline -1
c60d3ca move_pages12: include lapi/mmap.h
sh-4.2# uname -r
4.10.0-master-60e8d3e+
sh-4.2# mount | grep test1
/dev/sda3 on /test1 type xfs (rw,relatime,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)
sh-4.2# xfs_info /test1
meta-data=/dev/sda3              isize=512    agcount=16, agsize=245696 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0
data     =                       bsize=4096   blocks=3931136, imaxpct=25
         =                       sunit=64     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
sh-4.2# 
sh-4.2# TMPDIR=/test1 ./testcases/kernel/syscalls/write/write03
write03     0  TINFO  :  Enter Block 1: test to check if write corrupts the file when write fails
write03     1  TFAIL  :  write03.c:125: failure of write(2) corrupted the file
write03     0  TINFO  :  Exit block 1
sh-4.2# 
sh-4.2# TMPDIR=/test1 ./testcases/kernel/syscalls/writev/writev07
tst_test.c:760: INFO: Timeout per run is 0h 05m 00s
writev07.c:60: INFO: starting test with initial file offset: 0 
writev07.c:82: INFO: got EFAULT
writev07.c:87: FAIL: file was written to
writev07.c:93: PASS: offset stayed unchanged
writev07.c:60: INFO: starting test with initial file offset: 65 
writev07.c:82: INFO: got EFAULT
writev07.c:89: PASS: file stayed untouched
writev07.c:93: PASS: offset stayed unchanged
writev07.c:60: INFO: starting test with initial file offset: 4096 
writev07.c:82: INFO: got EFAULT
writev07.c:89: PASS: file stayed untouched
writev07.c:93: PASS: offset stayed unchanged
writev07.c:60: INFO: starting test with initial file offset: 4097 
writev07.c:82: INFO: got EFAULT
writev07.c:89: PASS: file stayed untouched
writev07.c:93: PASS: offset stayed unchanged

Summary:
passed   7
failed   1
skipped  0
warnings 0
sh-4.2# 
sh-4.2# mkfs.xfs -V
mkfs.xfs version 4.7.0
sh-4.2# cd ../xfsprogs/
sh-4.2# git log --oneline -1
d7e1f5f xfsprogs: Release v4.7
sh-4.2# 

Thanks,
Xiong


* Re: LTP write03 writev07 xfs failures
  2017-02-27  4:22 LTP write03 writev07 xfs failures Xiong Zhou
@ 2017-02-27 16:09 ` Brian Foster
  2017-02-27 20:13   ` Brian Foster
  0 siblings, 1 reply; 8+ messages in thread
From: Brian Foster @ 2017-02-27 16:09 UTC (permalink / raw)
  To: Xiong Zhou; +Cc: linux-xfs, linux-kernel, linux-fsdevel, Christoph Hellwig

cc Christoph

On Mon, Feb 27, 2017 at 12:22:20PM +0800, Xiong Zhou wrote:
> Hi,
> 
> These 2 tests PASS on Linus tree commit:
>   37c8596 Merge tag 'tty-4.11-rc1' of git://git.kernel.org/pub/scm/linux...
> FAIL on commit:
>   60e8d3e Merge tag 'pci-v4.11-changes' of git://git.kernel.org/pub/scm/...
> 
> LTP latest commit: c60d3ca move_pages12: include lapi/mmap.h
> 
> Steps:
> 
> sh-4.2# pwd
> /root/ltp
> sh-4.2# git log --oneline -1
> c60d3ca move_pages12: include lapi/mmap.h
> sh-4.2# uname -r
> 4.10.0-master-60e8d3e+
> sh-4.2# mount | grep test1
> /dev/sda3 on /test1 type xfs (rw,relatime,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)
> sh-4.2# xfs_info /test1
> meta-data=/dev/sda3              isize=512    agcount=16, agsize=245696 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1 spinodes=0
> data     =                       bsize=4096   blocks=3931136, imaxpct=25
>          =                       sunit=64     swidth=64 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal               bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=64 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> sh-4.2# 
> sh-4.2# TMPDIR=/test1 ./testcases/kernel/syscalls/write/write03
> write03     0  TINFO  :  Enter Block 1: test to check if write corrupts the file when write fails
> write03     1  TFAIL  :  write03.c:125: failure of write(2) corrupted the file
> write03     0  TINFO  :  Exit block 1
> sh-4.2# 

On a quick test, both of these are reproduced after commit fa7f138ac4
("xfs: clear delalloc and cache on buffered write failure"). That patch
fixed a problem where if the write allocates a block but fails to write
anything (written == 0), we'd leave a delalloc block lingering in the
inode.

With that change, this test now fails because it sends two writes within
a single block. The first allocates the block, writes 100 bytes and
returns successfully. The next attempts to write the next 100 bytes,
fails and triggers the cleanup of the block because we can't tell
whether this write or the previous had allocated it.
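
The two-write sequence described above can be sketched in userspace as
follows. This is a minimal sketch and not the LTP source: the function
name, the temporary file name and the use of a PROT_NONE mapping to force
EFAULT are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: write 100 bytes, then issue a second write into the
 * same block that fails with EFAULT, then read the file back.
 * Returns 0 if the first 100 bytes survived, 1 if they did not,
 * -1 if the scenario could not be set up.
 */
static int failed_write_preserves_data(const char *dir)
{
	char path[4096], wbuf[100], rbuf[200];
	void *bad;
	ssize_t n;
	int fd, ret = -1;

	snprintf(path, sizeof(path), "%s/failed-write-XXXXXX", dir);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	/* An inaccessible mapping makes the kernel's copy-in fail. */
	bad = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (bad == MAP_FAILED)
		goto out;

	/* First write: allocates the block and succeeds. */
	memset(wbuf, 'w', sizeof(wbuf));
	if (write(fd, wbuf, sizeof(wbuf)) != (ssize_t)sizeof(wbuf))
		goto out_unmap;

	/* Second write: same block, must fail without copying anything. */
	n = write(fd, bad, 100);
	if (n != -1 || errno != EFAULT)
		goto out_unmap;

	/* The file must still hold exactly the first write's data. */
	n = pread(fd, rbuf, sizeof(rbuf), 0);
	ret = (n == 100 && memcmp(rbuf, wbuf, 100) == 0) ? 0 : 1;

out_unmap:
	munmap(bad, 4096);
out:
	close(fd);
	return ret;
}
```

If the analysis above holds, calling failed_write_preserves_data("/test1")
on the affected mount would be expected to return 1 on a kernel with the
regression and 0 otherwise.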

I'm not convinced the right solution is to just go back to the previous
code. That obviously reintroduces the original problem, but we'd also
still have a similar problem if the second (failed) write was a rewrite
of the first. The error handling of the second write would kill off the
blocks allocated and written to successfully by the first. I'm wondering
if the right thing to do here is factor in i_size as it appears that's
what this code did prior to the iomap transition. I'm not sure where
that leaves us wrt writes into sparse files, though. I may need to
play with this a bit..

Christoph, any thoughts on this?

Brian

> sh-4.2# TMPDIR=/test1 ./testcases/kernel/syscalls/writev/writev07
> tst_test.c:760: INFO: Timeout per run is 0h 05m 00s
> writev07.c:60: INFO: starting test with initial file offset: 0 
> writev07.c:82: INFO: got EFAULT
> writev07.c:87: FAIL: file was written to
> writev07.c:93: PASS: offset stayed unchanged
> writev07.c:60: INFO: starting test with initial file offset: 65 
> writev07.c:82: INFO: got EFAULT
> writev07.c:89: PASS: file stayed untouched
> writev07.c:93: PASS: offset stayed unchanged
> writev07.c:60: INFO: starting test with initial file offset: 4096 
> writev07.c:82: INFO: got EFAULT
> writev07.c:89: PASS: file stayed untouched
> writev07.c:93: PASS: offset stayed unchanged
> writev07.c:60: INFO: starting test with initial file offset: 4097 
> writev07.c:82: INFO: got EFAULT
> writev07.c:89: PASS: file stayed untouched
> writev07.c:93: PASS: offset stayed unchanged
> 
> Summary:
> passed   7
> failed   1
> skipped  0
> warnings 0
> sh-4.2# 
> sh-4.2# mkfs.xfs -V
> mkfs.xfs version 4.7.0
> sh-4.2# cd ../xfsprogs/
> sh-4.2# git log --oneline -1
> d7e1f5f xfsprogs: Release v4.7
> sh-4.2# 
> 
> Thanks,
> Xiong
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: LTP write03 writev07 xfs failures
  2017-02-27 16:09 ` Brian Foster
@ 2017-02-27 20:13   ` Brian Foster
  2017-02-28 14:04     ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Brian Foster @ 2017-02-27 20:13 UTC (permalink / raw)
  To: Xiong Zhou; +Cc: linux-xfs, linux-kernel, linux-fsdevel, Christoph Hellwig

On Mon, Feb 27, 2017 at 11:09:01AM -0500, Brian Foster wrote:
> cc Christoph
> 
> On Mon, Feb 27, 2017 at 12:22:20PM +0800, Xiong Zhou wrote:
> > Hi,
> > 
> > These 2 tests PASS on Linus tree commit:
> >   37c8596 Merge tag 'tty-4.11-rc1' of git://git.kernel.org/pub/scm/linux...
> > FAIL on commit:
> >   60e8d3e Merge tag 'pci-v4.11-changes' of git://git.kernel.org/pub/scm/...
> > 
> > LTP latest commit: c60d3ca move_pages12: include lapi/mmap.h
> > 
> > Steps:
> > 
> > sh-4.2# pwd
> > /root/ltp
> > sh-4.2# git log --oneline -1
> > c60d3ca move_pages12: include lapi/mmap.h
> > sh-4.2# uname -r
> > 4.10.0-master-60e8d3e+
> > sh-4.2# mount | grep test1
> > /dev/sda3 on /test1 type xfs (rw,relatime,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)
> > sh-4.2# xfs_info /test1
> > meta-data=/dev/sda3              isize=512    agcount=16, agsize=245696 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=1        finobt=1 spinodes=0
> > data     =                       bsize=4096   blocks=3931136, imaxpct=25
> >          =                       sunit=64     swidth=64 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal               bsize=4096   blocks=2560, version=2
> >          =                       sectsz=512   sunit=64 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > sh-4.2# 
> > sh-4.2# TMPDIR=/test1 ./testcases/kernel/syscalls/write/write03
> > write03     0  TINFO  :  Enter Block 1: test to check if write corrupts the file when write fails
> > write03     1  TFAIL  :  write03.c:125: failure of write(2) corrupted the file
> > write03     0  TINFO  :  Exit block 1
> > sh-4.2# 
> 
> On a quick test, both of these are reproduced after commit fa7f138ac4
> ("xfs: clear delalloc and cache on buffered write failure"). That patch
> fixed a problem where if the write allocates a block but fails to write
> anything (written == 0), we'd leave a delalloc block lingering in the
> inode.
> 
> With that change, this test now fails because it sends two writes within
> a single block. The first allocates the block, writes 100 bytes and
> returns successfully. The next attempts to write the next 100 bytes,
> fails and triggers the cleanup of the block because we can't tell
> whether this write or the previous had allocated it.
> 
> I'm not convinced the right solution is to just go back to the previous
> code. That obviously reintroduces the original problem, but we'd also
> still have a similar problem if the second (failed) write was a rewrite
> of the first. The error handling of the second write would kill off the
> blocks allocated and written to successfully by the first. I'm wondering
> if the right thing to do here is factor in i_size as it appears that's
> what this code did prior to the iomap transition. I'm not sure where
> that leaves us wrt writes into sparse files, though. I may need to
> play with this a bit..
> 

After playing around a bit, I don't think using i_size is the right
approach either. It just exacerbates the original problem on buffered
writes into sparse files. We can end up leaving around however many
delalloc blocks we've allocated.

I think we need a way to differentiate preexisting (previously written)
delalloc blocks from those allocated and unused by the current write. We
might be able to do that by looking at the pagecache, but I think that
means looking at the buffer state to make sure we handle sub-page block
sizes correctly. I.e., make *_iomap_end_delalloc() punch out all
delalloc blocks in the non-written range that are either not page backed
or not dirty+delalloc buffer backed. Hm?
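
The filtering rule proposed above can be modelled in plain C. This is only
a simplified sketch: struct blk_state, should_punch() and punch_unused()
are hypothetical names standing in for the extent, pagecache and
buffer_head state, and none of it is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block state for the model; not a kernel structure. */
struct blk_state {
	bool delalloc;		/* block carries a delalloc reservation */
	bool page_cached;	/* a pagecache page covers the block */
	bool buf_dirty;		/* the backing buffer is dirty */
	bool buf_delay;		/* the backing buffer is marked delalloc */
};

/* Punch only reservations that no earlier write left dirty data behind. */
static bool should_punch(const struct blk_state *b)
{
	if (!b->delalloc)
		return false;
	if (!b->page_cached)
		return true;
	return !(b->buf_dirty && b->buf_delay);
}

/*
 * Walk the non-written range [start, end) one block at a time and drop
 * the reservations that qualify. Returns the number of blocks punched.
 */
static size_t punch_unused(struct blk_state *blks, size_t start, size_t end)
{
	size_t punched = 0;

	assert(start <= end);
	for (size_t i = start; i < end; i++) {
		if (should_punch(&blks[i])) {
			blks[i].delalloc = false;
			punched++;
		}
	}
	return punched;
}
```

In this model a block that an earlier write dirtied keeps its reservation,
while a reservation with no page, or with a page whose buffer is not
dirty+delalloc, is dropped.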

Brian

> Christoph, any thoughts on this?
> 
> Brian
> 
> > sh-4.2# TMPDIR=/test1 ./testcases/kernel/syscalls/writev/writev07
> > tst_test.c:760: INFO: Timeout per run is 0h 05m 00s
> > writev07.c:60: INFO: starting test with initial file offset: 0 
> > writev07.c:82: INFO: got EFAULT
> > writev07.c:87: FAIL: file was written to
> > writev07.c:93: PASS: offset stayed unchanged
> > writev07.c:60: INFO: starting test with initial file offset: 65 
> > writev07.c:82: INFO: got EFAULT
> > writev07.c:89: PASS: file stayed untouched
> > writev07.c:93: PASS: offset stayed unchanged
> > writev07.c:60: INFO: starting test with initial file offset: 4096 
> > writev07.c:82: INFO: got EFAULT
> > writev07.c:89: PASS: file stayed untouched
> > writev07.c:93: PASS: offset stayed unchanged
> > writev07.c:60: INFO: starting test with initial file offset: 4097 
> > writev07.c:82: INFO: got EFAULT
> > writev07.c:89: PASS: file stayed untouched
> > writev07.c:93: PASS: offset stayed unchanged
> > 
> > Summary:
> > passed   7
> > failed   1
> > skipped  0
> > warnings 0
> > sh-4.2# 
> > sh-4.2# mkfs.xfs -V
> > mkfs.xfs version 4.7.0
> > sh-4.2# cd ../xfsprogs/
> > sh-4.2# git log --oneline -1
> > d7e1f5f xfsprogs: Release v4.7
> > sh-4.2# 
> > 
> > Thanks,
> > Xiong


* Re: LTP write03 writev07 xfs failures
  2017-02-27 20:13   ` Brian Foster
@ 2017-02-28 14:04     ` Christoph Hellwig
  2017-02-28 14:59       ` Brian Foster
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:04 UTC (permalink / raw)
  To: Brian Foster
  Cc: Xiong Zhou, linux-xfs, linux-kernel, linux-fsdevel, Christoph Hellwig

On Mon, Feb 27, 2017 at 03:13:35PM -0500, Brian Foster wrote:
> After playing around a bit, I don't think using i_size is the right
> approach either. It just exacerbates the original problem on buffered
> writes into sparse files. We can end up leaving around however many
> delalloc blocks we've allocated.
> 
> I think we need a way to differentiate preexisting (previously written)
> delalloc blocks from those allocated and unused by the current write. We
> might be able to do that by looking at the pagecache, but I think that
> means looking at the buffer state to make sure we handle sub-page block
> sizes correctly. I.e., make *_iomap_end_delalloc() punch out all
> delalloc blocks in the non-written range that are either not page backed
> or not dirty+delalloc buffer backed. Hm?

That sounds ugly, but off the top of my head I see no other way.  I'll need
to take a look at what the old pre-iomap code did there, as I think
none of these issues happened there.


* Re: LTP write03 writev07 xfs failures
  2017-02-28 14:04     ` Christoph Hellwig
@ 2017-02-28 14:59       ` Brian Foster
  2017-02-28 15:11         ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Brian Foster @ 2017-02-28 14:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Xiong Zhou, linux-xfs, linux-kernel, linux-fsdevel

On Tue, Feb 28, 2017 at 06:04:55AM -0800, Christoph Hellwig wrote:
> On Mon, Feb 27, 2017 at 03:13:35PM -0500, Brian Foster wrote:
> > After playing around a bit, I don't think using i_size is the right
> > approach either. It just exacerbates the original problem on buffered
> > writes into sparse files. We can end up leaving around however many
> > delalloc blocks we've allocated.
> > 
> > I think we need a way to differentiate preexisting (previously written)
> > delalloc blocks from those allocated and unused by the current write. We
> > might be able to do that by looking at the pagecache, but I think that
> > means looking at the buffer state to make sure we handle sub-page block
> > sizes correctly. I.e., make *_iomap_end_delalloc() punch out all
> > delalloc blocks in the non-written range that are either not page backed
> > or not dirty+delalloc buffer backed. Hm?
> 
> That sounds ugly, but off the top of my head I see no other way.  I'll need
> to take a look at what the old pre-iomap code did there, as I think
> none of these issues happened there.

Heh. I've appended what I'm currently playing around with. It's
certainly uglier, but not terrible IMO (outside of the fact that we have
to look at the buffer_heads). This seems to address the problem, but
still only lightly tested...

An entirely different approach may be to somehow or another
differentiate allocated delalloc blocks from "found" delalloc blocks in
the iomap_begin() handler, and then perhaps encode that into the iomap
such that the iomap_end() handler has an explicit reference of what to
punch. Personally, I wouldn't mind doing something like the below short
term to fix the regression and then incorporate an iomap enhancement to
break the buffer_head dependency.

Brian

---8<---

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 41662fb..5761dc6 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1066,6 +1066,30 @@ xfs_file_iomap_begin(
 	return error;
 }
 
+/* grab the bh for a page offset */
+static struct buffer_head *
+bh_for_pgoff(
+	struct page		*page,
+	loff_t			offset)
+{
+	struct buffer_head	*bh, *head;
+
+	ASSERT(offset < PAGE_SIZE);
+	ASSERT(page_has_buffers(page));
+
+	bh = head = page_buffers(page);
+	do {
+		struct buffer_head	*next = bh->b_this_page;
+		if (next == head)
+			break;
+		if (bh_offset(next) > offset)
+			break;
+		bh = next;
+	} while (true);
+
+	return bh;
+}
+
 static int
 xfs_file_iomap_end_delalloc(
 	struct xfs_inode	*ip,
@@ -1074,13 +1098,21 @@ xfs_file_iomap_end_delalloc(
 	ssize_t			written)
 {
 	struct xfs_mount	*mp = ip->i_mount;
+	struct address_space	*mapping = VFS_I(ip)->i_mapping;
+	struct page		*page;
+	struct buffer_head	*bh;
 	xfs_fileoff_t		start_fsb;
 	xfs_fileoff_t		end_fsb;
 	int			error = 0;
 
-	/* behave as if the write failed if drop writes is enabled */
-	if (xfs_mp_drop_writes(mp))
+	/*
+	 * Behave as if the write failed if drop writes is enabled. Punch out
+	 * the pagecache to trigger delalloc cleanup.
+	 */
+	if (xfs_mp_drop_writes(mp)) {
 		written = 0;
+		truncate_pagecache_range(VFS_I(ip), offset, offset + length - 1);
+	}
 
 	/*
 	 * start_fsb refers to the first unused block after a short write. If
@@ -1094,22 +1126,29 @@ xfs_file_iomap_end_delalloc(
 	end_fsb = XFS_B_TO_FSB(mp, offset + length);
 
 	/*
-	 * Trim back delalloc blocks if we didn't manage to write the whole
-	 * range reserved.
+	 * We have to clear out any unused delalloc blocks in the event of a
+	 * failed or short write. Otherwise, these blocks linger indefinitely as
+	 * they are not fronted by dirty pagecache.
 	 *
-	 * We don't need to care about racing delalloc as we hold i_mutex
-	 * across the reserve/allocate/unreserve calls. If there are delalloc
-	 * blocks in the range, they are ours.
+	 * To filter out blocks that were successfully written by a previous
+	 * write, walk the unwritten range and only punch out blocks that are
+	 * not backed by dirty+delalloc buffers.
 	 */
-	if (start_fsb < end_fsb) {
-		truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
-					 XFS_FSB_TO_B(mp, end_fsb) - 1);
+	for (; start_fsb < end_fsb; start_fsb++) {
+		offset = XFS_FSB_TO_B(mp, start_fsb);
+		page = find_get_page(mapping, offset >> PAGE_SHIFT);
+		if (page) {
+			bh = bh_for_pgoff(page, offset & ~PAGE_MASK);
+			if ((buffer_dirty(bh) && buffer_delay(bh))) {
+				put_page(page);
+				continue;
+			}
+			put_page(page);
+		}
 
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-					       end_fsb - start_fsb);
+		error = xfs_bmap_punch_delalloc_range(ip, start_fsb, 1);
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-
 		if (error && !XFS_FORCED_SHUTDOWN(mp)) {
 			xfs_alert(mp, "%s: unable to clean up ino %lld",
 				__func__, ip->i_ino);


* Re: LTP write03 writev07 xfs failures
  2017-02-28 14:59       ` Brian Foster
@ 2017-02-28 15:11         ` Christoph Hellwig
  2017-02-28 16:10           ` Brian Foster
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2017-02-28 15:11 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, Xiong Zhou, linux-xfs, linux-kernel, linux-fsdevel

On Tue, Feb 28, 2017 at 09:59:40AM -0500, Brian Foster wrote:
> Heh. I've appended what I'm currently playing around with. It's
> certainly uglier, but not terrible IMO (outside of the fact that we have
> to look at the buffer_heads). This seems to address the problem, but
> still only lightly tested...
> 
> An entirely different approach may be to somehow or another
> differentiate allocated delalloc blocks from "found" delalloc blocks in
> the iomap_begin() handler, and then perhaps encode that into the iomap
> such that the iomap_end() handler has an explicit reference of what to
> punch. Personally, I wouldn't mind doing something like the below short
> term to fix the regression and then incorporate an iomap enhancement to
> break the buffer_head dependency.

We actually have an IOMAP_F_NEW flag for this already, but so far it's
only used by the DIO and DAX code.


* Re: LTP write03 writev07 xfs failures
  2017-02-28 15:11         ` Christoph Hellwig
@ 2017-02-28 16:10           ` Brian Foster
  0 siblings, 0 replies; 8+ messages in thread
From: Brian Foster @ 2017-02-28 16:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Xiong Zhou, linux-xfs, linux-kernel, linux-fsdevel

On Tue, Feb 28, 2017 at 07:11:35AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 28, 2017 at 09:59:40AM -0500, Brian Foster wrote:
> > Heh. I've appended what I'm currently playing around with. It's
> > certainly uglier, but not terrible IMO (outside of the fact that we have
> > to look at the buffer_heads). This seems to address the problem, but
> > still only lightly tested...
> > 
> > An entirely different approach may be to somehow or another
> > differentiate allocated delalloc blocks from "found" delalloc blocks in
> > the iomap_begin() handler, and then perhaps encode that into the iomap
> > such that the iomap_end() handler has an explicit reference of what to
> > punch. Personally, I wouldn't mind doing something like the below short
> > term to fix the regression and then incorporate an iomap enhancement to
> > break the buffer_head dependency.
> 
> We actually have an IOMAP_F_NEW flag for this already, but so far it's
> only used by the DIO and DAX code.

Ok. I think that has the potential to provide a cleaner and simpler
solution. I don't think it's as straightforward a change to enable
that for the buffered write code, however. I don't think we can just set
the NEW flag when we do xfs_bmapi_reserve_delalloc() because something
like the following would still break:

  xfs_io -fc "pwrite 16k 4k" -c "pwrite -b 32k 0 32k" <file>

If the second write happened to fail, AFAICT it would still punch out
the block allocated by the first. So I suppose we'd have to tweak
reserve_delalloc() to trim the returned extent, or perhaps add a flag
that skips the xfs_bmbt_get_all() call to update got after we insert the
extent, and thus only return what was allocated..?

(The latter actually seems to work on a very quick test, see appended
diff..).
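
The difference between returning the merged extent and returning only the
newly reserved range can be illustrated with a toy model of the xfs_io
sequence above. This is a sketch under simplifying assumptions: blocks are
4k, the map is a plain bool array, only the first mapping of the second
write is modelled, and struct extent, reserve(), punch() and
first_write_survives() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 8

/* Hypothetical extent in 4k blocks: [start, start + len). */
struct extent {
	unsigned int start;
	unsigned int len;
};

/*
 * Reserve [start, start + len) in the delalloc map. With "entire" set,
 * report the extent after merging with delalloc neighbours; otherwise
 * report only the range that was just reserved.
 */
static struct extent reserve(bool map[NBLOCKS], unsigned int start,
			     unsigned int len, bool entire)
{
	struct extent got = { start, len };

	assert(start + len <= NBLOCKS);
	for (unsigned int i = start; i < start + len; i++)
		map[i] = true;

	if (entire) {
		while (got.start > 0 && map[got.start - 1]) {
			got.start--;
			got.len++;
		}
		while (got.start + got.len < NBLOCKS &&
		       map[got.start + got.len])
			got.len++;
	}
	return got;
}

/* Failed write: punch everything the mapping reported. */
static void punch(bool map[NBLOCKS], struct extent got)
{
	for (unsigned int i = got.start; i < got.start + got.len; i++)
		map[i] = false;
}

/*
 * Replays "pwrite 16k 4k" followed by a failing "pwrite 0 32k".
 * Returns true if block 4, which holds the first write's data, keeps
 * its reservation.
 */
static bool first_write_survives(bool entire)
{
	bool map[NBLOCKS] = { false };
	struct extent got;

	reserve(map, 4, 1, entire);		/* first write: block 4 */
	got = reserve(map, 0, 4, entire);	/* hole in front of block 4 */
	punch(map, got);			/* second write fails */
	return map[4];
}
```

In this model first_write_survives(true) is false, because the merged
extent covers block 4 and the cleanup punches it, while
first_write_survives(false) is true, because only blocks 0-3 are reported
as new.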

Brian

---8<---

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index a9c66d4..18b927d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4159,7 +4159,8 @@ xfs_bmapi_reserve_delalloc(
 	xfs_filblks_t		prealloc,
 	struct xfs_bmbt_irec	*got,
 	xfs_extnum_t		*lastx,
-	int			eof)
+	int			eof,
+	int			flags)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
@@ -4242,7 +4243,8 @@ xfs_bmapi_reserve_delalloc(
 	 * Update our extent pointer, given that xfs_bmap_add_extent_hole_delay
 	 * might have merged it into one of the neighbouring ones.
 	 */
-	xfs_bmbt_get_all(xfs_iext_get_ext(ifp, *lastx), got);
+	if (flags & XFS_BMAPI_ENTIRE)
+		xfs_bmbt_get_all(xfs_iext_get_ext(ifp, *lastx), got);
 
 	/*
 	 * Tag the inode if blocks were preallocated. Note that COW fork
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index cdef87d..459ba6b 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -243,7 +243,8 @@ int	xfs_bmap_shift_extents(struct xfs_trans *tp, struct xfs_inode *ip,
 int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
 int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
 		xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc,
-		struct xfs_bmbt_irec *got, xfs_extnum_t *lastx, int eof);
+		struct xfs_bmbt_irec *got, xfs_extnum_t *lastx, int eof,
+		int flags);
 
 enum xfs_bmap_intent_type {
 	XFS_BMAP_MAP = 1,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 41662fb..9b1b2a4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -613,7 +613,8 @@ xfs_file_iomap_begin_delay(
 
 retry:
 	error = xfs_bmapi_reserve_delalloc(ip, XFS_DATA_FORK, offset_fsb,
-			end_fsb - offset_fsb, prealloc_blocks, &got, &idx, eof);
+			end_fsb - offset_fsb, prealloc_blocks, &got, &idx,
+			eof, 0);
 	switch (error) {
 	case 0:
 		break;
@@ -629,7 +630,7 @@ xfs_file_iomap_begin_delay(
 	default:
 		goto out_unlock;
 	}
-
+	iomap->flags = IOMAP_F_NEW;
 	trace_xfs_iomap_alloc(ip, offset, count, 0, &got);
 done:
 	if (isnullstartblock(got.br_startblock))
@@ -1071,16 +1072,22 @@ xfs_file_iomap_end_delalloc(
 	struct xfs_inode	*ip,
 	loff_t			offset,
 	loff_t			length,
-	ssize_t			written)
+	ssize_t			written,
+	struct iomap		*iomap)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		start_fsb;
 	xfs_fileoff_t		end_fsb;
 	int			error = 0;
 
+	trace_printk("%d: ino 0x%llx offset %llu len %llu writ %ld new %d\n", __LINE__,
+		     ip->i_ino, offset, length, written, !!(iomap->flags & IOMAP_F_NEW));
+
 	/* behave as if the write failed if drop writes is enabled */
-	if (xfs_mp_drop_writes(mp))
+	if (xfs_mp_drop_writes(mp)) {
+		iomap->flags |= IOMAP_F_NEW;
 		written = 0;
+	}
 
 	/*
 	 * start_fsb refers to the first unused block after a short write. If
@@ -1101,7 +1108,7 @@ xfs_file_iomap_end_delalloc(
 	 * across the reserve/allocate/unreserve calls. If there are delalloc
 	 * blocks in the range, they are ours.
 	 */
-	if (start_fsb < end_fsb) {
+	if (iomap->flags & IOMAP_F_NEW && start_fsb < end_fsb) {
 		truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
 					 XFS_FSB_TO_B(mp, end_fsb) - 1);
 
@@ -1131,7 +1138,7 @@ xfs_file_iomap_end(
 {
 	if ((flags & IOMAP_WRITE) && iomap->type == IOMAP_DELALLOC)
 		return xfs_file_iomap_end_delalloc(XFS_I(inode), offset,
-				length, written);
+				length, written, iomap);
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index da6d08f..d76a2b6 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -313,7 +313,8 @@ xfs_reflink_reserve_cow(
 		return error;
 
 	error = xfs_bmapi_reserve_delalloc(ip, XFS_COW_FORK, imap->br_startoff,
-			imap->br_blockcount, 0, &got, &idx, eof);
+			imap->br_blockcount, 0, &got, &idx, eof,
+			XFS_BMAPI_ENTIRE);
 	if (error == -ENOSPC || error == -EDQUOT)
 		trace_xfs_reflink_cow_enospc(ip, imap);
 	if (error)

