* [PATCH v2 0/2] fix gfs2 truncate race
@ 2016-06-20 18:22 Benjamin Marzinski
  2016-06-20 18:22 ` [PATCH v2 1/2] fs: export __block_write_full_page Benjamin Marzinski
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Benjamin Marzinski @ 2016-06-20 18:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: cluster-devel, Steven Whitehouse

If gfs2 tries to write out a page of a file with data journaling
enabled while that file is being truncated, it can cause a kernel
panic. The problem is that the page being written out may lie past
the end of the file. If this happens, gfs2 will try to invalidate the
page.  However, it can't invalidate a journaled page with buffers in the
in-memory log without revoking those buffers, so that there is no
chance of corruption if the machine crashes and later has its journal
replayed.  If the page is being written out by the log flush code, there
is no way for it to add a revoke entry to the log during the flush.

To solve this, gfs2 simply writes out journaled data pages that lie
past the end of the file. This is safe because the truncate is still in
progress, and everything will be cleaned up before the file is unlocked
or the blocks are marked free. Doing this involves both a gfs2 change
and exporting an additional symbol from the vfs.

These v2 patches leave the code in gfs2_writepage_common alone, and
replace that call in gfs2_jdata_writepage with the necessary checks.  Doing
this highlighted the fact that __gfs2_jdata_writepage will never actually
run during a transaction.  Unless starting and then ending an untouched
transaction has some important side effect that I can't see, all the
transaction code can be pulled out of gfs2_jdata_writepage with no change
to how gfs2 currently operates.

However, looking through the commit history, I don't think that this is
intentional.  It looks more like gfs2_jdata_writepage lost the ability to
usefully start transactions and write out the page during them, which
would allow it to invalidate the page in those cases.  Restoring this won't
solve the truncate race bug, since the race can happen at times when
gfs2_jdata_writepage won't (and can't) start a transaction.

Steve, it looks like your commit 1bb7322fd0d5abdce396de51cbc5dbc489523018
caused this change. Do you have any idea whether it was intentional? Clearly
it isn't breaking things, but we should either remove the transaction
completely, as this patch does, or make it possible to actually write out
the page during the transaction.
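
For reviewers who want the resulting control flow at a glance, here is a
small userspace model of the decision that gfs2_jdata_writepage makes after
patch 2/2. It is only a sketch: the enum and the function
jdata_writepage_action are hypothetical names invented for illustration,
and the three booleans stand in for gfs2_glock_is_held_excl(),
PageChecked() and current->journal_info. Note that i_size does not appear
as an input at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels for this model; not kernel identifiers. */
enum jdata_action {
	JDATA_SKIP,	/* assertion failed: unlock the page, write nothing */
	JDATA_REDIRTY,	/* redirty and unlock the page, retry later */
	JDATA_WRITE,	/* hand the page to __gfs2_jdata_writepage */
};

/*
 * Model of the checks in the patched gfs2_jdata_writepage, in the same
 * order as the diff. The file size is deliberately not consulted.
 */
static enum jdata_action jdata_writepage_action(bool glock_held_excl,
						bool page_checked,
						bool in_transaction)
{
	if (!glock_held_excl)
		return JDATA_SKIP;
	if (page_checked || in_transaction)
		return JDATA_REDIRTY;
	return JDATA_WRITE;
}
```

In this model a page past the end of the file during a log flush takes
the JDATA_WRITE path like any other page, which is the behaviour the
series is after.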

Benjamin Marzinski (2):
  fs: export __block_write_full_page
  gfs2: writeout truncated pages

 fs/buffer.c                 |  3 ++-
 fs/gfs2/aops.c              | 49 +++++++++++++++++++++++++++++++--------------
 include/linux/buffer_head.h |  3 +++
 3 files changed, 39 insertions(+), 16 deletions(-)

-- 
1.8.3.1



* [PATCH v2 1/2] fs: export __block_write_full_page
  2016-06-20 18:22 [PATCH v2 0/2] fix gfs2 truncate race Benjamin Marzinski
@ 2016-06-20 18:22 ` Benjamin Marzinski
  2016-06-20 18:22 ` [PATCH v2 2/2] gfs2: writeout truncated pages Benjamin Marzinski
  2016-06-27 15:22 ` [Cluster-devel] [PATCH v2 0/2] fix gfs2 truncate race Bob Peterson
  2 siblings, 0 replies; 4+ messages in thread
From: Benjamin Marzinski @ 2016-06-20 18:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: cluster-devel, Steven Whitehouse

gfs2 needs to be able to skip the check to see if a page is outside of
the file size when writing it out. gfs2 can get into a situation where
it needs to flush its in-memory log to disk while a truncate is in
progress. If the file being truncated has data journaling enabled, it is
possible that there are data blocks in the log that are past the end of
the file. gfs2 can't finish the log flush without either writing these
blocks out or revoking them. Otherwise, if the node crashed, it could
overwrite subsequent changes made by other nodes in the cluster when
its journal was replayed.

Unfortunately, there is no way to add log entries to the log during a
flush. So gfs2 simply writes out the page instead. This situation can
only occur when the truncate code still has the file locked exclusively,
and hasn't marked this block as free in the metadata (which happens
later in trunc_dealloc).  After gfs2 writes this page out, the truncation
code will shortly invalidate it and write out any revokes if necessary.

In order to make this work, gfs2 needs to be able to skip the check for
writes outside the file size. Since the check exists in
block_write_full_page, this patch exports __block_write_full_page, which
doesn't have the check.
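
To make the skipped check concrete, the following is a userspace sketch
of the range test as I understand it from block_write_full_page. It is
an illustration under assumptions, not the kernel code: MODEL_PAGE_SHIFT,
MODEL_PAGE_SIZE and page_fully_past_eof are hypothetical names, and a
4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page geometry for this userspace model (4 KiB pages). */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

/*
 * Returns true when the page at page_index lies entirely at or beyond
 * i_size. This models the case in which block_write_full_page declines
 * to write the page, and which __block_write_full_page does not test.
 */
static bool page_fully_past_eof(uint64_t page_index, uint64_t i_size)
{
	uint64_t end_index = i_size >> MODEL_PAGE_SHIFT;
	uint64_t offset = i_size & (MODEL_PAGE_SIZE - 1);

	/* Beyond the last page, or exactly at a page-aligned end of file. */
	return page_index >= end_index + 1 ||
	       (page_index == end_index && offset == 0);
}
```

For example, with i_size = 8192 the model reports pages 2 and 3 as fully
past the end of the file and page 1 as inside it; with i_size = 8193,
page 2 straddles the end and is not reported.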

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 fs/buffer.c                 | 3 ++-
 include/linux/buffer_head.h | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 754813a..6c15012 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1687,7 +1687,7 @@ static struct buffer_head *create_page_buffers(struct page *page, struct inode *
  * WB_SYNC_ALL, the writes are posted using WRITE_SYNC; this
  * causes the writes to be flagged as synchronous writes.
  */
-static int __block_write_full_page(struct inode *inode, struct page *page,
+int __block_write_full_page(struct inode *inode, struct page *page,
 			get_block_t *get_block, struct writeback_control *wbc,
 			bh_end_io_t *handler)
 {
@@ -1848,6 +1848,7 @@ recover:
 	unlock_page(page);
 	goto done;
 }
+EXPORT_SYMBOL(__block_write_full_page);
 
 /*
  * If a page has any new buffers, zero them out here, and mark them uptodate
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index d48daa3..7e14e54 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -208,6 +208,9 @@ void block_invalidatepage(struct page *page, unsigned int offset,
 			  unsigned int length);
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
+int __block_write_full_page(struct inode *inode, struct page *page,
+			get_block_t *get_block, struct writeback_control *wbc,
+			bh_end_io_t *handler);
 int block_read_full_page(struct page*, get_block_t*);
 int block_is_partially_uptodate(struct page *page, unsigned long from,
 				unsigned long count);
-- 
1.8.3.1



* [PATCH v2 2/2] gfs2: writeout truncated pages
  2016-06-20 18:22 [PATCH v2 0/2] fix gfs2 truncate race Benjamin Marzinski
  2016-06-20 18:22 ` [PATCH v2 1/2] fs: export __block_write_full_page Benjamin Marzinski
@ 2016-06-20 18:22 ` Benjamin Marzinski
  2016-06-27 15:22 ` [Cluster-devel] [PATCH v2 0/2] fix gfs2 truncate race Bob Peterson
  2 siblings, 0 replies; 4+ messages in thread
From: Benjamin Marzinski @ 2016-06-20 18:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: cluster-devel, Steven Whitehouse

When gfs2 attempts to write a page to a file that is being truncated,
and notices that the page is completely outside of the file size, it
tries to invalidate it.  However, this may require a transaction for
journaled data files to revoke any buffers from the page on the active
items list. Unfortunately, this can happen inside a log flush, where a
transaction cannot be started. Also, gfs2 may need to be able to remove
the buffer from the ail1 list before it can finish the log flush.

To deal with this, when writing a page of a file with data journaling
enabled, gfs2 now skips the check to see if the write is outside the file
size, and simply writes it anyway. This situation can only occur when
the truncate code still has the file locked exclusively, and hasn't
marked this block as free in the metadata (which happens later in
trunc_dealloc).  After gfs2 writes this page out, the truncation code
will shortly invalidate it and write out any revokes if necessary.

To do this, gfs2 now implements its own version of block_write_full_page
without the check, and calls the newly exported __block_write_full_page.
It also no longer calls gfs2_writepage_common from gfs2_jdata_writepage.
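
The one piece of block_write_full_page that the gfs2 version keeps is
the zeroing of the tail of a page that straddles i_size. A minimal
userspace sketch of that step follows, assuming 4 KiB pages and using a
plain byte buffer in place of struct page. The names MODEL_PAGE_SHIFT,
MODEL_PAGE_SIZE and zero_tail_if_straddling are hypothetical, and
memset stands in for zero_user_segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed page geometry for this userspace model (4 KiB pages). */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  ((size_t)1 << MODEL_PAGE_SHIFT)

/*
 * If the page at page_index straddles i_size, zero the bytes of the
 * page buffer that lie beyond i_size. Pages entirely inside or entirely
 * outside the file are left untouched. Returns the number of bytes
 * zeroed.
 */
static size_t zero_tail_if_straddling(unsigned char *page,
				      uint64_t page_index, uint64_t i_size)
{
	uint64_t end_index = i_size >> MODEL_PAGE_SHIFT;
	size_t offset = (size_t)(i_size & (MODEL_PAGE_SIZE - 1));

	if (page_index != end_index || offset == 0)
		return 0;

	memset(page + offset, 0, MODEL_PAGE_SIZE - offset);
	return MODEL_PAGE_SIZE - offset;
}
```

Unlike the range check, this step never refuses a page: a page wholly
past i_size is simply passed through unchanged, which is what lets the
log flush write it out.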

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 fs/gfs2/aops.c | 49 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 34 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 37b7bc1..82df368 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -140,6 +140,32 @@ static int gfs2_writepage(struct page *page, struct writeback_control *wbc)
 	return nobh_writepage(page, gfs2_get_block_noalloc, wbc);
 }
 
+/* This is the same as calling block_write_full_page, but it also
+ * writes pages outside of i_size
+ */
+int gfs2_write_full_page(struct page *page, get_block_t *get_block,
+			 struct writeback_control *wbc)
+{
+	struct inode * const inode = page->mapping->host;
+	loff_t i_size = i_size_read(inode);
+	const pgoff_t end_index = i_size >> PAGE_SHIFT;
+	unsigned offset;
+
+	/*
+	 * The page straddles i_size.  It must be zeroed out on each and every
+	 * writepage invocation because it may be mmapped.  "A file is mapped
+	 * in multiples of the page size.  For a file that is not a multiple of
+	 * the  page size, the remaining memory is zeroed when mapped, and
+	 * writes to that region are not written out to the file."
+	 */
+	offset = i_size & (PAGE_SIZE-1);
+	if (page->index == end_index && offset)
+		zero_user_segment(page, offset, PAGE_SIZE);
+
+	return __block_write_full_page(inode, page, get_block, wbc,
+				       end_buffer_async_write);
+}
+
 /**
  * __gfs2_jdata_writepage - The core of jdata writepage
  * @page: The page to write
@@ -165,7 +191,7 @@ static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *w
 		}
 		gfs2_page_add_databufs(ip, page, 0, sdp->sd_vfs->s_blocksize-1);
 	}
-	return block_write_full_page(page, gfs2_get_block_noalloc, wbc);
+	return gfs2_write_full_page(page, gfs2_get_block_noalloc, wbc);
 }
 
 /**
@@ -180,27 +206,20 @@ static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *w
 static int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct inode *inode = page->mapping->host;
+	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	int ret;
-	int done_trans = 0;
 
-	if (PageChecked(page)) {
-		if (wbc->sync_mode != WB_SYNC_ALL)
-			goto out_ignore;
-		ret = gfs2_trans_begin(sdp, RES_DINODE + 1, 0);
-		if (ret)
-			goto out_ignore;
-		done_trans = 1;
-	}
-	ret = gfs2_writepage_common(page, wbc);
-	if (ret > 0)
-		ret = __gfs2_jdata_writepage(page, wbc);
-	if (done_trans)
-		gfs2_trans_end(sdp);
+	if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
+		goto out;
+	if (PageChecked(page) || current->journal_info)
+		goto out_ignore;
+	ret = __gfs2_jdata_writepage(page, wbc);
 	return ret;
 
 out_ignore:
 	redirty_page_for_writepage(wbc, page);
+out:
 	unlock_page(page);
 	return 0;
 }
-- 
1.8.3.1



* Re: [Cluster-devel] [PATCH v2 0/2] fix gfs2 truncate race
  2016-06-20 18:22 [PATCH v2 0/2] fix gfs2 truncate race Benjamin Marzinski
  2016-06-20 18:22 ` [PATCH v2 1/2] fs: export __block_write_full_page Benjamin Marzinski
  2016-06-20 18:22 ` [PATCH v2 2/2] gfs2: writeout truncated pages Benjamin Marzinski
@ 2016-06-27 15:22 ` Bob Peterson
  2 siblings, 0 replies; 4+ messages in thread
From: Bob Peterson @ 2016-06-27 15:22 UTC (permalink / raw)
  To: Benjamin Marzinski; +Cc: linux-fsdevel, cluster-devel

Hi,

Thanks. These are now applied to the for-next branch of the linux-gfs2 tree:
https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs?h=for-next&id=b4bba38909c21689de21355e84259cb7b38f25ac
https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs?h=for-next&id=fd4c5748b8d3f7420e8932ed0bde3d53cc8acc9d

Regards,

Bob Peterson
Red Hat File Systems
