All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/12 v2] ext4: Several simplifications and fixes
@ 2013-01-18 12:00 Jan Kara
  2013-01-18 12:00 ` [PATCH 01/12] ext4: Always use ext4_bio_write_page() for writeout Jan Kara
                   ` (11 more replies)
  0 siblings, 12 replies; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

  Hi,

  this patch series contains several simplifications of ext4 code I came across
when thinking how to simplify handling of uninitialized extents. The series has
somewhat grown since last time. I have quite a bit more patches in my tree
(e.g. removing the need for page pointers from ext4_ioend_t so the structure
shrunk a lot, moving of PageWriteback completion after extent conversion, ...)
but they need more work and I'm not sure they will be ready for the next merge
window. But I hope everything in this series should be simple enough so that
it can go in.

The series contains two fixes:
[PATCH 02/12] ext4: Use redirty_page_for_writepage() in ext4_bio_write_page()
and
[PATCH 11/12] ext4: Make ext4_bio_writepage() handle unprepared buffers
[PATCH 12/12] ext4: Fix ext4_writepage() to achieve data=ordered guarantees

The rest are simplifications and cleanups. In total the series removes ~250
LOC.

								Honza

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 01/12] ext4: Always use ext4_bio_write_page() for writeout
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-28 14:31   ` Theodore Ts'o
  2013-01-18 12:00 ` [PATCH 02/12] ext4: Use redirty_page_for_writepage() in ext4_bio_write_page() Jan Kara
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Currently we sometimes used block_write_full_page() and sometimes
ext4_bio_write_page() for writeback (depending on mount options and call
path). Let's always use ext4_bio_write_page() to simplify things a bit.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |    2 -
 fs/ext4/inode.c   |  128 ++++++-----------------------------------------------
 fs/ext4/page-io.c |    2 -
 fs/ext4/super.c   |    7 +--
 4 files changed, 16 insertions(+), 123 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8462eb3..fb40ee2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -215,7 +215,6 @@ typedef struct ext4_io_end {
 	struct list_head	list;		/* per-file finished IO list */
 	struct inode		*inode;		/* file being written to */
 	unsigned int		flag;		/* unwritten or not */
-	struct page		*page;		/* for writepage() path */
 	loff_t			offset;		/* offset in the file */
 	ssize_t			size;		/* size of the extent */
 	struct work_struct	work;		/* data work queue */
@@ -985,7 +984,6 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DIOREAD_NOLOCK	0x400000 /* Enable support for dio read nolocking */
 #define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
-#define EXT4_MOUNT_MBLK_IO_SUBMIT	0x4000000 /* multi-block io submits */
 #define EXT4_MOUNT_DELALLOC		0x8000000 /* Delalloc support */
 #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cb1c1ab..f95b511 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -134,8 +134,6 @@ static inline int ext4_begin_ordered_truncate(struct inode *inode,
 static void ext4_invalidatepage(struct page *page, unsigned long offset);
 static int noalloc_get_block_write(struct inode *inode, sector_t iblock,
 				   struct buffer_head *bh_result, int create);
-static int ext4_set_bh_endio(struct buffer_head *bh, struct inode *inode);
-static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate);
 static int __ext4_journalled_writepage(struct page *page, unsigned int len);
 static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh);
 static int ext4_discard_partial_page_buffers_no_lock(handle_t *handle,
@@ -808,11 +806,10 @@ int ext4_walk_page_buffers(handle_t *handle,
  * and the commit_write().  So doing the jbd2_journal_start at the start of
  * prepare_write() is the right place.
  *
- * Also, this function can nest inside ext4_writepage() ->
- * block_write_full_page(). In that case, we *know* that ext4_writepage()
- * has generated enough buffer credits to do the whole page.  So we won't
- * block on the journal in that case, which is good, because the caller may
- * be PF_MEMALLOC.
+ * Also, this function can nest inside ext4_writepage().  In that case, we
+ * *know* that ext4_writepage() has generated enough buffer credits to do the
+ * whole page.  So we won't block on the journal in that case, which is good,
+ * because the caller may be PF_MEMALLOC.
  *
  * By accident, ext4 can be reentered when a transaction is open via
  * quota file writes.  If we were to commit the transaction while thus
@@ -1463,18 +1460,9 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
 			 */
 			if (unlikely(journal_data && PageChecked(page)))
 				err = __ext4_journalled_writepage(page, len);
-			else if (test_opt(inode->i_sb, MBLK_IO_SUBMIT))
+			else
 				err = ext4_bio_write_page(&io_submit, page,
 							  len, mpd->wbc);
-			else if (buffer_uninit(page_bufs)) {
-				ext4_set_bh_endio(page_bufs, inode);
-				err = block_write_full_page_endio(page,
-					noalloc_get_block_write,
-					mpd->wbc, ext4_end_io_buffer_write);
-			} else
-				err = block_write_full_page(page,
-					noalloc_get_block_write, mpd->wbc);
-
 			if (!err)
 				mpd->pages_written++;
 			/*
@@ -1891,16 +1879,16 @@ int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
 }
 
 /*
- * This function is used as a standard get_block_t calback function
- * when there is no desire to allocate any blocks.  It is used as a
- * callback function for block_write_begin() and block_write_full_page().
- * These functions should only try to map a single block at a time.
+ * This function is used as a standard get_block_t calback function when there
+ * is no desire to allocate any blocks.  It is used as a callback function for
+ * block_write_begin().  These functions should only try to map a single block
+ * at a time.
  *
  * Since this function doesn't do block allocations even if the caller
  * requests it by passing in create=1, it is critically important that
  * any caller checks to make sure that any buffer heads are returned
  * by this function are either all already mapped or marked for
- * delayed allocation before calling  block_write_full_page().  Otherwise,
+ * delayed allocation before calling ext4_bio_write_page().  Otherwise,
  * b_blocknr could be left unitialized, and the page write functions will
  * be taken by surprise.
  */
@@ -2040,6 +2028,7 @@ static int ext4_writepage(struct page *page,
 	unsigned int len;
 	struct buffer_head *page_bufs = NULL;
 	struct inode *inode = page->mapping->host;
+	struct ext4_io_submit io_submit;
 
 	trace_ext4_writepage(page);
 	size = i_size_read(inode);
@@ -2089,14 +2078,9 @@ static int ext4_writepage(struct page *page,
 		 */
 		return __ext4_journalled_writepage(page, len);
 
-	if (buffer_uninit(page_bufs)) {
-		ext4_set_bh_endio(page_bufs, inode);
-		ret = block_write_full_page_endio(page, noalloc_get_block_write,
-					    wbc, ext4_end_io_buffer_write);
-	} else
-		ret = block_write_full_page(page, noalloc_get_block_write,
-					    wbc);
-
+	memset(&io_submit, 0, sizeof(io_submit));
+	ret = ext4_bio_write_page(&io_submit, page, len, wbc);
+	ext4_io_submit(&io_submit);
 	return ret;
 }
 
@@ -2858,26 +2842,6 @@ ext4_readpages(struct file *file, struct address_space *mapping,
 	return mpage_readpages(mapping, pages, nr_pages, ext4_get_block);
 }
 
-static void ext4_invalidatepage_free_endio(struct page *page, unsigned long offset)
-{
-	struct buffer_head *head, *bh;
-	unsigned int curr_off = 0;
-
-	if (!page_has_buffers(page))
-		return;
-	head = bh = page_buffers(page);
-	do {
-		if (offset <= curr_off && test_clear_buffer_uninit(bh)
-					&& bh->b_private) {
-			ext4_free_io_end(bh->b_private);
-			bh->b_private = NULL;
-			bh->b_end_io = NULL;
-		}
-		curr_off = curr_off + bh->b_size;
-		bh = bh->b_this_page;
-	} while (bh != head);
-}
-
 static void ext4_invalidatepage(struct page *page, unsigned long offset)
 {
 	journal_t *journal = EXT4_JOURNAL(page->mapping->host);
@@ -2885,11 +2849,6 @@ static void ext4_invalidatepage(struct page *page, unsigned long offset)
 	trace_ext4_invalidatepage(page, offset);
 
 	/*
-	 * free any io_end structure allocated for buffers to be discarded
-	 */
-	if (ext4_should_dioread_nolock(page->mapping->host))
-		ext4_invalidatepage_free_endio(page, offset);
-	/*
 	 * If it's a full truncate we just forget about the pending dirtying
 	 */
 	if (offset == 0)
@@ -2977,65 +2936,6 @@ out:
 	ext4_add_complete_io(io_end);
 }
 
-static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
-{
-	ext4_io_end_t *io_end = bh->b_private;
-	struct inode *inode;
-
-	if (!test_clear_buffer_uninit(bh) || !io_end)
-		goto out;
-
-	if (!(io_end->inode->i_sb->s_flags & MS_ACTIVE)) {
-		ext4_msg(io_end->inode->i_sb, KERN_INFO,
-			 "sb umounted, discard end_io request for inode %lu",
-			 io_end->inode->i_ino);
-		ext4_free_io_end(io_end);
-		goto out;
-	}
-
-	/*
-	 * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
-	 * but being more careful is always safe for the future change.
-	 */
-	inode = io_end->inode;
-	ext4_set_io_unwritten_flag(inode, io_end);
-	ext4_add_complete_io(io_end);
-out:
-	bh->b_private = NULL;
-	bh->b_end_io = NULL;
-	clear_buffer_uninit(bh);
-	end_buffer_async_write(bh, uptodate);
-}
-
-static int ext4_set_bh_endio(struct buffer_head *bh, struct inode *inode)
-{
-	ext4_io_end_t *io_end;
-	struct page *page = bh->b_page;
-	loff_t offset = (sector_t)page->index << PAGE_CACHE_SHIFT;
-	size_t size = bh->b_size;
-
-retry:
-	io_end = ext4_init_io_end(inode, GFP_ATOMIC);
-	if (!io_end) {
-		pr_warn_ratelimited("%s: allocation fail\n", __func__);
-		schedule();
-		goto retry;
-	}
-	io_end->offset = offset;
-	io_end->size = size;
-	/*
-	 * We need to hold a reference to the page to make sure it
-	 * doesn't get evicted before ext4_end_io_work() has a chance
-	 * to convert the extent from written to unwritten.
-	 */
-	io_end->page = page;
-	get_page(io_end->page);
-
-	bh->b_private = io_end;
-	bh->b_end_io = ext4_end_io_buffer_write;
-	return 0;
-}
-
 /*
  * For ext4 extent files, ext4 will do direct-io write to holes,
  * preallocated extents, and those write extend the file, no need to
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 0016fbc..ddb3d40 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -73,8 +73,6 @@ void ext4_free_io_end(ext4_io_end_t *io)
 	BUG_ON(!list_empty(&io->list));
 	BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
 
-	if (io->page)
-		put_page(io->page);
 	for (i = 0; i < io->num_io_pages; i++)
 		put_io_page(io->pages[i]);
 	io->num_io_pages = 0;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e09f7d1..afbe974 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1280,8 +1280,8 @@ static const match_table_t tokens = {
 	{Opt_stripe, "stripe=%u"},
 	{Opt_delalloc, "delalloc"},
 	{Opt_nodelalloc, "nodelalloc"},
-	{Opt_mblk_io_submit, "mblk_io_submit"},
-	{Opt_nomblk_io_submit, "nomblk_io_submit"},
+	{Opt_removed, "mblk_io_submit"},
+	{Opt_removed, "nomblk_io_submit"},
 	{Opt_block_validity, "block_validity"},
 	{Opt_noblock_validity, "noblock_validity"},
 	{Opt_inode_readahead_blks, "inode_readahead_blks=%u"},
@@ -1414,8 +1414,6 @@ static const struct mount_opts {
 	{Opt_bsd_df, EXT4_MOUNT_MINIX_DF, MOPT_CLEAR},
 	{Opt_grpid, EXT4_MOUNT_GRPID, MOPT_SET},
 	{Opt_nogrpid, EXT4_MOUNT_GRPID, MOPT_CLEAR},
-	{Opt_mblk_io_submit, EXT4_MOUNT_MBLK_IO_SUBMIT, MOPT_SET},
-	{Opt_nomblk_io_submit, EXT4_MOUNT_MBLK_IO_SUBMIT, MOPT_CLEAR},
 	{Opt_block_validity, EXT4_MOUNT_BLOCK_VALIDITY, MOPT_SET},
 	{Opt_noblock_validity, EXT4_MOUNT_BLOCK_VALIDITY, MOPT_CLEAR},
 	{Opt_dioread_nolock, EXT4_MOUNT_DIOREAD_NOLOCK, MOPT_SET},
@@ -3373,7 +3371,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 #ifdef CONFIG_EXT4_FS_POSIX_ACL
 	set_opt(sb, POSIX_ACL);
 #endif
-	set_opt(sb, MBLK_IO_SUBMIT);
 	if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_DATA)
 		set_opt(sb, JOURNAL_DATA);
 	else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_ORDERED)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 02/12] ext4: Use redirty_page_for_writepage() in ext4_bio_write_page()
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
  2013-01-18 12:00 ` [PATCH 01/12] ext4: Always use ext4_bio_write_page() for writeout Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-28 14:34   ` Theodore Ts'o
  2013-01-18 12:00 ` [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO Jan Kara
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

When we cannot write a page we should use redirty_page_for_writepage()
instead of plain set_page_dirty(). That tells writeback code we have
problems, redirties only the page (redirtying buffers is not needed),
and updates mm accounting of failed page writes.

Also move clearing of buffer dirty flag after io_submit_add_bh(). At that
moment we are sure buffer will be going to disk.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/page-io.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index ddb3d40..05795f1 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -23,6 +23,7 @@
 #include <linux/workqueue.h>
 #include <linux/kernel.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -434,7 +435,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 
 	io_page = kmem_cache_alloc(io_page_cachep, GFP_NOFS);
 	if (!io_page) {
-		set_page_dirty(page);
+		redirty_page_for_writepage(wbc, page);
 		unlock_page(page);
 		return -ENOMEM;
 	}
@@ -466,7 +467,6 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 			set_buffer_uptodate(bh);
 			continue;
 		}
-		clear_buffer_dirty(bh);
 		ret = io_submit_add_bh(io, io_page, inode, wbc, bh);
 		if (ret) {
 			/*
@@ -474,9 +474,10 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 			 * we can do but mark the page as dirty, and
 			 * better luck next time.
 			 */
-			set_page_dirty(page);
+			redirty_page_for_writepage(wbc, page);
 			break;
 		}
+		clear_buffer_dirty(bh);
 	}
 	unlock_page(page);
 	/*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
  2013-01-18 12:00 ` [PATCH 01/12] ext4: Always use ext4_bio_write_page() for writeout Jan Kara
  2013-01-18 12:00 ` [PATCH 02/12] ext4: Use redirty_page_for_writepage() in ext4_bio_write_page() Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-22 11:11   ` Dmitry Monakhov
  2013-01-18 12:00 ` [PATCH 04/12] ext4: Disable merging of uninitialized extents Jan Kara
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

When using indirect blocks there is no possibility to have any unwritten
extents. So wait for them in ext4_ind_direct_IO() is just bogus.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 20862f9..993247c 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -807,11 +807,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 
 retry:
 	if (rw == READ && ext4_should_dioread_nolock(inode)) {
-		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
-			mutex_lock(&inode->i_mutex);
-			ext4_flush_unwritten_io(inode);
-			mutex_unlock(&inode->i_mutex);
-		}
 		/*
 		 * Nolock dioread optimization may be dynamically disabled
 		 * via ext4_inode_block_unlocked_dio(). Check inode's state
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
                   ` (2 preceding siblings ...)
  2013-01-18 12:00 ` [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-24  9:49   ` Dmitry Monakhov
  2013-02-09 17:10   ` REGRESSION: " Theodore Ts'o
  2013-01-18 12:00 ` [PATCH 05/12] ext4: Remove unnecessary wait for extent conversion in ext4_fallocate() Jan Kara
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Merging of uninitialized extents creates all sorts of interesting race
possibilities when writeback / DIO races with fallocate. Thus
ext4_convert_unwritten_extents_endio() has to deal with a case where
extent to be converted needs to be split out first. That isn't nice
for two reasons:

1) It may need allocation of extent tree block so ENOSPC is possible.
2) It complicates end_io handling code

So we disable merging of uninitialized extents which allows us to simplify
the code. Extents will get merged after they are converted to initialized
ones.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/extents.c |   61 +++++++++++++++-------------------------------------
 1 files changed, 18 insertions(+), 43 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 26af228..f1ce33a 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -54,9 +54,6 @@
 #define EXT4_EXT_MARK_UNINIT1	0x2  /* mark first half uninitialized */
 #define EXT4_EXT_MARK_UNINIT2	0x4  /* mark second half uninitialized */
 
-#define EXT4_EXT_DATA_VALID1	0x8  /* first half contains valid data */
-#define EXT4_EXT_DATA_VALID2	0x10 /* second half contains valid data */
-
 static __le32 ext4_extent_block_csum(struct inode *inode,
 				     struct ext4_extent_header *eh)
 {
@@ -1579,20 +1576,17 @@ int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	unsigned short ext1_ee_len, ext2_ee_len, max_len;
+	unsigned ext1_ee_len, ext2_ee_len;
 
 	/*
-	 * Make sure that either both extents are uninitialized, or
-	 * both are _not_.
+	 * Make sure that both extents are initialized. We don't merge
+	 * uninitialized extents so that we can be sure that end_io code has
+	 * the extent that was written properly split out and conversion to
+	 * initialized is trivial.
 	 */
-	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+	if (ext4_ext_is_uninitialized(ex1) || ext4_ext_is_uninitialized(ex2))
 		return 0;
 
-	if (ext4_ext_is_uninitialized(ex1))
-		max_len = EXT_UNINIT_MAX_LEN;
-	else
-		max_len = EXT_INIT_MAX_LEN;
-
 	ext1_ee_len = ext4_ext_get_actual_len(ex1);
 	ext2_ee_len = ext4_ext_get_actual_len(ex2);
 
@@ -1605,7 +1599,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (ext1_ee_len + ext2_ee_len > max_len)
+	if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (ext1_ee_len >= 4)
@@ -2959,9 +2953,6 @@ static int ext4_split_extent_at(handle_t *handle,
 	unsigned int ee_len, depth;
 	int err = 0;
 
-	BUG_ON((split_flag & (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2)) ==
-	       (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2));
-
 	ext_debug("ext4_split_extents_at: inode %lu, logical"
 		"block %llu\n", inode->i_ino, (unsigned long long)split);
 
@@ -3020,14 +3011,7 @@ static int ext4_split_extent_at(handle_t *handle,
 
 	err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
 	if (err == -ENOSPC && (EXT4_EXT_MAY_ZEROOUT & split_flag)) {
-		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
-			if (split_flag & EXT4_EXT_DATA_VALID1)
-				err = ext4_ext_zeroout(inode, ex2);
-			else
-				err = ext4_ext_zeroout(inode, ex);
-		} else
-			err = ext4_ext_zeroout(inode, &orig_ex);
-
+		err = ext4_ext_zeroout(inode, &orig_ex);
 		if (err)
 			goto fix_extent_len;
 		/* update the extent length and mark as initialized */
@@ -3085,8 +3069,6 @@ static int ext4_split_extent(handle_t *handle,
 		if (uninitialized)
 			split_flag1 |= EXT4_EXT_MARK_UNINIT1 |
 				       EXT4_EXT_MARK_UNINIT2;
-		if (split_flag & EXT4_EXT_DATA_VALID2)
-			split_flag1 |= EXT4_EXT_DATA_VALID1;
 		err = ext4_split_extent_at(handle, inode, path,
 				map->m_lblk + map->m_len, split_flag1, flags1);
 		if (err)
@@ -3099,8 +3081,7 @@ static int ext4_split_extent(handle_t *handle,
 		return PTR_ERR(path);
 
 	if (map->m_lblk >= ee_block) {
-		split_flag1 = split_flag & (EXT4_EXT_MAY_ZEROOUT |
-					    EXT4_EXT_DATA_VALID2);
+		split_flag1 = split_flag & EXT4_EXT_MAY_ZEROOUT;
 		if (uninitialized)
 			split_flag1 |= EXT4_EXT_MARK_UNINIT1;
 		if (split_flag & EXT4_EXT_MARK_UNINIT2)
@@ -3379,8 +3360,7 @@ static int ext4_split_unwritten_extents(handle_t *handle,
 
 	split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0;
 	split_flag |= EXT4_EXT_MARK_UNINIT2;
-	if (flags & EXT4_GET_BLOCKS_CONVERT)
-		split_flag |= EXT4_EXT_DATA_VALID2;
+
 	flags |= EXT4_GET_BLOCKS_PRE_IO;
 	return ext4_split_extent(handle, inode, path, map, split_flag, flags);
 }
@@ -3405,20 +3385,15 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
 		"block %llu, max_blocks %u\n", inode->i_ino,
 		  (unsigned long long)ee_block, ee_len);
 
-	/* If extent is larger than requested then split is required */
+	/* Extent is larger than requested? */
 	if (ee_block != map->m_lblk || ee_len > map->m_len) {
-		err = ext4_split_unwritten_extents(handle, inode, map, path,
-						   EXT4_GET_BLOCKS_CONVERT);
-		if (err < 0)
-			goto out;
-		ext4_ext_drop_refs(path);
-		path = ext4_ext_find_extent(inode, map->m_lblk, path);
-		if (IS_ERR(path)) {
-			err = PTR_ERR(path);
-			goto out;
-		}
-		depth = ext_depth(inode);
-		ex = path[depth].p_ext;
+		EXT4_ERROR_INODE(inode, "Written extent modified before IO"
+			" finished: extent logical block %llu, len %u; IO"
+			" logical block %llu, len %u\n",
+			(unsigned long long)ee_block, ee_len,
+			(unsigned long long)map->m_lblk, map->m_len);
+		err = -EIO;
+		goto out;
 	}
 
 	err = ext4_ext_get_access(handle, inode, path + depth);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 05/12] ext4: Remove unnecessary wait for extent conversion in ext4_fallocate()
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
                   ` (3 preceding siblings ...)
  2013-01-18 12:00 ` [PATCH 04/12] ext4: Disable merging of uninitialized extents Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-18 12:00 ` [PATCH 06/12] ext4: Move work from io_end to inode Jan Kara
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Now that we don't merge uninitialized extents anymore, ext4_fallocate()
is free to operate on the inode while there are still some extent
conversions pending - it won't disturb them in any way.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/extents.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f1ce33a..5c7a46a 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4406,8 +4406,6 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (len <= EXT_UNINIT_MAX_LEN << blkbits)
 		flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
 
-	/* Prevent race condition between unwritten */
-	ext4_flush_unwritten_io(inode);
 retry:
 	while (ret >= 0 && ret < max_blocks) {
 		map.m_lblk = map.m_lblk + ret;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 06/12] ext4: Move work from io_end to inode
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
                   ` (4 preceding siblings ...)
  2013-01-18 12:00 ` [PATCH 05/12] ext4: Remove unnecessary wait for extent conversion in ext4_fallocate() Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-28 14:45   ` Theodore Ts'o
  2013-01-18 12:00 ` [PATCH 07/12] ext4: Simplify list handling in ext4_do_flush_completed_IO() Jan Kara
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

It does not make much sense to have struct work in ext4_io_end_t because we
always use it for only one ext4_io_end_t per inode (the first one in the
i_completed_io list). So just move the structure to inode itself.  This also
allows for a small simplification in processing io_end structures.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |    6 +++---
 fs/ext4/page-io.c |   33 +++++++++------------------------
 fs/ext4/super.c   |    1 +
 3 files changed, 13 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index fb40ee2..7bbb53f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -194,8 +194,7 @@ struct mpage_da_data {
  */
 #define	EXT4_IO_END_UNWRITTEN	0x0001
 #define EXT4_IO_END_ERROR	0x0002
-#define EXT4_IO_END_QUEUED	0x0004
-#define EXT4_IO_END_DIRECT	0x0008
+#define EXT4_IO_END_DIRECT	0x0004
 
 struct ext4_io_page {
 	struct page	*p_page;
@@ -217,7 +216,6 @@ typedef struct ext4_io_end {
 	unsigned int		flag;		/* unwritten or not */
 	loff_t			offset;		/* offset in the file */
 	ssize_t			size;		/* size of the extent */
-	struct work_struct	work;		/* data work queue */
 	struct kiocb		*iocb;		/* iocb struct for AIO */
 	int			result;		/* error value for AIO */
 	int			num_io_pages;   /* for writepages() */
@@ -929,6 +927,7 @@ struct ext4_inode_info {
 	spinlock_t i_completed_io_lock;
 	atomic_t i_ioend_count;	/* Number of outstanding io_end structs */
 	atomic_t i_unwritten; /* Nr. of inflight conversions pending */
+	struct work_struct i_unwritten_work;	/* deferred extent conversion */
 
 	spinlock_t i_block_reservation_lock;
 
@@ -2535,6 +2534,7 @@ extern void ext4_exit_pageio(void);
 extern void ext4_ioend_wait(struct inode *);
 extern void ext4_free_io_end(ext4_io_end_t *io);
 extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
+extern void ext4_end_io_work(struct work_struct *work);
 extern void ext4_io_submit(struct ext4_io_submit *io);
 extern int ext4_bio_write_page(struct ext4_io_submit *io,
 			       struct page *page,
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 05795f1..a029017 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -151,16 +151,13 @@ void ext4_add_complete_io(ext4_io_end_t *io_end)
 	wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
 
 	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
-	if (list_empty(&ei->i_completed_io_list)) {
-		io_end->flag |= EXT4_IO_END_QUEUED;
-		queue_work(wq, &io_end->work);
-	}
+	if (list_empty(&ei->i_completed_io_list))
+		queue_work(wq, &ei->i_unwritten_work);
 	list_add_tail(&io_end->list, &ei->i_completed_io_list);
 	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
 }
 
-static int ext4_do_flush_completed_IO(struct inode *inode,
-				      ext4_io_end_t *work_io)
+static int ext4_do_flush_completed_IO(struct inode *inode)
 {
 	ext4_io_end_t *io;
 	struct list_head unwritten, complete, to_free;
@@ -191,19 +188,7 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
 	while (!list_empty(&complete)) {
 		io = list_entry(complete.next, ext4_io_end_t, list);
 		io->flag &= ~EXT4_IO_END_UNWRITTEN;
-		/* end_io context can not be destroyed now because it still
-		 * used by queued worker. Worker thread will destroy it later */
-		if (io->flag & EXT4_IO_END_QUEUED)
-			list_del_init(&io->list);
-		else
-			list_move(&io->list, &to_free);
-	}
-	/* If we are called from worker context, it is time to clear queued
-	 * flag, and destroy it's end_io if it was converted already */
-	if (work_io) {
-		work_io->flag &= ~EXT4_IO_END_QUEUED;
-		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
-			list_add_tail(&work_io->list, &to_free);
+		list_move(&io->list, &to_free);
 	}
 	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
 
@@ -218,10 +203,11 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
 /*
  * work on completed aio dio IO, to convert unwritten extents to extents
  */
-static void ext4_end_io_work(struct work_struct *work)
+void ext4_end_io_work(struct work_struct *work)
 {
-	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
-	ext4_do_flush_completed_IO(io->inode, io);
+	struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
+						  i_unwritten_work);
+	ext4_do_flush_completed_IO(&ei->vfs_inode);
 }
 
 int ext4_flush_unwritten_io(struct inode *inode)
@@ -229,7 +215,7 @@ int ext4_flush_unwritten_io(struct inode *inode)
 	int ret;
 	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex) &&
 		     !(inode->i_state & I_FREEING));
-	ret = ext4_do_flush_completed_IO(inode, NULL);
+	ret = ext4_do_flush_completed_IO(inode);
 	ext4_unwritten_wait(inode);
 	return ret;
 }
@@ -240,7 +226,6 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
 	if (io) {
 		atomic_inc(&EXT4_I(inode)->i_ioend_count);
 		io->inode = inode;
-		INIT_WORK(&io->work, ext4_end_io_work);
 		INIT_LIST_HEAD(&io->list);
 	}
 	return io;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index afbe974..060fa38 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -960,6 +960,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_datasync_tid = 0;
 	atomic_set(&ei->i_ioend_count, 0);
 	atomic_set(&ei->i_unwritten, 0);
+	INIT_WORK(&ei->i_unwritten_work, ext4_end_io_work);
 
 	return &ei->vfs_inode;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 07/12] ext4: Simplify list handling in ext4_do_flush_completed_IO()
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
                   ` (5 preceding siblings ...)
  2013-01-18 12:00 ` [PATCH 06/12] ext4: Move work from io_end to inode Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-28 14:51   ` Theodore Ts'o
  2013-01-18 12:00 ` [PATCH 08/12] ext4: Remove __ext4_journalled_writepage() from mpage_da_submit_io() Jan Kara
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara, Dmitry Monakhov

The function splices i_completed_io_list to its private list first. From that
moment on we don't need any lock for working with io_end structures because all
io_end structure on the list are only our own. So we can remove the other two
lists in the function and free io_end immediately after we are done with it.

CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/page-io.c |   18 +-----------------
 1 files changed, 1 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index a029017..3fb385c 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -160,14 +160,11 @@ void ext4_add_complete_io(ext4_io_end_t *io_end)
 static int ext4_do_flush_completed_IO(struct inode *inode)
 {
 	ext4_io_end_t *io;
-	struct list_head unwritten, complete, to_free;
+	struct list_head unwritten;
 	unsigned long flags;
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	int err, ret = 0;
 
-	INIT_LIST_HEAD(&complete);
-	INIT_LIST_HEAD(&to_free);
-
 	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
 	dump_completed_IO(inode);
 	list_replace_init(&ei->i_completed_io_list, &unwritten);
@@ -181,20 +178,7 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
 		err = ext4_end_io(io);
 		if (unlikely(!ret && err))
 			ret = err;
-
-		list_add_tail(&io->list, &complete);
-	}
-	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
-	while (!list_empty(&complete)) {
-		io = list_entry(complete.next, ext4_io_end_t, list);
 		io->flag &= ~EXT4_IO_END_UNWRITTEN;
-		list_move(&io->list, &to_free);
-	}
-	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 08/12] ext4: Remove __ext4_journalled_writepage() from mpage_da_submit_io()
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
                   ` (6 preceding siblings ...)
  2013-01-18 12:00 ` [PATCH 07/12] ext4: Simplify list handling in ext4_do_flush_completed_IO() Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-28 14:40   ` Theodore Ts'o
  2013-01-18 12:00 ` [PATCH 09/12] ext4: Dirty page has always buffers attached Jan Kara
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

We don't support delayed allocation in data=journal mode. So checking for it in
mpage_da_submit_io() doesn't make really sence. If we ever decide to extend
delayed allocation support to data=journal mode, adding
__ext4_journalled_writepage() call will be the least of problems we have to
solve. Most likely we'd have to implement separate writepages call anyways
because we don't have transaction credits for writing more than a single page
so mapping of page buffers would have to be done differently.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |   13 ++-----------
 1 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f95b511..359d06e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1354,7 +1354,6 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
 	loff_t size = i_size_read(inode);
 	unsigned int len, block_start;
 	struct buffer_head *bh, *page_bufs = NULL;
-	int journal_data = ext4_should_journal_data(inode);
 	sector_t pblock = 0, cur_logical = 0;
 	struct ext4_io_submit io_submit;
 
@@ -1453,16 +1452,8 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
 				block_commit_write(page, 0, len);
 
 			clear_page_dirty_for_io(page);
-			/*
-			 * Delalloc doesn't support data journalling,
-			 * but eventually maybe we'll lift this
-			 * restriction.
-			 */
-			if (unlikely(journal_data && PageChecked(page)))
-				err = __ext4_journalled_writepage(page, len);
-			else
-				err = ext4_bio_write_page(&io_submit, page,
-							  len, mpd->wbc);
+			err = ext4_bio_write_page(&io_submit, page, len,
+						  mpd->wbc);
 			if (!err)
 				mpd->pages_written++;
 			/*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 09/12] ext4: Dirty page has always buffers attached
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
                   ` (7 preceding siblings ...)
  2013-01-18 12:00 ` [PATCH 08/12] ext4: Remove __ext4_journalled_writepage() from mpage_da_submit_io() Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-28 17:55   ` Theodore Ts'o
  2013-01-18 12:00 ` [PATCH 10/12] ext4: Simplify mpage_add_bh_to_extent() Jan Kara
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

ext4_writepage(), write_cache_pages_da(), and mpage_da_submit_io() doesn't
have to deal with the case when page doesn't have buffers. We attach buffers
to a page in ->write_begin() and ->page_mkwrite() which covers all places
where a page can become dirty.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |  145 ++++++++++++++-----------------------------------------
 1 files changed, 36 insertions(+), 109 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 359d06e..0eb27e1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -132,8 +132,6 @@ static inline int ext4_begin_ordered_truncate(struct inode *inode,
 }
 
 static void ext4_invalidatepage(struct page *page, unsigned long offset);
-static int noalloc_get_block_write(struct inode *inode, sector_t iblock,
-				   struct buffer_head *bh_result, int create);
 static int __ext4_journalled_writepage(struct page *page, unsigned int len);
 static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh);
 static int ext4_discard_partial_page_buffers_no_lock(handle_t *handle,
@@ -1374,7 +1372,7 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
 		if (nr_pages == 0)
 			break;
 		for (i = 0; i < nr_pages; i++) {
-			int commit_write = 0, skip_page = 0;
+			int skip_page = 0;
 			struct page *page = pvec.pages[i];
 
 			index = page->index;
@@ -1396,27 +1394,9 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
 			BUG_ON(!PageLocked(page));
 			BUG_ON(PageWriteback(page));
 
-			/*
-			 * If the page does not have buffers (for
-			 * whatever reason), try to create them using
-			 * __block_write_begin.  If this fails,
-			 * skip the page and move on.
-			 */
-			if (!page_has_buffers(page)) {
-				if (__block_write_begin(page, 0, len,
-						noalloc_get_block_write)) {
-				skip_page:
-					unlock_page(page);
-					continue;
-				}
-				commit_write = 1;
-			}
-
 			bh = page_bufs = page_buffers(page);
 			block_start = 0;
 			do {
-				if (!bh)
-					goto skip_page;
 				if (map && (cur_logical >= map->m_lblk) &&
 				    (cur_logical <= (map->m_lblk +
 						     (map->m_len - 1)))) {
@@ -1444,12 +1424,10 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
 				pblock++;
 			} while (bh != page_bufs);
 
-			if (skip_page)
-				goto skip_page;
-
-			if (commit_write)
-				/* mark the buffer_heads as dirty & uptodate */
-				block_commit_write(page, 0, len);
+			if (skip_page) {
+				unlock_page(page);
+				continue;
+			}
 
 			clear_page_dirty_for_io(page);
 			err = ext4_bio_write_page(&io_submit, page, len,
@@ -1869,27 +1847,6 @@ int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
 	return 0;
 }
 
-/*
- * This function is used as a standard get_block_t calback function when there
- * is no desire to allocate any blocks.  It is used as a callback function for
- * block_write_begin().  These functions should only try to map a single block
- * at a time.
- *
- * Since this function doesn't do block allocations even if the caller
- * requests it by passing in create=1, it is critically important that
- * any caller checks to make sure that any buffer heads are returned
- * by this function are either all already mapped or marked for
- * delayed allocation before calling ext4_bio_write_page().  Otherwise,
- * b_blocknr could be left unitialized, and the page write functions will
- * be taken by surprise.
- */
-static int noalloc_get_block_write(struct inode *inode, sector_t iblock,
-				   struct buffer_head *bh_result, int create)
-{
-	BUG_ON(bh_result->b_size != inode->i_sb->s_blocksize);
-	return _ext4_get_block(inode, iblock, bh_result, 0);
-}
-
 static int bget_one(handle_t *handle, struct buffer_head *bh)
 {
 	get_bh(bh);
@@ -2014,7 +1971,7 @@ out:
 static int ext4_writepage(struct page *page,
 			  struct writeback_control *wbc)
 {
-	int ret = 0, commit_write = 0;
+	int ret = 0;
 	loff_t size;
 	unsigned int len;
 	struct buffer_head *page_bufs = NULL;
@@ -2028,21 +1985,6 @@ static int ext4_writepage(struct page *page,
 	else
 		len = PAGE_CACHE_SIZE;
 
-	/*
-	 * If the page does not have buffers (for whatever reason),
-	 * try to create them using __block_write_begin.  If this
-	 * fails, redirty the page and move on.
-	 */
-	if (!page_has_buffers(page)) {
-		if (__block_write_begin(page, 0, len,
-					noalloc_get_block_write)) {
-		redirty_page:
-			redirty_page_for_writepage(wbc, page);
-			unlock_page(page);
-			return 0;
-		}
-		commit_write = 1;
-	}
 	page_bufs = page_buffers(page);
 	if (ext4_walk_page_buffers(NULL, page_bufs, 0, len, NULL,
 				   ext4_bh_delay_or_unwritten)) {
@@ -2056,11 +1998,10 @@ static int ext4_writepage(struct page *page,
 		 */
 		WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
 								PF_MEMALLOC);
-		goto redirty_page;
+		redirty_page_for_writepage(wbc, page);
+		unlock_page(page);
+		return 0;
 	}
-	if (commit_write)
-		/* now mark the buffer_heads as dirty and uptodate */
-		block_commit_write(page, 0, len);
 
 	if (PageChecked(page) && ext4_should_journal_data(inode))
 		/*
@@ -2203,51 +2144,37 @@ static int write_cache_pages_da(handle_t *handle,
 			logical = (sector_t) page->index <<
 				(PAGE_CACHE_SHIFT - inode->i_blkbits);
 
-			if (!page_has_buffers(page)) {
-				mpage_add_bh_to_extent(mpd, logical,
-						       PAGE_CACHE_SIZE,
-						       (1 << BH_Dirty) | (1 << BH_Uptodate));
-				if (mpd->io_done)
-					goto ret_extent_tail;
-			} else {
+			/* Add all dirty buffers to mpd */
+			head = page_buffers(page);
+			bh = head;
+			do {
+				BUG_ON(buffer_locked(bh));
 				/*
-				 * Page with regular buffer heads,
-				 * just add all dirty ones
+				 * We need to try to allocate unmapped blocks
+				 * in the same page.  Otherwise we won't make
+				 * progress with the page in ext4_writepage
 				 */
-				head = page_buffers(page);
-				bh = head;
-				do {
-					BUG_ON(buffer_locked(bh));
+				if (ext4_bh_delay_or_unwritten(NULL, bh)) {
+					mpage_add_bh_to_extent(mpd, logical,
+							       bh->b_size,
+							       bh->b_state);
+					if (mpd->io_done)
+						goto ret_extent_tail;
+				} else if (buffer_dirty(bh) && buffer_mapped(bh)) {
 					/*
-					 * We need to try to allocate
-					 * unmapped blocks in the same page.
-					 * Otherwise we won't make progress
-					 * with the page in ext4_writepage
+					 * mapped dirty buffer. We need to
+					 * update the b_state because we look
+					 * at b_state in mpage_da_map_blocks.
+					 * We don't update b_size because if we
+					 * find an unmapped buffer_head later
+					 * we need to use the b_state flag of
+					 * that buffer_head.
 					 */
-					if (ext4_bh_delay_or_unwritten(NULL, bh)) {
-						mpage_add_bh_to_extent(mpd, logical,
-								       bh->b_size,
-								       bh->b_state);
-						if (mpd->io_done)
-							goto ret_extent_tail;
-					} else if (buffer_dirty(bh) && (buffer_mapped(bh))) {
-						/*
-						 * mapped dirty buffer. We need
-						 * to update the b_state
-						 * because we look at b_state
-						 * in mpage_da_map_blocks.  We
-						 * don't update b_size because
-						 * if we find an unmapped
-						 * buffer_head later we need to
-						 * use the b_state flag of that
-						 * buffer_head.
-						 */
-						if (mpd->b_size == 0)
-							mpd->b_state = bh->b_state & BH_FLAGS;
-					}
-					logical++;
-				} while ((bh = bh->b_this_page) != head);
-			}
+					if (mpd->b_size == 0)
+						mpd->b_state = bh->b_state & BH_FLAGS;
+				}
+				logical++;
+			} while ((bh = bh->b_this_page) != head);
 
 			if (nr_to_write > 0) {
 				nr_to_write--;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 10/12] ext4: Simplify mpage_add_bh_to_extent()
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
                   ` (8 preceding siblings ...)
  2013-01-18 12:00 ` [PATCH 09/12] ext4: Dirty page has always buffers attached Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-28 18:06   ` Theodore Ts'o
  2013-01-18 12:00 ` [PATCH 11/12] ext4: Make ext4_bio_writepage() handle unprepared buffers Jan Kara
  2013-01-18 12:00 ` [PATCH 12/12] ext4: Fix ext4_writepage() to achieve data=ordered guarantees Jan Kara
  11 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

The argument b_size of mpage_add_bh_to_extent() was bogus since it was
always == blocksize (which we can easily derive from inode->i_blkbits).
Also second branch of condition:
	if (nrblocks >= EXT4_MAX_TRANS_DATA) {
	} else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) >
						EXT4_MAX_TRANS_DATA) {
	}
was never taken because (b_size >> mpd->inode->i_blkbits) == 1.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |   29 +++++++++--------------------
 1 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0eb27e1..3b6bb61 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1647,16 +1647,16 @@ submit_io:
  *
  * @mpd->lbh - extent of blocks
  * @logical - logical number of the block in the file
- * @bh - bh of the block (used to access block's state)
+ * @b_state - b_state of the buffer head added
  *
  * the function is used to collect contig. blocks in same state
  */
-static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
-				   sector_t logical, size_t b_size,
+static void mpage_add_bh_to_extent(struct mpage_da_data *mpd, sector_t logical,
 				   unsigned long b_state)
 {
 	sector_t next;
-	int nrblocks = mpd->b_size >> mpd->inode->i_blkbits;
+	int blkbits = mpd->inode->i_blkbits;
+	int nrblocks = mpd->b_size >> blkbits;
 
 	/*
 	 * XXX Don't go larger than mballoc is willing to allocate
@@ -1664,11 +1664,11 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
 	 * mpage_da_submit_io() into this function and then call
 	 * ext4_map_blocks() multiple times in a loop
 	 */
-	if (nrblocks >= 8*1024*1024/mpd->inode->i_sb->s_blocksize)
+	if (nrblocks >= (8*1024*1024 >> blkbits))
 		goto flush_it;
 
-	/* check if thereserved journal credits might overflow */
-	if (!(ext4_test_inode_flag(mpd->inode, EXT4_INODE_EXTENTS))) {
+	/* check if the reserved journal credits might overflow */
+	if (!ext4_test_inode_flag(mpd->inode, EXT4_INODE_EXTENTS)) {
 		if (nrblocks >= EXT4_MAX_TRANS_DATA) {
 			/*
 			 * With non-extent format we are limited by the journal
@@ -1677,16 +1677,6 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
 			 * nrblocks.  So limit nrblocks.
 			 */
 			goto flush_it;
-		} else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) >
-				EXT4_MAX_TRANS_DATA) {
-			/*
-			 * Adding the new buffer_head would make it cross the
-			 * allowed limit for which we have journal credit
-			 * reserved. So limit the new bh->b_size
-			 */
-			b_size = (EXT4_MAX_TRANS_DATA - nrblocks) <<
-						mpd->inode->i_blkbits;
-			/* we will do mpage_da_submit_io in the next loop */
 		}
 	}
 	/*
@@ -1694,7 +1684,7 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
 	 */
 	if (mpd->b_size == 0) {
 		mpd->b_blocknr = logical;
-		mpd->b_size = b_size;
+		mpd->b_size = 1 << blkbits;
 		mpd->b_state = b_state & BH_FLAGS;
 		return;
 	}
@@ -1704,7 +1694,7 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
 	 * Can we merge the block to our big extent?
 	 */
 	if (logical == next && (b_state & BH_FLAGS) == mpd->b_state) {
-		mpd->b_size += b_size;
+		mpd->b_size += 1 << blkbits;
 		return;
 	}
 
@@ -2156,7 +2146,6 @@ static int write_cache_pages_da(handle_t *handle,
 				 */
 				if (ext4_bh_delay_or_unwritten(NULL, bh)) {
 					mpage_add_bh_to_extent(mpd, logical,
-							       bh->b_size,
 							       bh->b_state);
 					if (mpd->io_done)
 						goto ret_extent_tail;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 11/12] ext4: Make ext4_bio_writepage() handle unprepared buffers
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
                   ` (9 preceding siblings ...)
  2013-01-18 12:00 ` [PATCH 10/12] ext4: Simplify mpage_add_bh_to_extent() Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-29  1:59   ` Theodore Ts'o
  2013-01-18 12:00 ` [PATCH 12/12] ext4: Fix ext4_writepage() to achieve data=ordered guarantees Jan Kara
  11 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

So far ext4_bio_writepage() unconditionally cleared dirty bit on all buffers
underlying the page. That implicitely assumes we can write all buffers. So
far that is true because callers call into ext4_bio_writepage() make sure
all buffers in the page are mapped but
a) it's a data corruption bug waiting to happen
b) in data=ordered mode when blocksize < pagesize we do need to write pages
   that may have only some of dirty buffers mapped.

So change ext4_bio_writepage() to skip buffers that cannot be written without
clearing their dirty bit.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/page-io.c |   17 +++++++++--------
 1 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 3fb385c..0290bf8 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -350,14 +350,6 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
 		unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
 	}
 
-	if (!buffer_mapped(bh) || buffer_delay(bh)) {
-		if (!buffer_mapped(bh))
-			clear_buffer_dirty(bh);
-		if (io->io_bio)
-			ext4_io_submit(io);
-		return 0;
-	}

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 12/12] ext4: Fix ext4_writepage() to achieve data=ordered guarantees
  2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
                   ` (10 preceding siblings ...)
  2013-01-18 12:00 ` [PATCH 11/12] ext4: Make ext4_bio_writepage() handle unprepared buffers Jan Kara
@ 2013-01-18 12:00 ` Jan Kara
  2013-01-29  2:08   ` Theodore Ts'o
  11 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-18 12:00 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

So far ext4_writepage() skipped writing pages that had any delayed or
unwritten buffers attached. When blocksize < pagesize this breaks
data=ordered mode guarantees as we can have a page with one freshly
allocated buffer whose allocation is part of the committing transaction
and another buffer in the page which is delayed or unwritten. So fix this
problem by calling ext4_bio_writepage() anyway. It will submit mapped
buffers and leave others alone.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |   36 ++++++++++++++++++++++--------------
 1 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3b6bb61..c4d45d5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1967,6 +1967,7 @@ static int ext4_writepage(struct page *page,
 	struct buffer_head *page_bufs = NULL;
 	struct inode *inode = page->mapping->host;
 	struct ext4_io_submit io_submit;
+	int redirty = 0;
 
 	trace_ext4_writepage(page);
 	size = i_size_read(inode);
@@ -1976,21 +1977,28 @@ static int ext4_writepage(struct page *page,
 		len = PAGE_CACHE_SIZE;
 
 	page_bufs = page_buffers(page);
-	if (ext4_walk_page_buffers(NULL, page_bufs, 0, len, NULL,
-				   ext4_bh_delay_or_unwritten)) {
-		/*
-		 * We don't want to do block allocation, so redirty
-		 * the page and return.  We may reach here when we do
-		 * a journal commit via journal_submit_inode_data_buffers.
-		 * We can also reach here via shrink_page_list but it
-		 * should never be for direct reclaim so warn if that
-		 * happens
-		 */
-		WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
-								PF_MEMALLOC);
+	redirty = ext4_walk_page_buffers(NULL, page_bufs, 0, len, NULL,
+				   ext4_bh_delay_or_unwritten);
+	/*
+	 * We cannot do block allocation or other extent handling in this
+	 * function. If there are buffers needing that, we have to redirty
+	 * the page. But we may reach here when we do a journal commit via
+	 * journal_submit_inode_data_buffers() and in that case we must write
+	 * allocated buffers to achieve data=ordered mode guarantees.
+	 */
+	if (redirty) {
 		redirty_page_for_writepage(wbc, page);
-		unlock_page(page);
-		return 0;
+		if (current->flags & PF_MEMALLOC) {
+			/*
+			 * For memory cleaning there's no point in writing only
+			 * some buffers. So just bail out. Warn if we came here
+			 * from direct reclaim.
+			 */
+			WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD))
+							== PF_MEMALLOC);
+			unlock_page(page);
+			return 0;
+		}
 	}
 
 	if (PageChecked(page) && ext4_should_journal_data(inode))
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-01-18 12:00 ` [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO Jan Kara
@ 2013-01-22 11:11   ` Dmitry Monakhov
  2013-01-22 13:44     ` Jan Kara
  0 siblings, 1 reply; 53+ messages in thread
From: Dmitry Monakhov @ 2013-01-22 11:11 UTC (permalink / raw)
  To: Jan Kara, Ted Tso; +Cc: linux-ext4, Jan Kara

On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> When using indirect blocks there is no possibility to have any unwritten
> extents. So wait for them in ext4_ind_direct_IO() is just bogus.
But as soon as i remember indirect implementation may also be used by
extents based inodes 3074: ext4_ext_direct_IO
    /* Use the old path for reads and writes beyond i_size. */
    if (rw != WRITE || final_size > inode->i_size)
       return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);

Am I missing ?
> 
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/indirect.c |    5 -----
>  1 files changed, 0 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 20862f9..993247c 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -807,11 +807,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>  
>  retry:
>  	if (rw == READ && ext4_should_dioread_nolock(inode)) {
> -		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
> -			mutex_lock(&inode->i_mutex);
> -			ext4_flush_unwritten_io(inode);
> -			mutex_unlock(&inode->i_mutex);
> -		}
>  		/*
>  		 * Nolock dioread optimization may be dynamically disabled
>  		 * via ext4_inode_block_unlocked_dio(). Check inode's state
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-01-22 11:11   ` Dmitry Monakhov
@ 2013-01-22 13:44     ` Jan Kara
  2013-01-22 14:12       ` Dmitry Monakhov
  2013-01-22 14:22       ` Zheng Liu
  0 siblings, 2 replies; 53+ messages in thread
From: Jan Kara @ 2013-01-22 13:44 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, Ted Tso, linux-ext4

On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > When using indirect blocks there is no possibility to have any unwritten
> > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> But as soon as i remember indirect implementation may also be used by
> extents based inodes 3074: ext4_ext_direct_IO
>     /* Use the old path for reads and writes beyond i_size. */
>     if (rw != WRITE || final_size > inode->i_size)
>        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> 
> Am I missing ?
  Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
and that code path needs some cleaning and commenting. In particular I'm
afraid using dioread_nolock for inodes with indirect map causes data
exposure bugs when unlocked DIO read races with DIO write because such
inodes don't support uninitialized extents.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-01-22 13:44     ` Jan Kara
@ 2013-01-22 14:12       ` Dmitry Monakhov
  2013-01-22 15:21         ` Jan Kara
  2013-01-22 14:22       ` Zheng Liu
  1 sibling, 1 reply; 53+ messages in thread
From: Dmitry Monakhov @ 2013-01-22 14:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jan Kara, Ted Tso, linux-ext4

On Tue, 22 Jan 2013 14:44:00 +0100, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > When using indirect blocks there is no possibility to have any unwritten
> > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > But as soon as i remember indirect implementation may also be used by
> > extents based inodes 3074: ext4_ext_direct_IO
> >     /* Use the old path for reads and writes beyond i_size. */
> >     if (rw != WRITE || final_size > inode->i_size)
> >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > 
> > Am I missing ?
>   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> and that code path needs some cleaning and commenting. In particular I'm
> afraid using dioread_nolock for inodes with indirect map causes data
> exposure bugs when unlocked DIO read races with DIO write because such
> inodes don't support uninitialized extents.
Yes that's why dioread_nolock works only for extent based inodes

static inline int ext4_should_dioread_nolock(struct inode *inode)
{
        if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
                return 0;
        if (!S_ISREG(inode->i_mode))
                return 0;
                if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
                   return 0;
                   if (ext4_should_journal_data(inode))
                return 0;
        return 1;
}

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-01-22 13:44     ` Jan Kara
  2013-01-22 14:12       ` Dmitry Monakhov
@ 2013-01-22 14:22       ` Zheng Liu
  2013-01-22 15:22         ` Jan Kara
  1 sibling, 1 reply; 53+ messages in thread
From: Zheng Liu @ 2013-01-22 14:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dmitry Monakhov, Ted Tso, linux-ext4

On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > When using indirect blocks there is no possibility to have any unwritten
> > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > But as soon as i remember indirect implementation may also be used by
> > extents based inodes 3074: ext4_ext_direct_IO
> >     /* Use the old path for reads and writes beyond i_size. */
> >     if (rw != WRITE || final_size > inode->i_size)
> >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > 
> > Am I missing ?
>   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> and that code path needs some cleaning and commenting. In particular I'm
> afraid using dioread_nolock for inodes with indirect map causes data
> exposure bugs when unlocked DIO read races with DIO write because such
> inodes don't support uninitialized extents.

Sorry, but I am still confused.  dioread_nolock is only for extent-based
file.  So when a file system without extent feature, dioread_nolock
couldn't be enabled.  It seems that we don't need to worry about
exposing stale data here.

Thanks,
                                                - Zheng

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-01-22 14:12       ` Dmitry Monakhov
@ 2013-01-22 15:21         ` Jan Kara
  0 siblings, 0 replies; 53+ messages in thread
From: Jan Kara @ 2013-01-22 15:21 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, Ted Tso, linux-ext4

On Tue 22-01-13 18:12:38, Dmitry Monakhov wrote:
> On Tue, 22 Jan 2013 14:44:00 +0100, Jan Kara <jack@suse.cz> wrote:
> > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > When using indirect blocks there is no possibility to have any unwritten
> > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > But as soon as i remember indirect implementation may also be used by
> > > extents based inodes 3074: ext4_ext_direct_IO
> > >     /* Use the old path for reads and writes beyond i_size. */
> > >     if (rw != WRITE || final_size > inode->i_size)
> > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > 
> > > Am I missing ?
> >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > and that code path needs some cleaning and commenting. In particular I'm
> > afraid using dioread_nolock for inodes with indirect map causes data
> > exposure bugs when unlocked DIO read races with DIO write because such
> > inodes don't support uninitialized extents.
> Yes that's why dioread_nolock works only for extent based inodes
> 
> static inline int ext4_should_dioread_nolock(struct inode *inode)
> {
>         if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
>                 return 0;
>         if (!S_ISREG(inode->i_mode))
>                 return 0;
>                 if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
>                    return 0;
>                    if (ext4_should_journal_data(inode))
>                 return 0;
>         return 1;
> }
  Sure, I was confused. Things work correctly just the code flow is a bit
confusing.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-01-22 14:22       ` Zheng Liu
@ 2013-01-22 15:22         ` Jan Kara
  2013-01-22 16:00           ` Zheng Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-22 15:22 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Dmitry Monakhov, Ted Tso, linux-ext4

On Tue 22-01-13 22:22:21, Zheng Liu wrote:
> On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > When using indirect blocks there is no possibility to have any unwritten
> > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > But as soon as i remember indirect implementation may also be used by
> > > extents based inodes 3074: ext4_ext_direct_IO
> > >     /* Use the old path for reads and writes beyond i_size. */
> > >     if (rw != WRITE || final_size > inode->i_size)
> > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > 
> > > Am I missing ?
> >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > and that code path needs some cleaning and commenting. In particular I'm
> > afraid using dioread_nolock for inodes with indirect map causes data
> > exposure bugs when unlocked DIO read races with DIO write because such
> > inodes don't support uninitialized extents.
> 
> Sorry, but I am still confused.  dioread_nolock is only for extent-based
> file.  So when a file system without extent feature, dioread_nolock
> couldn't be enabled.  It seems that we don't need to worry about
> exposing stale data here.
  Well, you can have fs with extent feature enabled but still with inodes
using indirect map. But as Dmitry pointed out, ext4_should_dioread_nolock()
handles that correctly. So there's not a bug I was suspecting.

							Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-01-22 15:22         ` Jan Kara
@ 2013-01-22 16:00           ` Zheng Liu
  2013-01-22 23:14             ` Jan Kara
  0 siblings, 1 reply; 53+ messages in thread
From: Zheng Liu @ 2013-01-22 16:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dmitry Monakhov, Ted Tso, linux-ext4

On Tue, Jan 22, 2013 at 04:22:43PM +0100, Jan Kara wrote:
> On Tue 22-01-13 22:22:21, Zheng Liu wrote:
> > On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> > > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > > When using indirect blocks there is no possibility to have any unwritten
> > > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > > But as soon as i remember indirect implementation may also be used by
> > > > extents based inodes 3074: ext4_ext_direct_IO
> > > >     /* Use the old path for reads and writes beyond i_size. */
> > > >     if (rw != WRITE || final_size > inode->i_size)
> > > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > > 
> > > > Am I missing ?
> > >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > > and that code path needs some cleaning and commenting. In particular I'm
> > > afraid using dioread_nolock for inodes with indirect map causes data
> > > exposure bugs when unlocked DIO read races with DIO write because such
> > > inodes don't support uninitialized extents.
> > 
> > Sorry, but I am still confused.  dioread_nolock is only for extent-based
> > file.  So when a file system without extent feature, dioread_nolock
> > couldn't be enabled.  It seems that we don't need to worry about
> > exposing stale data here.
>   Well, you can have fs with extent feature enabled but still with inodes
> using indirect map. But as Dmitry pointed out, ext4_should_dioread_nolock()
> handles that correctly. So there's not a bug I was suspecting.

Yep, the patch itself is fine.  But that would be great if a comment is
added here.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-01-22 16:00           ` Zheng Liu
@ 2013-01-22 23:14             ` Jan Kara
  2013-01-23  6:11               ` Zheng Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-22 23:14 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Dmitry Monakhov, Ted Tso, linux-ext4

On Wed 23-01-13 00:00:17, Zheng Liu wrote:
> On Tue, Jan 22, 2013 at 04:22:43PM +0100, Jan Kara wrote:
> > On Tue 22-01-13 22:22:21, Zheng Liu wrote:
> > > On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> > > > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > > > When using indirect blocks there is no possibility to have any unwritten
> > > > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > > > But as soon as i remember indirect implementation may also be used by
> > > > > extents based inodes 3074: ext4_ext_direct_IO
> > > > >     /* Use the old path for reads and writes beyond i_size. */
> > > > >     if (rw != WRITE || final_size > inode->i_size)
> > > > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > > > 
> > > > > Am I missing ?
> > > >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > > > and that code path needs some cleaning and commenting. In particular I'm
> > > > afraid using dioread_nolock for inodes with indirect map causes data
> > > > exposure bugs when unlocked DIO read races with DIO write because such
> > > > inodes don't support uninitialized extents.
> > > 
> > > Sorry, but I am still confused.  dioread_nolock is only for extent-based
> > > file.  So when a file system without extent feature, dioread_nolock
> > > couldn't be enabled.  It seems that we don't need to worry about
> > > exposing stale data here.
> >   Well, you can have fs with extent feature enabled but still with inodes
> > using indirect map. But as Dmitry pointed out, ext4_should_dioread_nolock()
> > handles that correctly. So there's not a bug I was suspecting.
> 
> Yep, the patch itself is fine.  But that would be great if a comment is
> added here.
  No, the patch is wrong. The code before the patch is correct. We can get
to that code for extent based inode which has unwritten conversions pending
and we need to wait for those as otherwise we could return 0s in places
where we acknowledged successful write just a while ago. Or am I missing
something?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-01-22 23:14             ` Jan Kara
@ 2013-01-23  6:11               ` Zheng Liu
  2013-01-23  9:42                 ` Jan Kara
  0 siblings, 1 reply; 53+ messages in thread
From: Zheng Liu @ 2013-01-23  6:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dmitry Monakhov, Ted Tso, linux-ext4

On Wed, Jan 23, 2013 at 12:14:32AM +0100, Jan Kara wrote:
> On Wed 23-01-13 00:00:17, Zheng Liu wrote:
> > On Tue, Jan 22, 2013 at 04:22:43PM +0100, Jan Kara wrote:
> > > On Tue 22-01-13 22:22:21, Zheng Liu wrote:
> > > > On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> > > > > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > > > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > > > > When using indirect blocks there is no possibility to have any unwritten
> > > > > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > > > > But as soon as i remember indirect implementation may also be used by
> > > > > > extents based inodes 3074: ext4_ext_direct_IO
> > > > > >     /* Use the old path for reads and writes beyond i_size. */
> > > > > >     if (rw != WRITE || final_size > inode->i_size)
> > > > > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > > > > 
> > > > > > Am I missing ?
> > > > >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > > > > and that code path needs some cleaning and commenting. In particular I'm
> > > > > afraid using dioread_nolock for inodes with indirect map causes data
> > > > > exposure bugs when unlocked DIO read races with DIO write because such
> > > > > inodes don't support uninitialized extents.
> > > > 
> > > > Sorry, but I am still confused.  dioread_nolock is only for extent-based
> > > > file.  So when a file system without extent feature, dioread_nolock
> > > > couldn't be enabled.  It seems that we don't need to worry about
> > > > exposing stale data here.
> > >   Well, you can have fs with extent feature enabled but still with inodes
> > > using indirect map. But as Dmitry pointed out, ext4_should_dioread_nolock()
> > > handles that correctly. So there's not a bug I was suspecting.
> > 
> > Yep, the patch itself is fine.  But that would be great if a comment is
> > added here.
>   No, the patch is wrong. The code before the patch is correct. We can get
> to that code for extent based inode which has unwritten conversions pending
> and we need to wait for those as otherwise we could return 0s in places
> where we acknowledged successful write just a while ago. Or am I missing
> something?

Ah, I see.  I guess that the problem is that the dio read races with buffered
write.

   dio read                         buffered write
                                    ->ext4_file_write
                                      ->ext4_da_write_begin
                                      ->ext4_da_write_end
                                      [buffered write has finished, but the data
                                       and metadata has not been flushed]
   ->generic_file_aio_read
     ->filemap_write_and_wait_range
       ->do_writepages
         ->ext4_da_writepages
     ->filemap_fdatawait_range
       ->wait_on_page_writeback
                                    ->ext4_end_bio
                                      ->end_page_writeback
                                        [unwritten extent has not been
                                         converted]
     ->ext4_ind_direct_IO
       [here we need to flush unwritten io]

So it seems that this patch could be applied after reworking unwritten extent
conversion.

FWIW, after applied this patch, the latency of dio read could be reduced
dramatically.  So that would be great if this patch can be applied when it
doesn't break something.

Thanks,
						- Zheng

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-01-23  6:11               ` Zheng Liu
@ 2013-01-23  9:42                 ` Jan Kara
  0 siblings, 0 replies; 53+ messages in thread
From: Jan Kara @ 2013-01-23  9:42 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Dmitry Monakhov, Ted Tso, linux-ext4

On Wed 23-01-13 14:11:41, Zheng Liu wrote:
> On Wed, Jan 23, 2013 at 12:14:32AM +0100, Jan Kara wrote:
> > On Wed 23-01-13 00:00:17, Zheng Liu wrote:
> > > On Tue, Jan 22, 2013 at 04:22:43PM +0100, Jan Kara wrote:
> > > > On Tue 22-01-13 22:22:21, Zheng Liu wrote:
> > > > > On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> > > > > > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > > > > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > > > > > When using indirect blocks there is no possibility to have any unwritten
> > > > > > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > > > > > But as soon as i remember indirect implementation may also be used by
> > > > > > > extents based inodes 3074: ext4_ext_direct_IO
> > > > > > >     /* Use the old path for reads and writes beyond i_size. */
> > > > > > >     if (rw != WRITE || final_size > inode->i_size)
> > > > > > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > > > > > 
> > > > > > > Am I missing ?
> > > > > >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > > > > > and that code path needs some cleaning and commenting. In particular I'm
> > > > > > afraid using dioread_nolock for inodes with indirect map causes data
> > > > > > exposure bugs when unlocked DIO read races with DIO write because such
> > > > > > inodes don't support uninitialized extents.
> > > > > 
> > > > > Sorry, but I am still confused.  dioread_nolock is only for extent-based
> > > > > file.  So when a file system without extent feature, dioread_nolock
> > > > > couldn't be enabled.  It seems that we don't need to worry about
> > > > > exposing stale data here.
> > > >   Well, you can have fs with extent feature enabled but still with inodes
> > > > using indirect map. But as Dmitry pointed out, ext4_should_dioread_nolock()
> > > > handles that correctly. So there's not a bug I was suspecting.
> > > 
> > > Yep, the patch itself is fine.  But that would be great if a comment is
> > > added here.
> >   No, the patch is wrong. The code before the patch is correct. We can get
> > to that code for extent based inode which has unwritten conversions pending
> > and we need to wait for those as otherwise we could return 0s in places
> > where we acknowledged successful write just a while ago. Or am I missing
> > something?
> 
> Ah, I see.  I guess that the problem is that the dio read races with buffered
> write.
> 
>    dio read                         buffered write
>                                     ->ext4_file_write
>                                       ->ext4_da_write_begin
>                                       ->ext4_da_write_end
>                                       [buffered write has finished, but the data
>                                        and metadata has not been flushed]
>    ->generic_file_aio_read
>      ->filemap_write_and_wait_range
>        ->do_writepages
>          ->ext4_da_writepages
>      ->filemap_fdatawait_range
>        ->wait_on_page_writeback
>                                     ->ext4_end_bio
>                                       ->end_page_writeback
>                                         [unwritten extent has not been
>                                          converted]
>      ->ext4_ind_direct_IO
>        [here we need to flush unwritten io]
  Yes, exactly.

> So it seems that this patch could be applied after reworking unwritten extent
> conversion.
  Yes. When PageWriteback is cleared after extent conversion, this waiting
can go away and everything should be fine.

> FWIW, after applied this patch, the latency of dio read could be reduced
> dramatically.  So that would be great if this patch can be applied when it
> doesn't break something.
  Sure, I'll have that in mind.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-18 12:00 ` [PATCH 04/12] ext4: Disable merging of uninitialized extents Jan Kara
@ 2013-01-24  9:49   ` Dmitry Monakhov
  2013-01-24 15:12     ` Jan Kara
                       ` (2 more replies)
  2013-02-09 17:10   ` REGRESSION: " Theodore Ts'o
  1 sibling, 3 replies; 53+ messages in thread
From: Dmitry Monakhov @ 2013-01-24  9:49 UTC (permalink / raw)
  To: Jan Kara, Ted Tso; +Cc: linux-ext4, Jan Kara

On Fri, 18 Jan 2013 13:00:38 +0100, Jan Kara <jack@suse.cz> wrote:
> Merging of uninitialized extents creates all sorts of interesting race
> possibilities when writeback / DIO races with fallocate. Thus
> ext4_convert_unwritten_extents_endio() has to deal with a case where
> extent to be converted needs to be split out first. That isn't nice
> for two reasons:
> 
> 1) It may need allocation of extent tree block so ENOSPC is possible.
> 2) It complicates end_io handling code
As we already discussed your idea is 100% correct, but even with
what patch I still able to trigger situation where split it required.
I've got following error with this patch applied on top of 7f5118629f7
EXT4-fs error (device dm-3): ext4_convert_unwritten_extents_endio:3411:
inode #12: comm kworker/u:4: Written extent modified before IO finished:
extent logical block 1379787, len 64; IO logical block 1379787, len 21

------------[ cut here ]------------
WARNING: at fs/ext4/extents.c:4518
ext4_convert_unwritten_extents+0x149/0x210 [ext4]()
Hardware name:         
Modules linked in: ext4 jbd2 cpufreq_ondemand acpi_cpufreq freq_table
mperf coretemp kvm_intel kvm crc32c_intel microcode sg button ext3 jbd
mbcache sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror
dm_region_hash dm_log dm_mod
Pid: 249, comm: kworker/u:4 Not tainted 3.8.0-rc3+ #16
Call Trace:
 [<ffffffff8106fc23>] warn_slowpath_common+0xc3/0xf0
 [<ffffffff8106fc6a>] warn_slowpath_null+0x1a/0x20
 [<ffffffffa03fb909>] ext4_convert_unwritten_extents+0x149/0x210 [ext4]
 [<ffffffff811064fa>] ? __lock_release+0x1da/0x1f0
 [<ffffffffa03c368e>] ext4_end_io+0x3e/0x160 [ext4]
 [<ffffffff813aab40>] ? __list_del_entry+0x210/0x250
 [<ffffffffa03c3a21>] ext4_do_flush_completed_IO+0x101/0x280 [ext4]
 [<ffffffffa03c3bb6>] ext4_end_io_work+0x16/0x20 [ext4]
 [<ffffffff8109f7dd>] process_one_work+0x4ad/0x780
 [<ffffffff8109f6d2>] ? process_one_work+0x3a2/0x780
 [<ffffffffa03c3ba0>] ? ext4_do_flush_completed_IO+0x280/0x280 [ext4]
 [<ffffffff810a3ed1>] worker_thread+0x3f1/0x590
 [<ffffffff810a3ae0>] ? manage_workers+0x210/0x210
 [<ffffffff810ac870>] kthread+0x100/0x110
 [<ffffffff810ac770>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff81812e2c>] ret_from_fork+0x7c/0xb0
 [<ffffffff810ac770>] ? __init_kthread_worker+0x70/0x70
---[ end trace add5cefed72186f8 ]---
EXT4-fs (dm-3): ext4_convert_unwritten_extents:4522: inode #12: block
1379787: len 21: ext4_ext_map_blocks returned -5
EXT4-fs (dm-3): failed to convert unwritten extents to written
extents -- potential data loss!  (inode 12, offset 5651562496, size
131072, error -5)

I've run 286'th xfstest (this is my own copy of xfstest so 286'th test
is differ from mainstream one) you can find it here
https://raw.github.com/dmonakhov/xfstests/devel/286
In two words it is stress test which run DIO/AIO,truncate,fallocate in parallel.
Also you need recent FIO(http://git.kernel.dk/?p=fio.git;a=summary)

Currently I try to understand what caused this issue.
> 
> So we disable merging of uninitialized extents which allows us to simplify
> the code. Extents will get merged after they are converted to initialized
> ones.
> 
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/extents.c |   61 +++++++++++++++-------------------------------------
>  1 files changed, 18 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 26af228..f1ce33a 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -54,9 +54,6 @@
>  #define EXT4_EXT_MARK_UNINIT1	0x2  /* mark first half uninitialized */
>  #define EXT4_EXT_MARK_UNINIT2	0x4  /* mark second half uninitialized */
>  
> -#define EXT4_EXT_DATA_VALID1	0x8  /* first half contains valid data */
> -#define EXT4_EXT_DATA_VALID2	0x10 /* second half contains valid data */
> -
>  static __le32 ext4_extent_block_csum(struct inode *inode,
>  				     struct ext4_extent_header *eh)
>  {
> @@ -1579,20 +1576,17 @@ int
>  ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
>  				struct ext4_extent *ex2)
>  {
> -	unsigned short ext1_ee_len, ext2_ee_len, max_len;
> +	unsigned ext1_ee_len, ext2_ee_len;
>  
>  	/*
> -	 * Make sure that either both extents are uninitialized, or
> -	 * both are _not_.
> +	 * Make sure that both extents are initialized. We don't merge
> +	 * uninitialized extents so that we can be sure that end_io code has
> +	 * the extent that was written properly split out and conversion to
> +	 * initialized is trivial.
>  	 */
> -	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
> +	if (ext4_ext_is_uninitialized(ex1) || ext4_ext_is_uninitialized(ex2))
>  		return 0;
>  
> -	if (ext4_ext_is_uninitialized(ex1))
> -		max_len = EXT_UNINIT_MAX_LEN;
> -	else
> -		max_len = EXT_INIT_MAX_LEN;
> -
>  	ext1_ee_len = ext4_ext_get_actual_len(ex1);
>  	ext2_ee_len = ext4_ext_get_actual_len(ex2);
>  
> @@ -1605,7 +1599,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
>  	 * as an RO_COMPAT feature, refuse to merge to extents if
>  	 * this can result in the top bit of ee_len being set.
>  	 */
> -	if (ext1_ee_len + ext2_ee_len > max_len)
> +	if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
>  		return 0;
>  #ifdef AGGRESSIVE_TEST
>  	if (ext1_ee_len >= 4)
> @@ -2959,9 +2953,6 @@ static int ext4_split_extent_at(handle_t *handle,
>  	unsigned int ee_len, depth;
>  	int err = 0;
>  
> -	BUG_ON((split_flag & (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2)) ==
> -	       (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2));
> -
>  	ext_debug("ext4_split_extents_at: inode %lu, logical"
>  		"block %llu\n", inode->i_ino, (unsigned long long)split);
>  
> @@ -3020,14 +3011,7 @@ static int ext4_split_extent_at(handle_t *handle,
>  
>  	err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
>  	if (err == -ENOSPC && (EXT4_EXT_MAY_ZEROOUT & split_flag)) {
> -		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
> -			if (split_flag & EXT4_EXT_DATA_VALID1)
> -				err = ext4_ext_zeroout(inode, ex2);
> -			else
> -				err = ext4_ext_zeroout(inode, ex);
> -		} else
> -			err = ext4_ext_zeroout(inode, &orig_ex);
> -
> +		err = ext4_ext_zeroout(inode, &orig_ex);
>  		if (err)
>  			goto fix_extent_len;
>  		/* update the extent length and mark as initialized */
> @@ -3085,8 +3069,6 @@ static int ext4_split_extent(handle_t *handle,
>  		if (uninitialized)
>  			split_flag1 |= EXT4_EXT_MARK_UNINIT1 |
>  				       EXT4_EXT_MARK_UNINIT2;
> -		if (split_flag & EXT4_EXT_DATA_VALID2)
> -			split_flag1 |= EXT4_EXT_DATA_VALID1;
>  		err = ext4_split_extent_at(handle, inode, path,
>  				map->m_lblk + map->m_len, split_flag1, flags1);
>  		if (err)
> @@ -3099,8 +3081,7 @@ static int ext4_split_extent(handle_t *handle,
>  		return PTR_ERR(path);
>  
>  	if (map->m_lblk >= ee_block) {
> -		split_flag1 = split_flag & (EXT4_EXT_MAY_ZEROOUT |
> -					    EXT4_EXT_DATA_VALID2);
> +		split_flag1 = split_flag & EXT4_EXT_MAY_ZEROOUT;
>  		if (uninitialized)
>  			split_flag1 |= EXT4_EXT_MARK_UNINIT1;
>  		if (split_flag & EXT4_EXT_MARK_UNINIT2)
> @@ -3379,8 +3360,7 @@ static int ext4_split_unwritten_extents(handle_t *handle,
>  
>  	split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0;
>  	split_flag |= EXT4_EXT_MARK_UNINIT2;
> -	if (flags & EXT4_GET_BLOCKS_CONVERT)
> -		split_flag |= EXT4_EXT_DATA_VALID2;
> +
>  	flags |= EXT4_GET_BLOCKS_PRE_IO;
>  	return ext4_split_extent(handle, inode, path, map, split_flag, flags);
>  }
> @@ -3405,20 +3385,15 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
>  		"block %llu, max_blocks %u\n", inode->i_ino,
>  		  (unsigned long long)ee_block, ee_len);
>  
> -	/* If extent is larger than requested then split is required */
> +	/* Extent is larger than requested? */
>  	if (ee_block != map->m_lblk || ee_len > map->m_len) {
> -		err = ext4_split_unwritten_extents(handle, inode, map, path,
> -						   EXT4_GET_BLOCKS_CONVERT);
> -		if (err < 0)
> -			goto out;
> -		ext4_ext_drop_refs(path);
> -		path = ext4_ext_find_extent(inode, map->m_lblk, path);
> -		if (IS_ERR(path)) {
> -			err = PTR_ERR(path);
> -			goto out;
> -		}
> -		depth = ext_depth(inode);
> -		ex = path[depth].p_ext;
> +		EXT4_ERROR_INODE(inode, "Written extent modified before IO"
> +			" finished: extent logical block %llu, len %u; IO"
> +			" logical block %llu, len %u\n",
> +			(unsigned long long)ee_block, ee_len,
> +			(unsigned long long)map->m_lblk, map->m_len);
> +		err = -EIO;
> +		goto out;
>  	}
>  
>  	err = ext4_ext_get_access(handle, inode, path + depth);
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-24  9:49   ` Dmitry Monakhov
@ 2013-01-24 15:12     ` Jan Kara
  2013-01-24 15:32       ` Dmitry Monakhov
  2013-01-28 14:36     ` Theodore Ts'o
  2013-01-31  7:47     ` Dmitry Monakhov
  2 siblings, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-24 15:12 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, Ted Tso, linux-ext4

On Thu 24-01-13 13:49:45, Dmitry Monakhov wrote:
> On Fri, 18 Jan 2013 13:00:38 +0100, Jan Kara <jack@suse.cz> wrote:
> > Merging of uninitialized extents creates all sorts of interesting race
> > possibilities when writeback / DIO races with fallocate. Thus
> > ext4_convert_unwritten_extents_endio() has to deal with a case where
> > extent to be converted needs to be split out first. That isn't nice
> > for two reasons:
> > 
> > 1) It may need allocation of extent tree block so ENOSPC is possible.
> > 2) It complicates end_io handling code
> As we already discussed your idea is 100% correct, but even with
> what patch I still able to trigger situation where split it required.
> I've got following error with this patch applied on top of 7f5118629f7
> EXT4-fs error (device dm-3): ext4_convert_unwritten_extents_endio:3411:
> inode #12: comm kworker/u:4: Written extent modified before IO finished:
> extent logical block 1379787, len 64; IO logical block 1379787, len 21
  Drat, thanks for heads up. I did run xfstests on the patch set but
apparently you are doing something more evil :) If I get your test & error
right, you do AIO DIO to a file while doing truncate 0, fallocate SIZE, in
a loop. And extent is found longer when we finish the IO. Am I right?

								Honza
> ------------[ cut here ]------------
> WARNING: at fs/ext4/extents.c:4518
> ext4_convert_unwritten_extents+0x149/0x210 [ext4]()
> Hardware name:         
> Modules linked in: ext4 jbd2 cpufreq_ondemand acpi_cpufreq freq_table
> mperf coretemp kvm_intel kvm crc32c_intel microcode sg button ext3 jbd
> mbcache sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror
> dm_region_hash dm_log dm_mod
> Pid: 249, comm: kworker/u:4 Not tainted 3.8.0-rc3+ #16
> Call Trace:
>  [<ffffffff8106fc23>] warn_slowpath_common+0xc3/0xf0
>  [<ffffffff8106fc6a>] warn_slowpath_null+0x1a/0x20
>  [<ffffffffa03fb909>] ext4_convert_unwritten_extents+0x149/0x210 [ext4]
>  [<ffffffff811064fa>] ? __lock_release+0x1da/0x1f0
>  [<ffffffffa03c368e>] ext4_end_io+0x3e/0x160 [ext4]
>  [<ffffffff813aab40>] ? __list_del_entry+0x210/0x250
>  [<ffffffffa03c3a21>] ext4_do_flush_completed_IO+0x101/0x280 [ext4]
>  [<ffffffffa03c3bb6>] ext4_end_io_work+0x16/0x20 [ext4]
>  [<ffffffff8109f7dd>] process_one_work+0x4ad/0x780
>  [<ffffffff8109f6d2>] ? process_one_work+0x3a2/0x780
>  [<ffffffffa03c3ba0>] ? ext4_do_flush_completed_IO+0x280/0x280 [ext4]
>  [<ffffffff810a3ed1>] worker_thread+0x3f1/0x590
>  [<ffffffff810a3ae0>] ? manage_workers+0x210/0x210
>  [<ffffffff810ac870>] kthread+0x100/0x110
>  [<ffffffff810ac770>] ? __init_kthread_worker+0x70/0x70
>  [<ffffffff81812e2c>] ret_from_fork+0x7c/0xb0
>  [<ffffffff810ac770>] ? __init_kthread_worker+0x70/0x70
> ---[ end trace add5cefed72186f8 ]---
> EXT4-fs (dm-3): ext4_convert_unwritten_extents:4522: inode #12: block
> 1379787: len 21: ext4_ext_map_blocks returned -5
> EXT4-fs (dm-3): failed to convert unwritten extents to written
> extents -- potential data loss!  (inode 12, offset 5651562496, size
> 131072, error -5)
> 
> I've run 286'th xfstest (this is my own copy of xfstest so 286'th test
> is differ from mainstream one) you can find it here
> https://raw.github.com/dmonakhov/xfstests/devel/286
> In two words it is stress test which run DIO/AIO,truncate,fallocate in parallel.
> Also you need recent FIO(http://git.kernel.dk/?p=fio.git;a=summary)
> 
> Currently I try to understand what caused this issue.
> > 
> > So we disable merging of uninitialized extents which allows us to simplify
> > the code. Extents will get merged after they are converted to initialized
> > ones.
> > 
> > Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/extents.c |   61 +++++++++++++++-------------------------------------
> >  1 files changed, 18 insertions(+), 43 deletions(-)
> > 
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 26af228..f1ce33a 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -54,9 +54,6 @@
> >  #define EXT4_EXT_MARK_UNINIT1	0x2  /* mark first half uninitialized */
> >  #define EXT4_EXT_MARK_UNINIT2	0x4  /* mark second half uninitialized */
> >  
> > -#define EXT4_EXT_DATA_VALID1	0x8  /* first half contains valid data */
> > -#define EXT4_EXT_DATA_VALID2	0x10 /* second half contains valid data */
> > -
> >  static __le32 ext4_extent_block_csum(struct inode *inode,
> >  				     struct ext4_extent_header *eh)
> >  {
> > @@ -1579,20 +1576,17 @@ int
> >  ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
> >  				struct ext4_extent *ex2)
> >  {
> > -	unsigned short ext1_ee_len, ext2_ee_len, max_len;
> > +	unsigned ext1_ee_len, ext2_ee_len;
> >  
> >  	/*
> > -	 * Make sure that either both extents are uninitialized, or
> > -	 * both are _not_.
> > +	 * Make sure that both extents are initialized. We don't merge
> > +	 * uninitialized extents so that we can be sure that end_io code has
> > +	 * the extent that was written properly split out and conversion to
> > +	 * initialized is trivial.
> >  	 */
> > -	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
> > +	if (ext4_ext_is_uninitialized(ex1) || ext4_ext_is_uninitialized(ex2))
> >  		return 0;
> >  
> > -	if (ext4_ext_is_uninitialized(ex1))
> > -		max_len = EXT_UNINIT_MAX_LEN;
> > -	else
> > -		max_len = EXT_INIT_MAX_LEN;
> > -
> >  	ext1_ee_len = ext4_ext_get_actual_len(ex1);
> >  	ext2_ee_len = ext4_ext_get_actual_len(ex2);
> >  
> > @@ -1605,7 +1599,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
> >  	 * as an RO_COMPAT feature, refuse to merge to extents if
> >  	 * this can result in the top bit of ee_len being set.
> >  	 */
> > -	if (ext1_ee_len + ext2_ee_len > max_len)
> > +	if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
> >  		return 0;
> >  #ifdef AGGRESSIVE_TEST
> >  	if (ext1_ee_len >= 4)
> > @@ -2959,9 +2953,6 @@ static int ext4_split_extent_at(handle_t *handle,
> >  	unsigned int ee_len, depth;
> >  	int err = 0;
> >  
> > -	BUG_ON((split_flag & (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2)) ==
> > -	       (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2));
> > -
> >  	ext_debug("ext4_split_extents_at: inode %lu, logical"
> >  		"block %llu\n", inode->i_ino, (unsigned long long)split);
> >  
> > @@ -3020,14 +3011,7 @@ static int ext4_split_extent_at(handle_t *handle,
> >  
> >  	err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
> >  	if (err == -ENOSPC && (EXT4_EXT_MAY_ZEROOUT & split_flag)) {
> > -		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
> > -			if (split_flag & EXT4_EXT_DATA_VALID1)
> > -				err = ext4_ext_zeroout(inode, ex2);
> > -			else
> > -				err = ext4_ext_zeroout(inode, ex);
> > -		} else
> > -			err = ext4_ext_zeroout(inode, &orig_ex);
> > -
> > +		err = ext4_ext_zeroout(inode, &orig_ex);
> >  		if (err)
> >  			goto fix_extent_len;
> >  		/* update the extent length and mark as initialized */
> > @@ -3085,8 +3069,6 @@ static int ext4_split_extent(handle_t *handle,
> >  		if (uninitialized)
> >  			split_flag1 |= EXT4_EXT_MARK_UNINIT1 |
> >  				       EXT4_EXT_MARK_UNINIT2;
> > -		if (split_flag & EXT4_EXT_DATA_VALID2)
> > -			split_flag1 |= EXT4_EXT_DATA_VALID1;
> >  		err = ext4_split_extent_at(handle, inode, path,
> >  				map->m_lblk + map->m_len, split_flag1, flags1);
> >  		if (err)
> > @@ -3099,8 +3081,7 @@ static int ext4_split_extent(handle_t *handle,
> >  		return PTR_ERR(path);
> >  
> >  	if (map->m_lblk >= ee_block) {
> > -		split_flag1 = split_flag & (EXT4_EXT_MAY_ZEROOUT |
> > -					    EXT4_EXT_DATA_VALID2);
> > +		split_flag1 = split_flag & EXT4_EXT_MAY_ZEROOUT;
> >  		if (uninitialized)
> >  			split_flag1 |= EXT4_EXT_MARK_UNINIT1;
> >  		if (split_flag & EXT4_EXT_MARK_UNINIT2)
> > @@ -3379,8 +3360,7 @@ static int ext4_split_unwritten_extents(handle_t *handle,
> >  
> >  	split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0;
> >  	split_flag |= EXT4_EXT_MARK_UNINIT2;
> > -	if (flags & EXT4_GET_BLOCKS_CONVERT)
> > -		split_flag |= EXT4_EXT_DATA_VALID2;
> > +
> >  	flags |= EXT4_GET_BLOCKS_PRE_IO;
> >  	return ext4_split_extent(handle, inode, path, map, split_flag, flags);
> >  }
> > @@ -3405,20 +3385,15 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
> >  		"block %llu, max_blocks %u\n", inode->i_ino,
> >  		  (unsigned long long)ee_block, ee_len);
> >  
> > -	/* If extent is larger than requested then split is required */
> > +	/* Extent is larger than requested? */
> >  	if (ee_block != map->m_lblk || ee_len > map->m_len) {
> > -		err = ext4_split_unwritten_extents(handle, inode, map, path,
> > -						   EXT4_GET_BLOCKS_CONVERT);
> > -		if (err < 0)
> > -			goto out;
> > -		ext4_ext_drop_refs(path);
> > -		path = ext4_ext_find_extent(inode, map->m_lblk, path);
> > -		if (IS_ERR(path)) {
> > -			err = PTR_ERR(path);
> > -			goto out;
> > -		}
> > -		depth = ext_depth(inode);
> > -		ex = path[depth].p_ext;
> > +		EXT4_ERROR_INODE(inode, "Written extent modified before IO"
> > +			" finished: extent logical block %llu, len %u; IO"
> > +			" logical block %llu, len %u\n",
> > +			(unsigned long long)ee_block, ee_len,
> > +			(unsigned long long)map->m_lblk, map->m_len);
> > +		err = -EIO;
> > +		goto out;
> >  	}
> >  
> >  	err = ext4_ext_get_access(handle, inode, path + depth);
> > -- 
> > 1.7.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-24 15:12     ` Jan Kara
@ 2013-01-24 15:32       ` Dmitry Monakhov
  0 siblings, 0 replies; 53+ messages in thread
From: Dmitry Monakhov @ 2013-01-24 15:32 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jan Kara, Ted Tso, linux-ext4

On Thu, 24 Jan 2013 16:12:24 +0100, Jan Kara <jack@suse.cz> wrote:
> On Thu 24-01-13 13:49:45, Dmitry Monakhov wrote:
> > On Fri, 18 Jan 2013 13:00:38 +0100, Jan Kara <jack@suse.cz> wrote:
> > > Merging of uninitialized extents creates all sorts of interesting race
> > > possibilities when writeback / DIO races with fallocate. Thus
> > > ext4_convert_unwritten_extents_endio() has to deal with a case where
> > > extent to be converted needs to be split out first. That isn't nice
> > > for two reasons:
> > > 
> > > 1) It may need allocation of extent tree block so ENOSPC is possible.
> > > 2) It complicates end_io handling code
> > As we already discussed your idea is 100% correct, but even with
> > what patch I still able to trigger situation where split it required.
> > I've got following error with this patch applied on top of 7f5118629f7
> > EXT4-fs error (device dm-3): ext4_convert_unwritten_extents_endio:3411:
> > inode #12: comm kworker/u:4: Written extent modified before IO finished:
> > extent logical block 1379787, len 64; IO logical block 1379787, len 21
>   Drat, thanks for heads up. I did run xfstests on the patch set but
> apparently you are doing something more evil :) If I get your test & error
> right, you do AIO DIO to a file while doing truncate 0, fallocate SIZE, in
> a loop. And extent is found longer when we finish the IO. Am I right?
Correct. AFAIU we have another bug which break things
I've added prink for ext4_can_extents_be_merged if it was positive
And have got following output:     lblk    len  pblk uninit   lblk    len  pblk     uninit 
ext4_can_extents_be_merged:1618 ex1[2254944:32, 2052192]:0 ex2[2254976:32, 2052224]:0
ext4_can_extents_be_merged:1618 ex1[398176:32, 2198368]:0 ex2[398208:32,2198400]:0
ext4_can_extents_be_merged:1618 ex1[584704:32, 1379328]:0 ex2[584736:32,1379360]:0
ext4_can_extents_be_merged:1618 ex1[1407488:32, 762368]:0 ex2[1407520:32, 762400]:0
ext4_can_extents_be_merged:1618 ex1[443744:32, 2495456]:0 ex2[443776:32,2495488]:0
ext4_can_extents_be_merged:1618 ex1[1057230:1, 2190944]:0 ex2[1057231:17, 2190945]:0
ext4_can_extents_be_merged:1618 ex1[2108160:832, 2563328]:0 ex2[2108992:32, 2564160]:0
##### Here Both extents was initialized                  ^^^                        ^^^
EXT4-fs error (device dm-3): ext4_convert_unwritten_extents_endio:3426: inode #12: comm kworker/u:4: Written extent modif
ied before IO finished: extent logical block 2108576, len 448; IO logical block 2108576, len 32
#####But right after that it is appeared uninitialized.

> 
> 								Honza
> > ------------[ cut here ]------------
> > WARNING: at fs/ext4/extents.c:4518
> > ext4_convert_unwritten_extents+0x149/0x210 [ext4]()
> > Hardware name:         
> > Modules linked in: ext4 jbd2 cpufreq_ondemand acpi_cpufreq freq_table
> > mperf coretemp kvm_intel kvm crc32c_intel microcode sg button ext3 jbd
> > mbcache sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror
> > dm_region_hash dm_log dm_mod
> > Pid: 249, comm: kworker/u:4 Not tainted 3.8.0-rc3+ #16
> > Call Trace:
> >  [<ffffffff8106fc23>] warn_slowpath_common+0xc3/0xf0
> >  [<ffffffff8106fc6a>] warn_slowpath_null+0x1a/0x20
> >  [<ffffffffa03fb909>] ext4_convert_unwritten_extents+0x149/0x210 [ext4]
> >  [<ffffffff811064fa>] ? __lock_release+0x1da/0x1f0
> >  [<ffffffffa03c368e>] ext4_end_io+0x3e/0x160 [ext4]
> >  [<ffffffff813aab40>] ? __list_del_entry+0x210/0x250
> >  [<ffffffffa03c3a21>] ext4_do_flush_completed_IO+0x101/0x280 [ext4]
> >  [<ffffffffa03c3bb6>] ext4_end_io_work+0x16/0x20 [ext4]
> >  [<ffffffff8109f7dd>] process_one_work+0x4ad/0x780
> >  [<ffffffff8109f6d2>] ? process_one_work+0x3a2/0x780
> >  [<ffffffffa03c3ba0>] ? ext4_do_flush_completed_IO+0x280/0x280 [ext4]
> >  [<ffffffff810a3ed1>] worker_thread+0x3f1/0x590
> >  [<ffffffff810a3ae0>] ? manage_workers+0x210/0x210
> >  [<ffffffff810ac870>] kthread+0x100/0x110
> >  [<ffffffff810ac770>] ? __init_kthread_worker+0x70/0x70
> >  [<ffffffff81812e2c>] ret_from_fork+0x7c/0xb0
> >  [<ffffffff810ac770>] ? __init_kthread_worker+0x70/0x70
> > ---[ end trace add5cefed72186f8 ]---
> > EXT4-fs (dm-3): ext4_convert_unwritten_extents:4522: inode #12: block
> > 1379787: len 21: ext4_ext_map_blocks returned -5
> > EXT4-fs (dm-3): failed to convert unwritten extents to written
> > extents -- potential data loss!  (inode 12, offset 5651562496, size
> > 131072, error -5)
> > 
> > I've run 286'th xfstest (this is my own copy of xfstest so 286'th test
> > is differ from mainstream one) you can find it here
> > https://raw.github.com/dmonakhov/xfstests/devel/286
> > In two words it is stress test which run DIO/AIO,truncate,fallocate in parallel.
> > Also you need recent FIO(http://git.kernel.dk/?p=fio.git;a=summary)
> > 
> > Currently I try to understand what caused this issue.
> > > 
> > > So we disable merging of uninitialized extents which allows us to simplify
> > > the code. Extents will get merged after they are converted to initialized
> > > ones.
> > > 
> > > Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/ext4/extents.c |   61 +++++++++++++++-------------------------------------
> > >  1 files changed, 18 insertions(+), 43 deletions(-)
> > > 
> > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > > index 26af228..f1ce33a 100644
> > > --- a/fs/ext4/extents.c
> > > +++ b/fs/ext4/extents.c
> > > @@ -54,9 +54,6 @@
> > >  #define EXT4_EXT_MARK_UNINIT1	0x2  /* mark first half uninitialized */
> > >  #define EXT4_EXT_MARK_UNINIT2	0x4  /* mark second half uninitialized */
> > >  
> > > -#define EXT4_EXT_DATA_VALID1	0x8  /* first half contains valid data */
> > > -#define EXT4_EXT_DATA_VALID2	0x10 /* second half contains valid data */
> > > -
> > >  static __le32 ext4_extent_block_csum(struct inode *inode,
> > >  				     struct ext4_extent_header *eh)
> > >  {
> > > @@ -1579,20 +1576,17 @@ int
> > >  ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
> > >  				struct ext4_extent *ex2)
> > >  {
> > > -	unsigned short ext1_ee_len, ext2_ee_len, max_len;
> > > +	unsigned ext1_ee_len, ext2_ee_len;
> > >  
> > >  	/*
> > > -	 * Make sure that either both extents are uninitialized, or
> > > -	 * both are _not_.
> > > +	 * Make sure that both extents are initialized. We don't merge
> > > +	 * uninitialized extents so that we can be sure that end_io code has
> > > +	 * the extent that was written properly split out and conversion to
> > > +	 * initialized is trivial.
> > >  	 */
> > > -	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
> > > +	if (ext4_ext_is_uninitialized(ex1) || ext4_ext_is_uninitialized(ex2))
> > >  		return 0;
> > >  
> > > -	if (ext4_ext_is_uninitialized(ex1))
> > > -		max_len = EXT_UNINIT_MAX_LEN;
> > > -	else
> > > -		max_len = EXT_INIT_MAX_LEN;
> > > -
> > >  	ext1_ee_len = ext4_ext_get_actual_len(ex1);
> > >  	ext2_ee_len = ext4_ext_get_actual_len(ex2);
> > >  
> > > @@ -1605,7 +1599,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
> > >  	 * as an RO_COMPAT feature, refuse to merge to extents if
> > >  	 * this can result in the top bit of ee_len being set.
> > >  	 */
> > > -	if (ext1_ee_len + ext2_ee_len > max_len)
> > > +	if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
> > >  		return 0;
> > >  #ifdef AGGRESSIVE_TEST
> > >  	if (ext1_ee_len >= 4)
> > > @@ -2959,9 +2953,6 @@ static int ext4_split_extent_at(handle_t *handle,
> > >  	unsigned int ee_len, depth;
> > >  	int err = 0;
> > >  
> > > -	BUG_ON((split_flag & (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2)) ==
> > > -	       (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2));
> > > -
> > >  	ext_debug("ext4_split_extents_at: inode %lu, logical"
> > >  		"block %llu\n", inode->i_ino, (unsigned long long)split);
> > >  
> > > @@ -3020,14 +3011,7 @@ static int ext4_split_extent_at(handle_t *handle,
> > >  
> > >  	err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
> > >  	if (err == -ENOSPC && (EXT4_EXT_MAY_ZEROOUT & split_flag)) {
> > > -		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
> > > -			if (split_flag & EXT4_EXT_DATA_VALID1)
> > > -				err = ext4_ext_zeroout(inode, ex2);
> > > -			else
> > > -				err = ext4_ext_zeroout(inode, ex);
> > > -		} else
> > > -			err = ext4_ext_zeroout(inode, &orig_ex);
> > > -
> > > +		err = ext4_ext_zeroout(inode, &orig_ex);
> > >  		if (err)
> > >  			goto fix_extent_len;
> > >  		/* update the extent length and mark as initialized */
> > > @@ -3085,8 +3069,6 @@ static int ext4_split_extent(handle_t *handle,
> > >  		if (uninitialized)
> > >  			split_flag1 |= EXT4_EXT_MARK_UNINIT1 |
> > >  				       EXT4_EXT_MARK_UNINIT2;
> > > -		if (split_flag & EXT4_EXT_DATA_VALID2)
> > > -			split_flag1 |= EXT4_EXT_DATA_VALID1;
> > >  		err = ext4_split_extent_at(handle, inode, path,
> > >  				map->m_lblk + map->m_len, split_flag1, flags1);
> > >  		if (err)
> > > @@ -3099,8 +3081,7 @@ static int ext4_split_extent(handle_t *handle,
> > >  		return PTR_ERR(path);
> > >  
> > >  	if (map->m_lblk >= ee_block) {
> > > -		split_flag1 = split_flag & (EXT4_EXT_MAY_ZEROOUT |
> > > -					    EXT4_EXT_DATA_VALID2);
> > > +		split_flag1 = split_flag & EXT4_EXT_MAY_ZEROOUT;
> > >  		if (uninitialized)
> > >  			split_flag1 |= EXT4_EXT_MARK_UNINIT1;
> > >  		if (split_flag & EXT4_EXT_MARK_UNINIT2)
> > > @@ -3379,8 +3360,7 @@ static int ext4_split_unwritten_extents(handle_t *handle,
> > >  
> > >  	split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0;
> > >  	split_flag |= EXT4_EXT_MARK_UNINIT2;
> > > -	if (flags & EXT4_GET_BLOCKS_CONVERT)
> > > -		split_flag |= EXT4_EXT_DATA_VALID2;
> > > +
> > >  	flags |= EXT4_GET_BLOCKS_PRE_IO;
> > >  	return ext4_split_extent(handle, inode, path, map, split_flag, flags);
> > >  }
> > > @@ -3405,20 +3385,15 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
> > >  		"block %llu, max_blocks %u\n", inode->i_ino,
> > >  		  (unsigned long long)ee_block, ee_len);
> > >  
> > > -	/* If extent is larger than requested then split is required */
> > > +	/* Extent is larger than requested? */
> > >  	if (ee_block != map->m_lblk || ee_len > map->m_len) {
> > > -		err = ext4_split_unwritten_extents(handle, inode, map, path,
> > > -						   EXT4_GET_BLOCKS_CONVERT);
> > > -		if (err < 0)
> > > -			goto out;
> > > -		ext4_ext_drop_refs(path);
> > > -		path = ext4_ext_find_extent(inode, map->m_lblk, path);
> > > -		if (IS_ERR(path)) {
> > > -			err = PTR_ERR(path);
> > > -			goto out;
> > > -		}
> > > -		depth = ext_depth(inode);
> > > -		ex = path[depth].p_ext;
> > > +		EXT4_ERROR_INODE(inode, "Written extent modified before IO"
> > > +			" finished: extent logical block %llu, len %u; IO"
> > > +			" logical block %llu, len %u\n",
> > > +			(unsigned long long)ee_block, ee_len,
> > > +			(unsigned long long)map->m_lblk, map->m_len);
> > > +		err = -EIO;
> > > +		goto out;
> > >  	}
> > >  
> > >  	err = ext4_ext_get_access(handle, inode, path + depth);
> > > -- 
> > > 1.7.1
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 01/12] ext4: Always use ext4_bio_write_page() for writeout
  2013-01-18 12:00 ` [PATCH 01/12] ext4: Always use ext4_bio_write_page() for writeout Jan Kara
@ 2013-01-28 14:31   ` Theodore Ts'o
  0 siblings, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-28 14:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Jan 18, 2013 at 01:00:35PM +0100, Jan Kara wrote:
> Currently we sometimes used block_write_full_page() and sometimes
> ext4_bio_write_page() for writeback (depending on mount options and call
> path). Let's always use ext4_bio_write_page() to simplify things a bit.
> 
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/12] ext4: Use redirty_page_for_writepage() in ext4_bio_write_page()
  2013-01-18 12:00 ` [PATCH 02/12] ext4: Use redirty_page_for_writepage() in ext4_bio_write_page() Jan Kara
@ 2013-01-28 14:34   ` Theodore Ts'o
  0 siblings, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-28 14:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Jan 18, 2013 at 01:00:36PM +0100, Jan Kara wrote:
> When we cannot write a page we should use redirty_page_for_writepage()
> instead of plain set_page_dirty(). That tells writeback code we have
> problems, redirties only the page (redirtying buffers is not needed),
> and updates mm accounting of failed page writes.
> 
> Also move clearing of buffer dirty flag after io_submit_add_bh(). At that
> moment we are sure buffer will be going to disk.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Applied, thanks.

						- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-24  9:49   ` Dmitry Monakhov
  2013-01-24 15:12     ` Jan Kara
@ 2013-01-28 14:36     ` Theodore Ts'o
  2013-01-28 15:02       ` Dmitry Monakhov
  2013-01-31  7:47     ` Dmitry Monakhov
  2 siblings, 1 reply; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-28 14:36 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4

On Thu, Jan 24, 2013 at 01:49:45PM +0400, Dmitry Monakhov wrote:
> > 1) It may need allocation of extent tree block so ENOSPC is possible.
> > 2) It complicates end_io handling code
>
> As we already discussed your idea is 100% correct, but even with
> what patch I still able to trigger situation where split it required.
> I've got following error with this patch applied on top of 7f5118629f7
> EXT4-fs error (device dm-3): ext4_convert_unwritten_extents_endio:3411:
> inode #12: comm kworker/u:4: Written extent modified before IO finished:
> extent logical block 1379787, len 64; IO logical block 1379787, len 21

So does this patch makes this better enough that it's worth applying
now?  Or should we hold off until we figure out what's going on with
the race that you've foind?

Thanks,

					- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 08/12] ext4: Remove __ext4_journalled_writepage() from mpage_da_submit_io()
  2013-01-18 12:00 ` [PATCH 08/12] ext4: Remove __ext4_journalled_writepage() from mpage_da_submit_io() Jan Kara
@ 2013-01-28 14:40   ` Theodore Ts'o
  0 siblings, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-28 14:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Jan 18, 2013 at 01:00:42PM +0100, Jan Kara wrote:
> We don't support delayed allocation in data=journal mode. So checking for it in
> mpage_da_submit_io() doesn't make really sence. If we ever decide to extend
> delayed allocation support to data=journal mode, adding
> __ext4_journalled_writepage() call will be the least of problems we have to
> solve. Most likely we'd have to implement separate writepages call anyways
> because we don't have transaction credits for writing more than a single page
> so mapping of page buffers would have to be done differently.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 06/12] ext4: Move work from io_end to inode
  2013-01-18 12:00 ` [PATCH 06/12] ext4: Move work from io_end to inode Jan Kara
@ 2013-01-28 14:45   ` Theodore Ts'o
  0 siblings, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-28 14:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Jan 18, 2013 at 01:00:40PM +0100, Jan Kara wrote:
> It does not make much sense to have struct work in ext4_io_end_t because we
> always use it for only one ext4_io_end_t per inode (the first one in the
> i_completed_io list). So just move the structure to inode itself.  This also
> allows for a small simplification in processing io_end structures.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

(Note, I haven't applied patch 5 yet because it has a dependency on
patch #4, and I'm waiting to hear whether we should apply this given
the race condition which Dmitry has found.  It sounds like it's a
pre-existing condition, and not a regression, but I want get a
confirmation that applying patches 5 and 6 won't make things worse...)

	     	  	   	     	   - Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/12] ext4: Simplify list handling in ext4_do_flush_completed_IO()
  2013-01-18 12:00 ` [PATCH 07/12] ext4: Simplify list handling in ext4_do_flush_completed_IO() Jan Kara
@ 2013-01-28 14:51   ` Theodore Ts'o
  0 siblings, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-28 14:51 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Dmitry Monakhov

On Fri, Jan 18, 2013 at 01:00:41PM +0100, Jan Kara wrote:
> The function splices i_completed_io_list to its private list first. From that
> moment on we don't need any lock for working with io_end structures because all
> io_end structure on the list are only our own. So we can remove the other two
> lists in the function and free io_end immediately after we are done with it.
> 
> CC: Dmitry Monakhov <dmonakhov@openvz.org>
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

						- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-28 14:36     ` Theodore Ts'o
@ 2013-01-28 15:02       ` Dmitry Monakhov
  2013-01-28 15:38         ` Theodore Ts'o
  0 siblings, 1 reply; 53+ messages in thread
From: Dmitry Monakhov @ 2013-01-28 15:02 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4

On Mon, 28 Jan 2013 09:36:47 -0500, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Jan 24, 2013 at 01:49:45PM +0400, Dmitry Monakhov wrote:
> > > 1) It may need allocation of extent tree block so ENOSPC is possible.
> > > 2) It complicates end_io handling code
> >
> > As we already discussed your idea is 100% correct, but even with
> > what patch I still able to trigger situation where split it required.
> > I've got following error with this patch applied on top of 7f5118629f7
> > EXT4-fs error (device dm-3): ext4_convert_unwritten_extents_endio:3411:
> > inode #12: comm kworker/u:4: Written extent modified before IO finished:
> > extent logical block 1379787, len 64; IO logical block 1379787, len 21
> 
> So does this patch makes this better enough that it's worth applying
> now?  Or should we hold off until we figure out what's going on with
> the race that you've foind?
Actually this patch consists of two peaces
1) disable merging of uninitialized extents. (1 line change) I'm
absolutely agree with it.
2) Remove explicit extent split inside endio. By assumption explicit
split is workaround against 1'st bug, but even w/ disabled merging
we still have another race which result in situation where split is
required.

At this moment i try to localize this issue. Let's commit first part,
but keep second one until I'll find the race. 
> 
> Thanks,
> 
> 					- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-28 15:02       ` Dmitry Monakhov
@ 2013-01-28 15:38         ` Theodore Ts'o
  2013-01-29  7:41           ` Dmitry Monakhov
  0 siblings, 1 reply; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-28 15:38 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4

On Mon, Jan 28, 2013 at 07:02:55PM +0400, Dmitry Monakhov wrote:
> Actually this patch consists of two peaces
> 1) disable merging of uninitialized extents. (1 line change) I'm
> absolutely agree with it.

To be clear, that's this patch chunk (one line change not including
comments :-), right?

--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1579,11 +1576,13 @@ int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
 
 	/*
-	 * Make sure that either both extents are uninitialized, or
-	 * both are _not_.
+	 * Make sure that both extents are initialized. We don't merge
+	 * uninitialized extents so that we can be sure that end_io code has
+	 * the extent that was written properly split out and conversion to
+	 * initialized is trivial.
 	 */
-	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+	if (ext4_ext_is_uninitialized(ex1) || ext4_ext_is_uninitialized(ex2))
 		return 0;
 

The one thing I'm a bit worried about is how much worse will extent
fragmentation be once we do this, but it's clear we need to strive for
correctness first.

							- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/12] ext4: Dirty page has always buffers attached
  2013-01-18 12:00 ` [PATCH 09/12] ext4: Dirty page has always buffers attached Jan Kara
@ 2013-01-28 17:55   ` Theodore Ts'o
  0 siblings, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-28 17:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Jan 18, 2013 at 01:00:43PM +0100, Jan Kara wrote:
> ext4_writepage(), write_cache_pages_da(), and mpage_da_submit_io() doesn't
> have to deal with the case when page doesn't have buffers. We attach buffers
> to a page in ->write_begin() and ->page_mkwrite() which covers all places
> where a page can become dirty.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

						- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/12] ext4: Simplify mpage_add_bh_to_extent()
  2013-01-18 12:00 ` [PATCH 10/12] ext4: Simplify mpage_add_bh_to_extent() Jan Kara
@ 2013-01-28 18:06   ` Theodore Ts'o
  0 siblings, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-28 18:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Jan 18, 2013 at 01:00:44PM +0100, Jan Kara wrote:
> The argument b_size of mpage_add_bh_to_extent() was bogus since it was
> always == blocksize (which we can easily derive from inode->i_blkbits).
> Also second branch of condition:
> 	if (nrblocks >= EXT4_MAX_TRANS_DATA) {
> 	} else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) >
> 						EXT4_MAX_TRANS_DATA) {
> 	}
> was never taken because (b_size >> mpd->inode->i_blkbits) == 1.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/12] ext4: Make ext4_bio_writepage() handle unprepared buffers
  2013-01-18 12:00 ` [PATCH 11/12] ext4: Make ext4_bio_writepage() handle unprepared buffers Jan Kara
@ 2013-01-29  1:59   ` Theodore Ts'o
  0 siblings, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-29  1:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Jan 18, 2013 at 01:00:45PM +0100, Jan Kara wrote:
> So far ext4_bio_writepage() unconditionally cleared dirty bit on all buffers
> underlying the page. That implicitely assumes we can write all buffers. So
> far that is true because callers call into ext4_bio_writepage() make sure
> all buffers in the page are mapped but
> a) it's a data corruption bug waiting to happen
> b) in data=ordered mode when blocksize < pagesize we do need to write pages
>    that may have only some of dirty buffers mapped.
> 
> So change ext4_bio_writepage() to skip buffers that cannot be written without
> clearing their dirty bit.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

						- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 12/12] ext4: Fix ext4_writepage() to achieve data=ordered guarantees
  2013-01-18 12:00 ` [PATCH 12/12] ext4: Fix ext4_writepage() to achieve data=ordered guarantees Jan Kara
@ 2013-01-29  2:08   ` Theodore Ts'o
  0 siblings, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-29  2:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Jan 18, 2013 at 01:00:46PM +0100, Jan Kara wrote:
> So far ext4_writepage() skipped writing pages that had any delayed or
> unwritten buffers attached. When blocksize < pagesize this breaks
> data=ordered mode guarantees as we can have a page with one freshly
> allocated buffer whose allocation is part of the committing transaction
> and another buffer in the page which is delayed or unwritten. So fix this
> problem by calling ext4_bio_writepage() anyway. It will submit mapped
> buffers and leave others alone.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-28 15:38         ` Theodore Ts'o
@ 2013-01-29  7:41           ` Dmitry Monakhov
  2013-01-29  8:37             ` Zheng Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Dmitry Monakhov @ 2013-01-29  7:41 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4

On Mon, 28 Jan 2013 10:38:36 -0500, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Jan 28, 2013 at 07:02:55PM +0400, Dmitry Monakhov wrote:
> > Actually this patch consists of two peaces
> > 1) disable merging of uninitialized extents. (1 line change) I'm
> > absolutely agree with it.
> 
> To be clear, that's this patch chunk (one line change not including
> comments :-), right?
Off course.
> 
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -1579,11 +1576,13 @@ int
>  ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
>  				struct ext4_extent *ex2)
>  {
>  
>  	/*
> -	 * Make sure that either both extents are uninitialized, or
> -	 * both are _not_.
> +	 * Make sure that both extents are initialized. We don't merge
> +	 * uninitialized extents so that we can be sure that end_io code has
> +	 * the extent that was written properly split out and conversion to
> +	 * initialized is trivial.
>  	 */
> -	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
> +	if (ext4_ext_is_uninitialized(ex1) || ext4_ext_is_uninitialized(ex2))
>  		return 0;
>  
> 
> The one thing I'm a bit worried about is how much worse will extent
> fragmentation be once we do this, but it's clear we need to strive for
> correctness first.
This change should not affect fragmentation because
1) Most people call fallocate(2) on big chunks (>4M)
2) Once uninitialized extent filled with data and converted to
initialized extents  will be merged immediately.
3) Most people use fallocate(2) for preallocation before write(2)
   so effectively calls are interleaved so merging works as expected.

The only case where fragmentation will increase is when someone
performs many fallocate(2) calls for small chunks (4k) w/o writes.
As result leaf block will consist of 256 extents 4k each.
Later writes can't help us because we can not merge extents from two
leaf blocks. But I still think that this use case it inconvenient.

BTW why do we not try to merge extents from two leaf blocs?
I do not see any technical difficulties. If two adjacent leaf blocks
are covered by common index block merging is possible (but we need +1
journal block).  
> 
> 							- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-29  7:41           ` Dmitry Monakhov
@ 2013-01-29  8:37             ` Zheng Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Zheng Liu @ 2013-01-29  8:37 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Theodore Ts'o, Jan Kara, linux-ext4

On Tue, Jan 29, 2013 at 11:41:13AM +0400, Dmitry Monakhov wrote:
> On Mon, 28 Jan 2013 10:38:36 -0500, Theodore Ts'o <tytso@mit.edu> wrote:
> > On Mon, Jan 28, 2013 at 07:02:55PM +0400, Dmitry Monakhov wrote:
> > > Actually this patch consists of two peaces
> > > 1) disable merging of uninitialized extents. (1 line change) I'm
> > > absolutely agree with it.
> > 
> > To be clear, that's this patch chunk (one line change not including
> > comments :-), right?
> Off course.
> > 
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -1579,11 +1576,13 @@ int
> >  ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
> >  				struct ext4_extent *ex2)
> >  {
> >  
> >  	/*
> > -	 * Make sure that either both extents are uninitialized, or
> > -	 * both are _not_.
> > +	 * Make sure that both extents are initialized. We don't merge
> > +	 * uninitialized extents so that we can be sure that end_io code has
> > +	 * the extent that was written properly split out and conversion to
> > +	 * initialized is trivial.
> >  	 */
> > -	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
> > +	if (ext4_ext_is_uninitialized(ex1) || ext4_ext_is_uninitialized(ex2))
> >  		return 0;
> >  
> > 
> > The one thing I'm a bit worried about is how much worse will extent
> > fragmentation be once we do this, but it's clear we need to strive for
> > correctness first.
> This change should not affect fragmentation because
> 1) Most people call fallocate(2) on big chunks (>4M)
> 2) Once uninitialized extent filled with data and converted to
> initialized extents  will be merged immediately.
> 3) Most people use fallocate(2) for preallocation before write(2)
>    so effectively calls are interleaved so merging works as expected.
> 
> The only case where fragmentation will increase is when someone
> performs many fallocate(2) calls for small chunks (4k) w/o writes.
> As result leaf block will consist of 256 extents 4k each.
> Later writes can't help us because we can not merge extents from two
> leaf blocks. But I still think that this use case it inconvenient.

Yes, I doubt that no one does like this because it can not brings any
benefit.  We usually call fallocate(2) to preallocate some sequential
spaces.  Obviously preallocating a small chunk is useless.

> 
> BTW why do we not try to merge extents from two leaf blocs?
> I do not see any technical difficulties. If two adjacent leaf blocks
> are covered by common index block merging is possible (but we need +1
> journal block).

Yep, I also notice this.  I don't think there is a technical difficulty.
That would be great if two adjacent leaf blocks could be merged.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-24  9:49   ` Dmitry Monakhov
  2013-01-24 15:12     ` Jan Kara
  2013-01-28 14:36     ` Theodore Ts'o
@ 2013-01-31  7:47     ` Dmitry Monakhov
  2013-01-31 12:39       ` Jan Kara
  2013-01-31 16:54       ` Theodore Ts'o
  2 siblings, 2 replies; 53+ messages in thread
From: Dmitry Monakhov @ 2013-01-31  7:47 UTC (permalink / raw)
  To: Jan Kara, Ted Tso; +Cc: linux-ext4, Jan Kara

On Thu, 24 Jan 2013 13:49:45 +0400, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> On Fri, 18 Jan 2013 13:00:38 +0100, Jan Kara <jack@suse.cz> wrote:
> > Merging of uninitialized extents creates all sorts of interesting race
> > possibilities when writeback / DIO races with fallocate. Thus
> > ext4_convert_unwritten_extents_endio() has to deal with a case where
> > extent to be converted needs to be split out first. That isn't nice
> > for two reasons:
> > 
> > 1) It may need allocation of extent tree block so ENOSPC is possible.
> > 2) It complicates end_io handling code
> As we already discussed your idea is 100% correct, but even with
> what patch I still able to trigger situation where split it required.
> I've got following error with this patch applied on top of 7f5118629f7
> EXT4-fs error (device dm-3): ext4_convert_unwritten_extents_endio:3411:
> inode #12: comm kworker/u:4: Written extent modified before IO finished:
> extent logical block 1379787, len 64; IO logical block 1379787, len 21
> 
> ------------[ cut here ]------------
> WARNING: at fs/ext4/extents.c:4518
> ext4_convert_unwritten_extents+0x149/0x210 [ext4]()
OK I've found it. I'm a bit disappointed, it is even not a race
condition, but simple corruption.
Patch is available here: http://article.gmane.org/gmane.comp.file-systems.ext4/36762
link for sain mailer client: <1359617098-18451-1-git-send-email-dmonakhov@openvz.org>
After this bug was fixed it is safe to apply both Jan's patches:
[PATCH 04/12] ext4: Disable merging of uninitialized extents
[PATCH 05/12] ext4: Remove unnecessary wait for extent conversion in ext4_fallocate()
At least it survives after all my tests.

BTW: It is appeared that ext4_debug() infrastructure is almost unusable
because based on printk() instead of light-wait event tracing infrastructure.
I'm now work on patch-set which fix that.

> Hardware name:         
> Modules linked in: ext4 jbd2 cpufreq_ondemand acpi_cpufreq freq_table
> mperf coretemp kvm_intel kvm crc32c_intel microcode sg button ext3 jbd
> mbcache sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror
> dm_region_hash dm_log dm_mod
> Pid: 249, comm: kworker/u:4 Not tainted 3.8.0-rc3+ #16
> Call Trace:
>  [<ffffffff8106fc23>] warn_slowpath_common+0xc3/0xf0
>  [<ffffffff8106fc6a>] warn_slowpath_null+0x1a/0x20
>  [<ffffffffa03fb909>] ext4_convert_unwritten_extents+0x149/0x210 [ext4]
>  [<ffffffff811064fa>] ? __lock_release+0x1da/0x1f0
>  [<ffffffffa03c368e>] ext4_end_io+0x3e/0x160 [ext4]
>  [<ffffffff813aab40>] ? __list_del_entry+0x210/0x250
>  [<ffffffffa03c3a21>] ext4_do_flush_completed_IO+0x101/0x280 [ext4]
>  [<ffffffffa03c3bb6>] ext4_end_io_work+0x16/0x20 [ext4]
>  [<ffffffff8109f7dd>] process_one_work+0x4ad/0x780
>  [<ffffffff8109f6d2>] ? process_one_work+0x3a2/0x780
>  [<ffffffffa03c3ba0>] ? ext4_do_flush_completed_IO+0x280/0x280 [ext4]
>  [<ffffffff810a3ed1>] worker_thread+0x3f1/0x590
>  [<ffffffff810a3ae0>] ? manage_workers+0x210/0x210
>  [<ffffffff810ac870>] kthread+0x100/0x110
>  [<ffffffff810ac770>] ? __init_kthread_worker+0x70/0x70
>  [<ffffffff81812e2c>] ret_from_fork+0x7c/0xb0
>  [<ffffffff810ac770>] ? __init_kthread_worker+0x70/0x70
> ---[ end trace add5cefed72186f8 ]---
> EXT4-fs (dm-3): ext4_convert_unwritten_extents:4522: inode #12: block
> 1379787: len 21: ext4_ext_map_blocks returned -5
> EXT4-fs (dm-3): failed to convert unwritten extents to written
> extents -- potential data loss!  (inode 12, offset 5651562496, size
> 131072, error -5)
> 
> I've run 286'th xfstest (this is my own copy of xfstest so 286'th test
> is differ from mainstream one) you can find it here
> https://raw.github.com/dmonakhov/xfstests/devel/286
> In two words it is stress test which run DIO/AIO,truncate,fallocate in parallel.
> Also you need recent FIO(http://git.kernel.dk/?p=fio.git;a=summary)
> 
> Currently I try to understand what caused this issue.
> > 
> > So we disable merging of uninitialized extents which allows us to simplify
> > the code. Extents will get merged after they are converted to initialized
> > ones.
> > 
> > Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/extents.c |   61 +++++++++++++++-------------------------------------
> >  1 files changed, 18 insertions(+), 43 deletions(-)
> > 
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 26af228..f1ce33a 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -54,9 +54,6 @@
> >  #define EXT4_EXT_MARK_UNINIT1	0x2  /* mark first half uninitialized */
> >  #define EXT4_EXT_MARK_UNINIT2	0x4  /* mark second half uninitialized */
> >  
> > -#define EXT4_EXT_DATA_VALID1	0x8  /* first half contains valid data */
> > -#define EXT4_EXT_DATA_VALID2	0x10 /* second half contains valid data */
> > -
> >  static __le32 ext4_extent_block_csum(struct inode *inode,
> >  				     struct ext4_extent_header *eh)
> >  {
> > @@ -1579,20 +1576,17 @@ int
> >  ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
> >  				struct ext4_extent *ex2)
> >  {
> > -	unsigned short ext1_ee_len, ext2_ee_len, max_len;
> > +	unsigned ext1_ee_len, ext2_ee_len;
> >  
> >  	/*
> > -	 * Make sure that either both extents are uninitialized, or
> > -	 * both are _not_.
> > +	 * Make sure that both extents are initialized. We don't merge
> > +	 * uninitialized extents so that we can be sure that end_io code has
> > +	 * the extent that was written properly split out and conversion to
> > +	 * initialized is trivial.
> >  	 */
> > -	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
> > +	if (ext4_ext_is_uninitialized(ex1) || ext4_ext_is_uninitialized(ex2))
> >  		return 0;
> >  
> > -	if (ext4_ext_is_uninitialized(ex1))
> > -		max_len = EXT_UNINIT_MAX_LEN;
> > -	else
> > -		max_len = EXT_INIT_MAX_LEN;
> > -
> >  	ext1_ee_len = ext4_ext_get_actual_len(ex1);
> >  	ext2_ee_len = ext4_ext_get_actual_len(ex2);
> >  
> > @@ -1605,7 +1599,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
> >  	 * as an RO_COMPAT feature, refuse to merge to extents if
> >  	 * this can result in the top bit of ee_len being set.
> >  	 */
> > -	if (ext1_ee_len + ext2_ee_len > max_len)
> > +	if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
> >  		return 0;
> >  #ifdef AGGRESSIVE_TEST
> >  	if (ext1_ee_len >= 4)
> > @@ -2959,9 +2953,6 @@ static int ext4_split_extent_at(handle_t *handle,
> >  	unsigned int ee_len, depth;
> >  	int err = 0;
> >  
> > -	BUG_ON((split_flag & (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2)) ==
> > -	       (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2));
> > -
> >  	ext_debug("ext4_split_extents_at: inode %lu, logical"
> >  		"block %llu\n", inode->i_ino, (unsigned long long)split);
> >  
> > @@ -3020,14 +3011,7 @@ static int ext4_split_extent_at(handle_t *handle,
> >  
> >  	err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
> >  	if (err == -ENOSPC && (EXT4_EXT_MAY_ZEROOUT & split_flag)) {
> > -		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
> > -			if (split_flag & EXT4_EXT_DATA_VALID1)
> > -				err = ext4_ext_zeroout(inode, ex2);
> > -			else
> > -				err = ext4_ext_zeroout(inode, ex);
> > -		} else
> > -			err = ext4_ext_zeroout(inode, &orig_ex);
> > -
> > +		err = ext4_ext_zeroout(inode, &orig_ex);
> >  		if (err)
> >  			goto fix_extent_len;
> >  		/* update the extent length and mark as initialized */
> > @@ -3085,8 +3069,6 @@ static int ext4_split_extent(handle_t *handle,
> >  		if (uninitialized)
> >  			split_flag1 |= EXT4_EXT_MARK_UNINIT1 |
> >  				       EXT4_EXT_MARK_UNINIT2;
> > -		if (split_flag & EXT4_EXT_DATA_VALID2)
> > -			split_flag1 |= EXT4_EXT_DATA_VALID1;
> >  		err = ext4_split_extent_at(handle, inode, path,
> >  				map->m_lblk + map->m_len, split_flag1, flags1);
> >  		if (err)
> > @@ -3099,8 +3081,7 @@ static int ext4_split_extent(handle_t *handle,
> >  		return PTR_ERR(path);
> >  
> >  	if (map->m_lblk >= ee_block) {
> > -		split_flag1 = split_flag & (EXT4_EXT_MAY_ZEROOUT |
> > -					    EXT4_EXT_DATA_VALID2);
> > +		split_flag1 = split_flag & EXT4_EXT_MAY_ZEROOUT;
> >  		if (uninitialized)
> >  			split_flag1 |= EXT4_EXT_MARK_UNINIT1;
> >  		if (split_flag & EXT4_EXT_MARK_UNINIT2)
> > @@ -3379,8 +3360,7 @@ static int ext4_split_unwritten_extents(handle_t *handle,
> >  
> >  	split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0;
> >  	split_flag |= EXT4_EXT_MARK_UNINIT2;
> > -	if (flags & EXT4_GET_BLOCKS_CONVERT)
> > -		split_flag |= EXT4_EXT_DATA_VALID2;
> > +
> >  	flags |= EXT4_GET_BLOCKS_PRE_IO;
> >  	return ext4_split_extent(handle, inode, path, map, split_flag, flags);
> >  }
> > @@ -3405,20 +3385,15 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
> >  		"block %llu, max_blocks %u\n", inode->i_ino,
> >  		  (unsigned long long)ee_block, ee_len);
> >  
> > -	/* If extent is larger than requested then split is required */
> > +	/* Extent is larger than requested? */
> >  	if (ee_block != map->m_lblk || ee_len > map->m_len) {
> > -		err = ext4_split_unwritten_extents(handle, inode, map, path,
> > -						   EXT4_GET_BLOCKS_CONVERT);
> > -		if (err < 0)
> > -			goto out;
> > -		ext4_ext_drop_refs(path);
> > -		path = ext4_ext_find_extent(inode, map->m_lblk, path);
> > -		if (IS_ERR(path)) {
> > -			err = PTR_ERR(path);
> > -			goto out;
> > -		}
> > -		depth = ext_depth(inode);
> > -		ex = path[depth].p_ext;
> > +		EXT4_ERROR_INODE(inode, "Written extent modified before IO"
> > +			" finished: extent logical block %llu, len %u; IO"
> > +			" logical block %llu, len %u\n",
> > +			(unsigned long long)ee_block, ee_len,
> > +			(unsigned long long)map->m_lblk, map->m_len);
> > +		err = -EIO;
> > +		goto out;
> >  	}
> >  
> >  	err = ext4_ext_get_access(handle, inode, path + depth);
> > -- 
> > 1.7.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-31  7:47     ` Dmitry Monakhov
@ 2013-01-31 12:39       ` Jan Kara
  2013-01-31 14:09         ` Dmitry Monakhov
  2013-01-31 16:54       ` Theodore Ts'o
  1 sibling, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-01-31 12:39 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, Ted Tso, linux-ext4

On Thu 31-01-13 11:47:23, Dmitry Monakhov wrote:
> On Thu, 24 Jan 2013 13:49:45 +0400, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> > On Fri, 18 Jan 2013 13:00:38 +0100, Jan Kara <jack@suse.cz> wrote:
> > > Merging of uninitialized extents creates all sorts of interesting race
> > > possibilities when writeback / DIO races with fallocate. Thus
> > > ext4_convert_unwritten_extents_endio() has to deal with a case where
> > > extent to be converted needs to be split out first. That isn't nice
> > > for two reasons:
> > > 
> > > 1) It may need allocation of extent tree block so ENOSPC is possible.
> > > 2) It complicates end_io handling code
> > As we already discussed your idea is 100% correct, but even with
> > what patch I still able to trigger situation where split it required.
> > I've got following error with this patch applied on top of 7f5118629f7
> > EXT4-fs error (device dm-3): ext4_convert_unwritten_extents_endio:3411:
> > inode #12: comm kworker/u:4: Written extent modified before IO finished:
> > extent logical block 1379787, len 64; IO logical block 1379787, len 21
> > 
> > ------------[ cut here ]------------
> > WARNING: at fs/ext4/extents.c:4518
> > ext4_convert_unwritten_extents+0x149/0x210 [ext4]()
> OK I've found it. I'm a bit disappointed, it is even not a race
> condition, but simple corruption.
> Patch is available here: http://article.gmane.org/gmane.comp.file-systems.ext4/36762
> link for sain mailer client: <1359617098-18451-1-git-send-email-dmonakhov@openvz.org>
> After this bug was fixed it is safe to apply both Jan's patches:
> [PATCH 04/12] ext4: Disable merging of uninitialized extents
> [PATCH 05/12] ext4: Remove unnecessary wait for extent conversion in ext4_fallocate()
> At least it survives after all my tests.
  Thanks for debugging this. I was looking into the code but it didn't come
to my mind what happens when ext4_ext_split() fails.
 
> BTW: It is appeared that ext4_debug() infrastructure is almost unusable
> because based on printk() instead of light-wait event tracing infrastructure.
> I'm now work on patch-set which fix that.
  That would be certainly welcome. ext4_debug() infrastructure comes from
times when tracing didn't exist so we used what we could. These days it's
cumbersome...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-31 12:39       ` Jan Kara
@ 2013-01-31 14:09         ` Dmitry Monakhov
  0 siblings, 0 replies; 53+ messages in thread
From: Dmitry Monakhov @ 2013-01-31 14:09 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jan Kara, Ted Tso, linux-ext4

On Thu, 31 Jan 2013 13:39:30 +0100, Jan Kara <jack@suse.cz> wrote:
> On Thu 31-01-13 11:47:23, Dmitry Monakhov wrote:
> > On Thu, 24 Jan 2013 13:49:45 +0400, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> > > On Fri, 18 Jan 2013 13:00:38 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > Merging of uninitialized extents creates all sorts of interesting race
> > > > possibilities when writeback / DIO races with fallocate. Thus
> > > > ext4_convert_unwritten_extents_endio() has to deal with a case where
> > > > extent to be converted needs to be split out first. That isn't nice
> > > > for two reasons:
> > > > 
> > > > 1) It may need allocation of extent tree block so ENOSPC is possible.
> > > > 2) It complicates end_io handling code
> > > As we already discussed your idea is 100% correct, but even with
> > > what patch I still able to trigger situation where split it required.
> > > I've got following error with this patch applied on top of 7f5118629f7
> > > EXT4-fs error (device dm-3): ext4_convert_unwritten_extents_endio:3411:
> > > inode #12: comm kworker/u:4: Written extent modified before IO finished:
> > > extent logical block 1379787, len 64; IO logical block 1379787, len 21
> > > 
> > > ------------[ cut here ]------------
> > > WARNING: at fs/ext4/extents.c:4518
> > > ext4_convert_unwritten_extents+0x149/0x210 [ext4]()
> > OK I've found it. I'm a bit disappointed, it is even not a race
> > condition, but simple corruption.
> > Patch is available here: http://article.gmane.org/gmane.comp.file-systems.ext4/36762
> > link for sain mailer client: <1359617098-18451-1-git-send-email-dmonakhov@openvz.org>
> > After this bug was fixed it is safe to apply both Jan's patches:
> > [PATCH 04/12] ext4: Disable merging of uninitialized extents
> > [PATCH 05/12] ext4: Remove unnecessary wait for extent conversion in ext4_fallocate()
> > At least it survives after all my tests.
>   Thanks for debugging this. I was looking into the code but it didn't come
> to my mind what happens when ext4_ext_split() fails.
ext4_ext_split does have not fail.
EXTENT:a----------b-------------c------d
It call ext4_ext_split_at twice:
first ext4_ext_split_at() should split [a,d] to [a,c],[c,d]
second ext4_ext_split_at() should split [a,c] to [a,b],[b,c]

But if first ext4_ext_split_at() internally failed due to ENOSPC it will
does ext_ext_zeroout() which is correct(if ).
Second one expect to operate on [a,c] which should be uninitialized,
but it operate on [a,d] which was initialized. At the end it mark
extent as uninitialized here:
        if (split == ee_block) {
                /*                                                                                                                               
                 * case b: block @split is the block that the extent
                 begins with                                                                 
                 * then we just change the state of the extent, and
                 splitting                                                                    
                 * is not needed.                                                                                                                
                 */
                if (split_flag & EXT4_EXT_MARK_UNINIT2)
                        ext4_ext_mark_uninitialized(ex);
####^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                else
                        ext4_ext_mark_initialized(ex);
...

In fact my first ext4_ext_split_at() result in zeroout we can skip
second one because it is not necessary, but I just clear
EXT4_EXT_MARK_UNINIT2 flag if extent becomes initialized.
And fairly to say my patch is not quite right because it is not
not correct to perform zeroout on initialized extent.
I'll send updated version in a minute.

>  
> > BTW: It is appeared that ext4_debug() infrastructure is almost unusable
> > because based on printk() instead of light-wait event tracing infrastructure.
> > I'm now work on patch-set which fix that.
>   That would be certainly welcome. ext4_debug() infrastructure comes from
> times when tracing didn't exist so we used what we could. These days it's
> cumbersome...
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-31  7:47     ` Dmitry Monakhov
  2013-01-31 12:39       ` Jan Kara
@ 2013-01-31 16:54       ` Theodore Ts'o
  1 sibling, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-01-31 16:54 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4

On Thu, Jan 31, 2013 at 11:47:23AM +0400, Dmitry Monakhov wrote:
> OK I've found it. I'm a bit disappointed, it is even not a race
> condition, but simple corruption.
> Patch is available here: http://article.gmane.org/gmane.comp.file-systems.ext4/36762
> link for sain mailer client: <1359617098-18451-1-git-send-email-dmonakhov@openvz.org>
> After this bug was fixed it is safe to apply both Jan's patches:
> [PATCH 04/12] ext4: Disable merging of uninitialized extents
> [PATCH 05/12] ext4: Remove unnecessary wait for extent conversion in ext4_fallocate()
> At least it survives after all my tests.

Thanks for finding it!  As folks have probably noticed the dev branch
and linux-next already has all of the other patches (except for #3,
which we've agreed isn't needed since the original code was correct
as-is).  I'll work on reviewing your patch and then applying the last
three, and then over the weekend I'll push the master branch up, thus
casting all of the patches currently in the tree into stone.

> BTW: It is appeared that ext4_debug() infrastructure is almost unusable
> because based on printk() instead of light-wait event tracing infrastructure.
> I'm now work on patch-set which fix that.

Yes, I haven't used the ext4_debug infrastructure in quite some time.
As I've needed to do debugging I've added new tracepoints, and in some
cases I hadn't gotten around to removing the old ext4_debug code, on
the assumption that maybe someone else was still using it, but mostly
because I was lazy.   That would be a great thing to do.

	      	      	   	      	    - Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* REGRESSION: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-01-18 12:00 ` [PATCH 04/12] ext4: Disable merging of uninitialized extents Jan Kara
  2013-01-24  9:49   ` Dmitry Monakhov
@ 2013-02-09 17:10   ` Theodore Ts'o
  2013-02-12 21:58     ` Jan Kara
  2013-02-14 16:11     ` Jan Kara
  1 sibling, 2 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-02-09 17:10 UTC (permalink / raw)
  To: Jan Kara, Dmitry Monakhov; +Cc: linux-ext4

On Fri, Jan 18, 2013 at 01:00:38PM +0100, Jan Kara wrote:
> Merging of uninitialized extents creates all sorts of interesting race
> possibilities when writeback / DIO races with fallocate. Thus
> ext4_convert_unwritten_extents_endio() has to deal with a case where
> extent to be converted needs to be split out first. That isn't nice
> for two reasons:
> 
> 1) It may need allocation of extent tree block so ENOSPC is possible.
> 2) It complicates end_io handling code
> 
> So we disable merging of uninitialized extents which allows us to simplify
> the code. Extents will get merged after they are converted to initialized
> ones.
> 
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Sorry for not noticing this earlier, but this patch is causing a
regression.  It is loading to test 113 failing when dioread_nolock is
used:

113	[   47.619363] EXT4-fs error (device vdb): ext4_convert_unwritten_extents_endio:3411: i
node #10951: comm kworker/u:0: Written extent modified before IO finished: extent logical block
 1024, len 1024; IO logical block 1024, len 127
[   47.619363] 
[   47.623239] EXT4-fs warning (device vdb): ext4_convert_unwritten_extents:4522: inode #10951:
 block 1024: len 127: ext4_ext_map_blocks returned -5
[   47.628975] EXT4-fs (vdb): failed to convert unwritten extents to written extents -- potenti
al data loss!  (inode 10951, offset 4194304, size 520192, error -5)


As a result, I am considering whether or not I should to drop the
following patches from the ext4 tree:

e63dd9c ext4: disable merging of uninitialized extents
de39534 ext4: remove unnecessary wait for extent conversion in ext4_fallocate()
37bf0a8 ext4: ext4_split_extent should take care of extent zeroout

I know that these patches fix other potential races which causes data
loss, but they've been around for a while, and in practice seem to be
relatively rarely hit.

Jan, Dmitry, any chance you can take a look at this, and give me
suggestions about how to proceed at this point?

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: REGRESSION: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-02-09 17:10   ` REGRESSION: " Theodore Ts'o
@ 2013-02-12 21:58     ` Jan Kara
  2013-02-13  4:57       ` Theodore Ts'o
  2013-02-14 16:11     ` Jan Kara
  1 sibling, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-02-12 21:58 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, Dmitry Monakhov, linux-ext4

On Sat 09-02-13 12:10:15, Ted Tso wrote:
> On Fri, Jan 18, 2013 at 01:00:38PM +0100, Jan Kara wrote:
> > Merging of uninitialized extents creates all sorts of interesting race
> > possibilities when writeback / DIO races with fallocate. Thus
> > ext4_convert_unwritten_extents_endio() has to deal with a case where
> > extent to be converted needs to be split out first. That isn't nice
> > for two reasons:
> > 
> > 1) It may need allocation of extent tree block so ENOSPC is possible.
> > 2) It complicates end_io handling code
> > 
> > So we disable merging of uninitialized extents which allows us to simplify
> > the code. Extents will get merged after they are converted to initialized
> > ones.
> > 
> > Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Sorry for not noticing this earlier, but this patch is causing a
> regression.  It is loading to test 113 failing when dioread_nolock is
> used:
> 
> 113	[   47.619363] EXT4-fs error (device vdb): ext4_convert_unwritten_extents_endio:3411: i
> node #10951: comm kworker/u:0: Written extent modified before IO finished: extent logical block
>  1024, len 1024; IO logical block 1024, len 127
> [   47.619363] 
> [   47.623239] EXT4-fs warning (device vdb): ext4_convert_unwritten_extents:4522: inode #10951:
>  block 1024: len 127: ext4_ext_map_blocks returned -5
> [   47.628975] EXT4-fs (vdb): failed to convert unwritten extents to written extents -- potenti
> al data loss!  (inode 10951, offset 4194304, size 520192, error -5)
  Hum, yes. Thanks for letting us know. I was able to reproduce this and
I'm looking into it now. I guess it is a similar problem as Dmitry fixed in
his fix...

> As a result, I am considering whether or not I should to drop the
> following patches from the ext4 tree:
> 
> e63dd9c ext4: disable merging of uninitialized extents
> de39534 ext4: remove unnecessary wait for extent conversion in ext4_fallocate()
> 37bf0a8 ext4: ext4_split_extent should take care of extent zeroout
> 
> I know that these patches fix other potential races which causes data
> loss, but they've been around for a while, and in practice seem to be
> relatively rarely hit.
  The third patch is a fix which shouldn't cause any issues. So you can
take just that one and leave the other two aside until we are able to
resolve the issue.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: REGRESSION: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-02-12 21:58     ` Jan Kara
@ 2013-02-13  4:57       ` Theodore Ts'o
  2013-02-13  7:26         ` Dmitry Monakhov
  0 siblings, 1 reply; 53+ messages in thread
From: Theodore Ts'o @ 2013-02-13  4:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dmitry Monakhov, linux-ext4

On Tue, Feb 12, 2013 at 10:58:32PM +0100, Jan Kara wrote:
>   The third patch is a fix which shouldn't cause any issues. So you can
> take just that one and leave the other two aside until we are able to
> resolve the issue.

I thought the third patch depending on the first two?  Certainly it
doesn't apply cleanly without the first two patches...

	      	      	      	  	- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: REGRESSION: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-02-13  4:57       ` Theodore Ts'o
@ 2013-02-13  7:26         ` Dmitry Monakhov
  2013-02-13 15:08           ` Merge window planning for ext4 and Ted's vacation Theodore Ts'o
  2013-02-14 10:47           ` REGRESSION: [PATCH 04/12] ext4: Disable merging of uninitialized extents Jan Kara
  0 siblings, 2 replies; 53+ messages in thread
From: Dmitry Monakhov @ 2013-02-13  7:26 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Kara; +Cc: linux-ext4

On Tue, 12 Feb 2013 23:57:59 -0500, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Feb 12, 2013 at 10:58:32PM +0100, Jan Kara wrote:
> >   The third patch is a fix which shouldn't cause any issues. So you can
> > take just that one and leave the other two aside until we are able to
> > resolve the issue.
> 
> I thought the third patch depending on the first two?  Certainly it
> doesn't apply cleanly without the first two patches...
My patch fix old issue, but i've prepared it on top of Jan's patches
only for simplicity. I'll send new version which not depend on his
patches today.
Over-all Jan's statment that split should not happen inside end_io
and it is clear sing of a bug is absolutely right decision.
This helps us to spot several hidden issues (number is still unknown)
so may be it is reasonable to split first patch in two parts:
1) disable uninitialized extents merging itself.
2) Print warning if split is required inside end_io(so only warning will
be printed, but w/o data corruption)
3) Get rid of extent split machinery from end_io (because it is not
longer valid situation)

(1) and (2) should be accepted ASAP and will help us to spot and fix
other hidden issues. And we fix all related issues it will be safe
to apply (3)'rd one.
I'll send patches soon.
> 
> 	      	      	      	  	- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Merge window planning for ext4 and Ted's vacation
  2013-02-13  7:26         ` Dmitry Monakhov
@ 2013-02-13 15:08           ` Theodore Ts'o
  2013-02-14 10:47           ` REGRESSION: [PATCH 04/12] ext4: Disable merging of uninitialized extents Jan Kara
  1 sibling, 0 replies; 53+ messages in thread
From: Theodore Ts'o @ 2013-02-13 15:08 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4, Zheng Liu

On Wed, Feb 13, 2013 at 11:26:40AM +0400, Dmitry Monakhov wrote:
> This helps us to spot several hidden issues (number is still unknown)
> so may be it is reasonable to split first patch in two parts:
> 1) disable uninitialized extents merging itself.
> 2) Print warning if split is required inside end_io(so only warning will
> be printed, but w/o data corruption)
> 3) Get rid of extent split machinery from end_io (because it is not
> longer valid situation)
> 
> (1) and (2) should be accepted ASAP and will help us to spot and fix
> other hidden issues. And we fix all related issues it will be safe
> to apply (3)'rd one.
> I'll send patches soon.

That seems like reasonable approach; I'm not sure we'll be able to get
to (3) by the time the merge window opens, but getting (1) and (2) in
would be great; if the warning is getting triggered a lot, we might
need to only enable it under CONFIG_EXT4_DEBUG, though.

If you could you base the patches off of tip of the dev branch (which
I just updated), I'd really appreciate it.  The last commit on the
newly updated dev branch is:

7eedefe ext4: support simple conversion of extent-mapped inodes to use i_blocks

In this updated dev branch, I've moved the extent status patches to
the unstable portion of the tree due to the bigalloc regressions.  So
the plan for the merge window is that currently, the extent status
patches and the uninitialized extent merging patches are currently on
the unstable portion of the tree here:

	http://repo.or.cz/w/ext4-patch-queue.git
	git://repo.or.cz/ext4-patch-queue.git

That way we can do intensive testing on the rest of the ext4 tree.
Once the rest of the patches have been validated, I'll push the master
branch on the ext4 git tree to point at the fully validated set of
patches, and then we can try to push the rest of the patches that have
been moved past the stable-boundary in the ext4 patch queue back onto
the dev branch, and see if we can shake out the rest of the problems
before the merge window opens.

I will be leaving for Hawaii tomorrow, so the amount of time I will
have to do a development/debugging work will be limited.  On the other
hand, it's easy to let regression tests run in the hotel room while
I'm visiting volcano craters, so please do send me updated patches,
and I'll review and test them as I have time.  The joys of having the
pre-merge window crunch coincide with your vacation...  :-)

    	    	       	     	    	     - Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: REGRESSION: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-02-13  7:26         ` Dmitry Monakhov
  2013-02-13 15:08           ` Merge window planning for ext4 and Ted's vacation Theodore Ts'o
@ 2013-02-14 10:47           ` Jan Kara
  1 sibling, 0 replies; 53+ messages in thread
From: Jan Kara @ 2013-02-14 10:47 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Theodore Ts'o, Jan Kara, linux-ext4

On Wed 13-02-13 11:26:40, Dmitry Monakhov wrote:
> On Tue, 12 Feb 2013 23:57:59 -0500, Theodore Ts'o <tytso@mit.edu> wrote:
> > On Tue, Feb 12, 2013 at 10:58:32PM +0100, Jan Kara wrote:
> > >   The third patch is a fix which shouldn't cause any issues. So you can
> > > take just that one and leave the other two aside until we are able to
> > > resolve the issue.
> > 
> > I thought the third patch depending on the first two?  Certainly it
> > doesn't apply cleanly without the first two patches...
> My patch fix old issue, but i've prepared it on top of Jan's patches
> only for simplicity. I'll send new version which not depend on his
> patches today.
> Over-all Jan's statment that split should not happen inside end_io
> and it is clear sing of a bug is absolutely right decision.
> This helps us to spot several hidden issues (number is still unknown)
> so may be it is reasonable to split first patch in two parts:
> 1) disable uninitialized extents merging itself.
> 2) Print warning if split is required inside end_io(so only warning will
> be printed, but w/o data corruption)
> 3) Get rid of extent split machinery from end_io (because it is not
> longer valid situation)
> 
> (1) and (2) should be accepted ASAP and will help us to spot and fix
> other hidden issues. And we fix all related issues it will be safe
> to apply (3)'rd one.
> I'll send patches soon.
  I agree with this plan, I just think the warnings are currently too easy
to trigger (just random AIO DIO writing seems to trigger them with
dioread_nolock) so as Ted said we'll probably have to hide them under
EXT4_DEBUG or something like that. I'll try to debug the current problem
while you prepare the patches :).

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: REGRESSION: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-02-09 17:10   ` REGRESSION: " Theodore Ts'o
  2013-02-12 21:58     ` Jan Kara
@ 2013-02-14 16:11     ` Jan Kara
  2013-02-14 19:05       ` Theodore Ts'o
  1 sibling, 1 reply; 53+ messages in thread
From: Jan Kara @ 2013-02-14 16:11 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, Dmitry Monakhov, linux-ext4

On Sat 09-02-13 12:10:15, Ted Tso wrote:
> On Fri, Jan 18, 2013 at 01:00:38PM +0100, Jan Kara wrote:
> > Merging of uninitialized extents creates all sorts of interesting race
> > possibilities when writeback / DIO races with fallocate. Thus
> > ext4_convert_unwritten_extents_endio() has to deal with a case where
> > extent to be converted needs to be split out first. That isn't nice
> > for two reasons:
> > 
> > 1) It may need allocation of extent tree block so ENOSPC is possible.
> > 2) It complicates end_io handling code
> > 
> > So we disable merging of uninitialized extents which allows us to simplify
> > the code. Extents will get merged after they are converted to initialized
> > ones.
> > 
> > Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Sorry for not noticing this earlier, but this patch is causing a
> regression.  It is loading to test 113 failing when dioread_nolock is
> used:
> 
> 113	[   47.619363] EXT4-fs error (device vdb): ext4_convert_unwritten_extents_endio:3411: i
> node #10951: comm kworker/u:0: Written extent modified before IO finished: extent logical block
>  1024, len 1024; IO logical block 1024, len 127
> [   47.619363] 
> [   47.623239] EXT4-fs warning (device vdb): ext4_convert_unwritten_extents:4522: inode #10951:
>  block 1024: len 127: ext4_ext_map_blocks returned -5
> [   47.628975] EXT4-fs (vdb): failed to convert unwritten extents to written extents -- potenti
> al data loss!  (inode 10951, offset 4194304, size 520192, error -5)
> 
> 
> As a result, I am considering whether or not I should to drop the
> following patches from the ext4 tree:
> 
> e63dd9c ext4: disable merging of uninitialized extents
> de39534 ext4: remove unnecessary wait for extent conversion in ext4_fallocate()
> 37bf0a8 ext4: ext4_split_extent should take care of extent zeroout
> 
> I know that these patches fix other potential races which causes data
> loss, but they've been around for a while, and in practice seem to be
> relatively rarely hit.
  OK, so I've debugged this. It wasn't actually that hard. The problem is
that mpage_da_map_and_submit() prepares extent with e.g. 256 blocks but
later we submit a shorter bio e.g. because it cannot carry that many pages.
So ->end_io is called only for first 128 blocks or so. I spotted this
problem already before, just it didn't come up when Ted sent me this bug
report.

The fix isn't trivial. What we need to do is to be able to attach multiple
bios to one io_end structure and start the conversion only once they are
all finished. I actually have patches for this in the second part of my
patch set. So for this merge window I'd just do what Dmitry suggested (and
the warning can be triggered really trivially by a sequential write with
dioread_nolock so that definitely has to be hidden by default). And I'll go
off to finish that second part of my patch set so that it can get to Ted's
tree as soon as possible.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: REGRESSION: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-02-14 16:11     ` Jan Kara
@ 2013-02-14 19:05       ` Theodore Ts'o
  2013-02-14 21:32         ` Jan Kara
  0 siblings, 1 reply; 53+ messages in thread
From: Theodore Ts'o @ 2013-02-14 19:05 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dmitry Monakhov, linux-ext4

On Thu, Feb 14, 2013 at 05:11:16PM +0100, Jan Kara wrote:
>   OK, so I've debugged this. It wasn't actually that hard. The problem is
> that mpage_da_map_and_submit() prepares extent with e.g. 256 blocks but
> later we submit a shorter bio e.g. because it cannot carry that many pages.
> So ->end_io is called only for first 128 blocks or so. I spotted this
> problem already before, just it didn't come up when Ted sent me this bug
> report.

So can we get a corruption without these patches, but it was a lot
harder to hit that case before?

						- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: REGRESSION: [PATCH 04/12] ext4: Disable merging of uninitialized extents
  2013-02-14 19:05       ` Theodore Ts'o
@ 2013-02-14 21:32         ` Jan Kara
  0 siblings, 0 replies; 53+ messages in thread
From: Jan Kara @ 2013-02-14 21:32 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, Dmitry Monakhov, linux-ext4

On Thu 14-02-13 14:05:55, Ted Tso wrote:
> On Thu, Feb 14, 2013 at 05:11:16PM +0100, Jan Kara wrote:
> >   OK, so I've debugged this. It wasn't actually that hard. The problem is
> > that mpage_da_map_and_submit() prepares extent with e.g. 256 blocks but
> > later we submit a shorter bio e.g. because it cannot carry that many pages.
> > So ->end_io is called only for first 128 blocks or so. I spotted this
> > problem already before, just it didn't come up when Ted sent me this bug
> > report.
> 
> So can we get a corruption without these patches, but it was a lot
> harder to hit that case before?
  When the bio is shorter we just need to split the unwritten extent when
processing end_io callback. That is properly handled after Dmitry's fixes
so corruption can happen only when we are close to ENOSPC and the split
fails. Now we started seeing the split is happening because one of the
patches added the warning that the split from end_io is happening...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2013-02-14 21:32 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-18 12:00 [PATCH 0/12 v2] ext4: Several simplifications and fixes Jan Kara
2013-01-18 12:00 ` [PATCH 01/12] ext4: Always use ext4_bio_write_page() for writeout Jan Kara
2013-01-28 14:31   ` Theodore Ts'o
2013-01-18 12:00 ` [PATCH 02/12] ext4: Use redirty_page_for_writepage() in ext4_bio_write_page() Jan Kara
2013-01-28 14:34   ` Theodore Ts'o
2013-01-18 12:00 ` [PATCH 03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO Jan Kara
2013-01-22 11:11   ` Dmitry Monakhov
2013-01-22 13:44     ` Jan Kara
2013-01-22 14:12       ` Dmitry Monakhov
2013-01-22 15:21         ` Jan Kara
2013-01-22 14:22       ` Zheng Liu
2013-01-22 15:22         ` Jan Kara
2013-01-22 16:00           ` Zheng Liu
2013-01-22 23:14             ` Jan Kara
2013-01-23  6:11               ` Zheng Liu
2013-01-23  9:42                 ` Jan Kara
2013-01-18 12:00 ` [PATCH 04/12] ext4: Disable merging of uninitialized extents Jan Kara
2013-01-24  9:49   ` Dmitry Monakhov
2013-01-24 15:12     ` Jan Kara
2013-01-24 15:32       ` Dmitry Monakhov
2013-01-28 14:36     ` Theodore Ts'o
2013-01-28 15:02       ` Dmitry Monakhov
2013-01-28 15:38         ` Theodore Ts'o
2013-01-29  7:41           ` Dmitry Monakhov
2013-01-29  8:37             ` Zheng Liu
2013-01-31  7:47     ` Dmitry Monakhov
2013-01-31 12:39       ` Jan Kara
2013-01-31 14:09         ` Dmitry Monakhov
2013-01-31 16:54       ` Theodore Ts'o
2013-02-09 17:10   ` REGRESSION: " Theodore Ts'o
2013-02-12 21:58     ` Jan Kara
2013-02-13  4:57       ` Theodore Ts'o
2013-02-13  7:26         ` Dmitry Monakhov
2013-02-13 15:08           ` Merge window planning for ext4 and Ted's vacation Theodore Ts'o
2013-02-14 10:47           ` REGRESSION: [PATCH 04/12] ext4: Disable merging of uninitialized extents Jan Kara
2013-02-14 16:11     ` Jan Kara
2013-02-14 19:05       ` Theodore Ts'o
2013-02-14 21:32         ` Jan Kara
2013-01-18 12:00 ` [PATCH 05/12] ext4: Remove unnecessary wait for extent conversion in ext4_fallocate() Jan Kara
2013-01-18 12:00 ` [PATCH 06/12] ext4: Move work from io_end to inode Jan Kara
2013-01-28 14:45   ` Theodore Ts'o
2013-01-18 12:00 ` [PATCH 07/12] ext4: Simplify list handling in ext4_do_flush_completed_IO() Jan Kara
2013-01-28 14:51   ` Theodore Ts'o
2013-01-18 12:00 ` [PATCH 08/12] ext4: Remove __ext4_journalled_writepage() from mpage_da_submit_io() Jan Kara
2013-01-28 14:40   ` Theodore Ts'o
2013-01-18 12:00 ` [PATCH 09/12] ext4: Dirty page has always buffers attached Jan Kara
2013-01-28 17:55   ` Theodore Ts'o
2013-01-18 12:00 ` [PATCH 10/12] ext4: Simplify mpage_add_bh_to_extent() Jan Kara
2013-01-28 18:06   ` Theodore Ts'o
2013-01-18 12:00 ` [PATCH 11/12] ext4: Make ext4_bio_writepage() handle unprepared buffers Jan Kara
2013-01-29  1:59   ` Theodore Ts'o
2013-01-18 12:00 ` [PATCH 12/12] ext4: Fix ext4_writepage() to achieve data=ordered guarantees Jan Kara
2013-01-29  2:08   ` Theodore Ts'o

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.