* [PATCH 00/29 v1] Fixes and improvements in ext4 writeback path
@ 2013-04-08 21:32 Jan Kara
  2013-04-08 21:32 ` [PATCH 01/29] ext4: Make ext4_bio_write_page() use BH_Async_Write flags instead of page pointers from ext4_io_end Jan Kara
                   ` (28 more replies)
  0 siblings, 29 replies; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

  Hello,

  this is my series of patches improving the ext4 writeback path and somewhat
cleaning up that code. Some things this patch set achieves:
* the ext4_io_end structure doesn't contain page pointers anymore, so it is
  significantly smaller (by about 1 KB) (patch 1)
* bio splitting is now handled properly, so we no longer hit warnings about
  extents changing while IO was in progress (patch 2)
* JBD2 supports transaction reservations - a way to start a transaction
  without blocking on the journal (patch 12)
* cleanups of ext4_da_writepages() and related code (patches 14-18)
* we clear the PageWriteback bit only after extents are converted (patch 22)
* thus we can remove waits for unwritten extent conversion

  I've tested the patches with xfstests in different configurations (default,
dioread_nolock, nojournal, 1 KB blocksize). The patches are based on the ext4
development branch (and actually depend on some of the fixes there from Dmitry
and Zheng). I understand this might be too big a chunk to swallow at once, so
I could split out some easier parts for the nearest merge window. But for now
I wanted to post everything anyway for people to review.

								Honza


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 01/29] ext4: Make ext4_bio_write_page() use BH_Async_Write flags instead of page pointers from ext4_io_end
  2013-04-08 21:32 [PATCH 00/29 v1] Fixes and improvements in ext4 writeback path Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-04-10 18:05   ` Dmitry Monakhov
                     ` (2 more replies)
  2013-04-08 21:32 ` [PATCH 02/29] ext4: Use io_end for multiple bios Jan Kara
                   ` (27 subsequent siblings)
  28 siblings, 3 replies; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

So far ext4_bio_write_page() attached all the pages to the ext4_io_end
structure. This makes the structure pretty heavy (1 KB for pointers plus 16
bytes per page attached to the bio). Also, later we would like to share one
ext4_io_end structure among several bios in case IO to a single extent needs
to be split among several bios, and pointing to pages from ext4_io_end makes
this complex.

We remove the page pointers from ext4_io_end and use the pointers from the
bio itself instead. This isn't as easy when blocksize < pagesize, because
then we can have several bios in flight for a single page and we have to be
careful about when to call end_page_writeback(). However, this is a known
problem already solved by block_write_full_page() / end_buffer_async_write(),
so we mimic their behavior here. We mark buffers going to disk with the
BH_Async_Write flag and in ext4_end_bio() we check whether there are any
buffers with the BH_Async_Write flag left. If there are none, we can call
end_page_writeback().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |   14 -----
 fs/ext4/page-io.c |  163 +++++++++++++++++++++++++----------------------------
 2 files changed, 77 insertions(+), 100 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4a01ba3..3c70547 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -196,19 +196,8 @@ struct mpage_da_data {
 #define EXT4_IO_END_ERROR	0x0002
 #define EXT4_IO_END_DIRECT	0x0004
 
-struct ext4_io_page {
-	struct page	*p_page;
-	atomic_t	p_count;
-};
-
-#define MAX_IO_PAGES 128
-
 /*
  * For converting uninitialized extents on a work queue.
- *
- * 'page' is only used from the writepage() path; 'pages' is only used for
- * buffered writes; they are used to keep page references until conversion
- * takes place.  For AIO/DIO, neither field is filled in.
  */
 typedef struct ext4_io_end {
 	struct list_head	list;		/* per-file finished IO list */
@@ -218,15 +207,12 @@ typedef struct ext4_io_end {
 	ssize_t			size;		/* size of the extent */
 	struct kiocb		*iocb;		/* iocb struct for AIO */
 	int			result;		/* error value for AIO */
-	int			num_io_pages;   /* for writepages() */
-	struct ext4_io_page	*pages[MAX_IO_PAGES]; /* for writepages() */
 } ext4_io_end_t;
 
 struct ext4_io_submit {
 	int			io_op;
 	struct bio		*io_bio;
 	ext4_io_end_t		*io_end;
-	struct ext4_io_page	*io_page;
 	sector_t		io_next_block;
 };
 
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 809b310..73bc011 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -29,25 +29,19 @@
 #include "xattr.h"
 #include "acl.h"
 
-static struct kmem_cache *io_page_cachep, *io_end_cachep;
+static struct kmem_cache *io_end_cachep;
 
 int __init ext4_init_pageio(void)
 {
-	io_page_cachep = KMEM_CACHE(ext4_io_page, SLAB_RECLAIM_ACCOUNT);
-	if (io_page_cachep == NULL)
-		return -ENOMEM;
 	io_end_cachep = KMEM_CACHE(ext4_io_end, SLAB_RECLAIM_ACCOUNT);
-	if (io_end_cachep == NULL) {
-		kmem_cache_destroy(io_page_cachep);
+	if (io_end_cachep == NULL)
 		return -ENOMEM;
-	}
 	return 0;
 }
 
 void ext4_exit_pageio(void)
 {
 	kmem_cache_destroy(io_end_cachep);
-	kmem_cache_destroy(io_page_cachep);
 }
 
 void ext4_ioend_wait(struct inode *inode)
@@ -57,15 +51,6 @@ void ext4_ioend_wait(struct inode *inode)
 	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
 }
 
-static void put_io_page(struct ext4_io_page *io_page)
-{
-	if (atomic_dec_and_test(&io_page->p_count)) {
-		end_page_writeback(io_page->p_page);
-		put_page(io_page->p_page);
-		kmem_cache_free(io_page_cachep, io_page);
-	}
-}
-
 void ext4_free_io_end(ext4_io_end_t *io)
 {
 	int i;
@@ -74,9 +59,6 @@ void ext4_free_io_end(ext4_io_end_t *io)
 	BUG_ON(!list_empty(&io->list));
 	BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
 
-	for (i = 0; i < io->num_io_pages; i++)
-		put_io_page(io->pages[i]);
-	io->num_io_pages = 0;
 	if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count))
 		wake_up_all(ext4_ioend_wq(io->inode));
 	kmem_cache_free(io_end_cachep, io);
@@ -233,45 +215,56 @@ static void ext4_end_bio(struct bio *bio, int error)
 	ext4_io_end_t *io_end = bio->bi_private;
 	struct inode *inode;
 	int i;
+	int blocksize;
 	sector_t bi_sector = bio->bi_sector;
 
 	BUG_ON(!io_end);
+	inode = io_end->inode;
+	blocksize = 1 << inode->i_blkbits;
 	bio->bi_private = NULL;
 	bio->bi_end_io = NULL;
 	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = 0;
-	bio_put(bio);
-
-	for (i = 0; i < io_end->num_io_pages; i++) {
-		struct page *page = io_end->pages[i]->p_page;
+	for (i = 0; i < bio->bi_vcnt; i++) {
+		struct bio_vec *bvec = &bio->bi_io_vec[i];
+		struct page *page = bvec->bv_page;
 		struct buffer_head *bh, *head;
-		loff_t offset;
-		loff_t io_end_offset;
+		unsigned bio_start = bvec->bv_offset;
+		unsigned bio_end = bio_start + bvec->bv_len;
+		unsigned under_io = 0;
+		unsigned long flags;
+
+		if (!page)
+			continue;
 
 		if (error) {
 			SetPageError(page);
 			set_bit(AS_EIO, &page->mapping->flags);
-			head = page_buffers(page);
-			BUG_ON(!head);
-
-			io_end_offset = io_end->offset + io_end->size;
-
-			offset = (sector_t) page->index << PAGE_CACHE_SHIFT;
-			bh = head;
-			do {
-				if ((offset >= io_end->offset) &&
-				    (offset+bh->b_size <= io_end_offset))
-					buffer_io_error(bh);
-
-				offset += bh->b_size;
-				bh = bh->b_this_page;
-			} while (bh != head);
 		}
-
-		put_io_page(io_end->pages[i]);
+		bh = head = page_buffers(page);
+		/*
+		 * We check all buffers in the page under BH_Uptodate_Lock
+		 * to avoid races with other end io clearing async_write flags
+		 */
+		local_irq_save(flags);
+		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
+		do {
+			if (bh_offset(bh) < bio_start ||
+			    bh_offset(bh) + blocksize > bio_end) {
+				if (buffer_async_write(bh))
+					under_io++;
+				continue;
+			}
+			clear_buffer_async_write(bh);
+			if (error)
+				buffer_io_error(bh);
+		} while ((bh = bh->b_this_page) != head);
+		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
+		local_irq_restore(flags);
+		if (!under_io)
+			end_page_writeback(page);
 	}
-	io_end->num_io_pages = 0;
-	inode = io_end->inode;
+	bio_put(bio);
 
 	if (error) {
 		io_end->flag |= EXT4_IO_END_ERROR;
@@ -335,7 +328,6 @@ static int io_submit_init(struct ext4_io_submit *io,
 }
 
 static int io_submit_add_bh(struct ext4_io_submit *io,
-			    struct ext4_io_page *io_page,
 			    struct inode *inode,
 			    struct writeback_control *wbc,
 			    struct buffer_head *bh)
@@ -343,11 +335,6 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
 	ext4_io_end_t *io_end;
 	int ret;
 
-	if (buffer_new(bh)) {
-		clear_buffer_new(bh);
-		unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
-	}
-
 	if (io->io_bio && bh->b_blocknr != io->io_next_block) {
 submit_and_retry:
 		ext4_io_submit(io);
@@ -358,9 +345,6 @@ submit_and_retry:
 			return ret;
 	}
 	io_end = io->io_end;
-	if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
-	    (io_end->pages[io_end->num_io_pages-1] != io_page))
-		goto submit_and_retry;
 	if (buffer_uninit(bh))
 		ext4_set_io_unwritten_flag(inode, io_end);
 	io->io_end->size += bh->b_size;
@@ -368,11 +352,6 @@ submit_and_retry:
 	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
 	if (ret != bh->b_size)
 		goto submit_and_retry;
-	if ((io_end->num_io_pages == 0) ||
-	    (io_end->pages[io_end->num_io_pages-1] != io_page)) {
-		io_end->pages[io_end->num_io_pages++] = io_page;
-		atomic_inc(&io_page->p_count);
-	}
 	return 0;
 }
 
@@ -382,33 +361,29 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 			struct writeback_control *wbc)
 {
 	struct inode *inode = page->mapping->host;
-	unsigned block_start, block_end, blocksize;
-	struct ext4_io_page *io_page;
+	unsigned block_start, blocksize;
 	struct buffer_head *bh, *head;
 	int ret = 0;
+	int nr_submitted = 0;
 
 	blocksize = 1 << inode->i_blkbits;
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(PageWriteback(page));
 
-	io_page = kmem_cache_alloc(io_page_cachep, GFP_NOFS);
-	if (!io_page) {
-		redirty_page_for_writepage(wbc, page);
-		unlock_page(page);
-		return -ENOMEM;
-	}
-	io_page->p_page = page;
-	atomic_set(&io_page->p_count, 1);
-	get_page(page);
 	set_page_writeback(page);
 	ClearPageError(page);
 
-	for (bh = head = page_buffers(page), block_start = 0;
-	     bh != head || !block_start;
-	     block_start = block_end, bh = bh->b_this_page) {
-
-		block_end = block_start + blocksize;
+	/*
+	 * In the first loop we prepare and mark buffers to submit. We have to
+	 * mark all buffers in the page before submitting so that
+	 * end_page_writeback() cannot be called from ext4_bio_end_io() when IO
+	 * on the first buffer finishes and we are still working on submitting
+	 * the second buffer.
+	 */
+	bh = head = page_buffers(page);
+	do {
+		block_start = bh_offset(bh);
 		if (block_start >= len) {
 			/*
 			 * Comments copied from block_write_full_page_endio:
@@ -421,7 +396,8 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 			 * mapped, and writes to that region are not written
 			 * out to the file."
 			 */
-			zero_user_segment(page, block_start, block_end);
+			zero_user_segment(page, block_start,
+					  block_start + blocksize);
 			clear_buffer_dirty(bh);
 			set_buffer_uptodate(bh);
 			continue;
@@ -435,7 +411,19 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 				ext4_io_submit(io);
 			continue;
 		}
-		ret = io_submit_add_bh(io, io_page, inode, wbc, bh);
+		if (buffer_new(bh)) {
+			clear_buffer_new(bh);
+			unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
+		}
+		set_buffer_async_write(bh);
+	} while ((bh = bh->b_this_page) != head);
+
+	/* Now submit buffers to write */
+	bh = head = page_buffers(page);
+	do {
+		if (!buffer_async_write(bh))
+			continue;
+		ret = io_submit_add_bh(io, inode, wbc, bh);
 		if (ret) {
 			/*
 			 * We only get here on ENOMEM.  Not much else
@@ -445,17 +433,20 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 			redirty_page_for_writepage(wbc, page);
 			break;
 		}
+		nr_submitted++;
 		clear_buffer_dirty(bh);
+	} while ((bh = bh->b_this_page) != head);
+
+	/* Error stopped previous loop? Clean up buffers... */
+	if (ret) {
+		do {
+			clear_buffer_async_write(bh);
+			bh = bh->b_this_page;
+		} while (bh != head);
 	}
 	unlock_page(page);
-	/*
-	 * If the page was truncated before we could do the writeback,
-	 * or we had a memory allocation error while trying to write
-	 * the first buffer head, we won't have submitted any pages for
-	 * I/O.  In that case we need to make sure we've cleared the
-	 * PageWriteback bit from the page to prevent the system from
-	 * wedging later on.
-	 */
-	put_io_page(io_page);
+	/* Nothing submitted - we have to end page writeback */
+	if (!nr_submitted)
+		end_page_writeback(page);
 	return ret;
 }
-- 
1.7.1



* [PATCH 02/29] ext4: Use io_end for multiple bios
  2013-04-08 21:32 [PATCH 00/29 v1] Fixes and improvements in ext4 writeback path Jan Kara
  2013-04-08 21:32 ` [PATCH 01/29] ext4: Make ext4_bio_write_page() use BH_Async_Write flags instead of page pointers from ext4_io_end Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-04-11  5:10   ` Dmitry Monakhov
                     ` (2 more replies)
  2013-04-08 21:32 ` [PATCH 03/29] ext4: Clear buffer_uninit flag when submitting IO Jan Kara
                   ` (26 subsequent siblings)
  28 siblings, 3 replies; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Change the writeback path to create just one io_end structure for the extent
to which we submit IO, and share it among all the bios writing that extent.
This prevents needless splitting and joining of unwritten extents when the
extent cannot be submitted as a single bio.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |    8 +++-
 fs/ext4/inode.c   |   85 ++++++++++++++++++++-----------------
 fs/ext4/page-io.c |  121 +++++++++++++++++++++++++++++++++--------------------
 3 files changed, 128 insertions(+), 86 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3c70547..edf9b9e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -207,6 +207,7 @@ typedef struct ext4_io_end {
 	ssize_t			size;		/* size of the extent */
 	struct kiocb		*iocb;		/* iocb struct for AIO */
 	int			result;		/* error value for AIO */
+	atomic_t		count;		/* reference counter */
 } ext4_io_end_t;
 
 struct ext4_io_submit {
@@ -2601,11 +2602,14 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
 
 /* page-io.c */
 extern int __init ext4_init_pageio(void);
-extern void ext4_add_complete_io(ext4_io_end_t *io_end);
 extern void ext4_exit_pageio(void);
 extern void ext4_ioend_wait(struct inode *);
-extern void ext4_free_io_end(ext4_io_end_t *io);
 extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
+extern ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end);
+extern int ext4_put_io_end(ext4_io_end_t *io_end);
+extern void ext4_put_io_end_defer(ext4_io_end_t *io_end);
+extern void ext4_io_submit_init(struct ext4_io_submit *io,
+				struct writeback_control *wbc);
 extern void ext4_end_io_work(struct work_struct *work);
 extern void ext4_io_submit(struct ext4_io_submit *io);
 extern int ext4_bio_write_page(struct ext4_io_submit *io,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6b8ec2a..ba07412 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1409,7 +1409,10 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
 	struct ext4_io_submit io_submit;
 
 	BUG_ON(mpd->next_page <= mpd->first_page);
-	memset(&io_submit, 0, sizeof(io_submit));
+	ext4_io_submit_init(&io_submit, mpd->wbc);
+	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
+	if (!io_submit.io_end)
+		return -ENOMEM;
 	/*
 	 * We need to start from the first_page to the next_page - 1
 	 * to make sure we also write the mapped dirty buffer_heads.
@@ -1497,6 +1500,8 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
 		pagevec_release(&pvec);
 	}
 	ext4_io_submit(&io_submit);
+	/* Drop io_end reference we got from init */
+	ext4_put_io_end_defer(io_submit.io_end);
 	return ret;
 }
 
@@ -2116,9 +2121,16 @@ static int ext4_writepage(struct page *page,
 		 */
 		return __ext4_journalled_writepage(page, len);
 
-	memset(&io_submit, 0, sizeof(io_submit));
+	ext4_io_submit_init(&io_submit, wbc);
+	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
+	if (!io_submit.io_end) {
+		redirty_page_for_writepage(wbc, page);
+		return -ENOMEM;
+	}
 	ret = ext4_bio_write_page(&io_submit, page, len, wbc);
 	ext4_io_submit(&io_submit);
+	/* Drop io_end reference we got from init */
+	ext4_put_io_end_defer(io_submit.io_end);
 	return ret;
 }
 
@@ -2957,9 +2969,13 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
 	struct inode *inode = file_inode(iocb->ki_filp);
         ext4_io_end_t *io_end = iocb->private;
 
-	/* if not async direct IO or dio with 0 bytes write, just return */
-	if (!io_end || !size)
-		goto out;
+	/* if not async direct IO just return */
+	if (!io_end) {
+		inode_dio_done(inode);
+		if (is_async)
+			aio_complete(iocb, ret, 0);
+		return;
+	}
 
 	ext_debug("ext4_end_io_dio(): io_end 0x%p "
 		  "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
@@ -2967,25 +2983,13 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
 		  size);
 
 	iocb->private = NULL;
-
-	/* if not aio dio with unwritten extents, just free io and return */
-	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
-		ext4_free_io_end(io_end);
-out:
-		inode_dio_done(inode);
-		if (is_async)
-			aio_complete(iocb, ret, 0);
-		return;
-	}
-
 	io_end->offset = offset;
 	io_end->size = size;
 	if (is_async) {
 		io_end->iocb = iocb;
 		io_end->result = ret;
 	}
-
-	ext4_add_complete_io(io_end);
+	ext4_put_io_end_defer(io_end);
 }
 
 /*
@@ -3019,6 +3023,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 	get_block_t *get_block_func = NULL;
 	int dio_flags = 0;
 	loff_t final_size = offset + count;
+	ext4_io_end_t *io_end = NULL;
 
 	/* Use the old path for reads and writes beyond i_size. */
 	if (rw != WRITE || final_size > inode->i_size)
@@ -3057,13 +3062,16 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 	iocb->private = NULL;
 	ext4_inode_aio_set(inode, NULL);
 	if (!is_sync_kiocb(iocb)) {
-		ext4_io_end_t *io_end = ext4_init_io_end(inode, GFP_NOFS);
+		io_end = ext4_init_io_end(inode, GFP_NOFS);
 		if (!io_end) {
 			ret = -ENOMEM;
 			goto retake_lock;
 		}
 		io_end->flag |= EXT4_IO_END_DIRECT;
-		iocb->private = io_end;
+		/*
+		 * Grab reference for DIO. Will be dropped in ext4_end_io_dio()
+		 */
+		iocb->private = ext4_get_io_end(io_end);
 		/*
 		 * we save the io structure for current async direct
 		 * IO, so that later ext4_map_blocks() could flag the
@@ -3087,26 +3095,27 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 				   NULL,
 				   dio_flags);
 
-	if (iocb->private)
-		ext4_inode_aio_set(inode, NULL);
 	/*
-	 * The io_end structure takes a reference to the inode, that
-	 * structure needs to be destroyed and the reference to the
-	 * inode need to be dropped, when IO is complete, even with 0
-	 * byte write, or failed.
-	 *
-	 * In the successful AIO DIO case, the io_end structure will
-	 * be destroyed and the reference to the inode will be dropped
-	 * after the end_io call back function is called.
-	 *
-	 * In the case there is 0 byte write, or error case, since VFS
-	 * direct IO won't invoke the end_io call back function, we
-	 * need to free the end_io structure here.
+	 * Put our reference to io_end. This can free the io_end structure e.g.
+	 * in sync IO case or in case of error. It can even perform extent
+	 * conversion if all bios we submitted finished before we got here.
+	 * Note that in that case iocb->private can be already set to NULL
+	 * here.
 	 */
-	if (ret != -EIOCBQUEUED && ret <= 0 && iocb->private) {
-		ext4_free_io_end(iocb->private);
-		iocb->private = NULL;
-	} else if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
+	if (io_end) {
+		ext4_inode_aio_set(inode, NULL);
+		ext4_put_io_end(io_end);
+		/*
+		 * In case of error or no write ext4_end_io_dio() was not
+		 * called so we have to put iocb's reference.
+		 */
+		if (ret <= 0 && ret != -EIOCBQUEUED) {
+			WARN_ON(iocb->private != io_end);
+			ext4_put_io_end(io_end);
+			iocb->private = NULL;
+		}
+	}
+	if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
 						EXT4_STATE_DIO_UNWRITTEN)) {
 		int err;
 		/*
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 73bc011..da8bddf 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -51,17 +51,28 @@ void ext4_ioend_wait(struct inode *inode)
 	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
 }
 
-void ext4_free_io_end(ext4_io_end_t *io)
+static void ext4_release_io_end(ext4_io_end_t *io_end)
 {
-	int i;
+	BUG_ON(!list_empty(&io_end->list));
+	BUG_ON(io_end->flag & EXT4_IO_END_UNWRITTEN);
+
+	if (atomic_dec_and_test(&EXT4_I(io_end->inode)->i_ioend_count))
+		wake_up_all(ext4_ioend_wq(io_end->inode));
+	if (io_end->flag & EXT4_IO_END_DIRECT)
+		inode_dio_done(io_end->inode);
+	if (io_end->iocb)
+		aio_complete(io_end->iocb, io_end->result, 0);
+	kmem_cache_free(io_end_cachep, io_end);
+}
 
-	BUG_ON(!io);
-	BUG_ON(!list_empty(&io->list));
-	BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
+static void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
+{
+	struct inode *inode = io_end->inode;
 
-	if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count))
-		wake_up_all(ext4_ioend_wq(io->inode));
-	kmem_cache_free(io_end_cachep, io);
+	io_end->flag &= ~EXT4_IO_END_UNWRITTEN;
+	/* Wake up anyone waiting on unwritten extent conversion */
+	if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
+		wake_up_all(ext4_ioend_wq(inode));
 }
 
 /* check a range of space and convert unwritten extents to written. */
@@ -84,13 +95,8 @@ static int ext4_end_io(ext4_io_end_t *io)
 			 "(inode %lu, offset %llu, size %zd, error %d)",
 			 inode->i_ino, offset, size, ret);
 	}
-	/* Wake up anyone waiting on unwritten extent conversion */
-	if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
-		wake_up_all(ext4_ioend_wq(inode));
-	if (io->flag & EXT4_IO_END_DIRECT)
-		inode_dio_done(inode);
-	if (io->iocb)
-		aio_complete(io->iocb, io->result, 0);
+	ext4_clear_io_unwritten_flag(io);
+	ext4_release_io_end(io);
 	return ret;
 }
 
@@ -121,7 +127,7 @@ static void dump_completed_IO(struct inode *inode)
 }
 
 /* Add the io_end to per-inode completed end_io list. */
-void ext4_add_complete_io(ext4_io_end_t *io_end)
+static void ext4_add_complete_io(ext4_io_end_t *io_end)
 {
 	struct ext4_inode_info *ei = EXT4_I(io_end->inode);
 	struct workqueue_struct *wq;
@@ -158,8 +164,6 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
 		err = ext4_end_io(io);
 		if (unlikely(!ret && err))
 			ret = err;
-		io->flag &= ~EXT4_IO_END_UNWRITTEN;
-		ext4_free_io_end(io);
 	}
 	return ret;
 }
@@ -191,10 +195,43 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
 		atomic_inc(&EXT4_I(inode)->i_ioend_count);
 		io->inode = inode;
 		INIT_LIST_HEAD(&io->list);
+		atomic_set(&io->count, 1);
 	}
 	return io;
 }
 
+void ext4_put_io_end_defer(ext4_io_end_t *io_end)
+{
+	if (atomic_dec_and_test(&io_end->count)) {
+		if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) || !io_end->size) {
+			ext4_release_io_end(io_end);
+			return;
+		}
+		ext4_add_complete_io(io_end);
+	}
+}
+
+int ext4_put_io_end(ext4_io_end_t *io_end)
+{
+	int err = 0;
+
+	if (atomic_dec_and_test(&io_end->count)) {
+		if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
+			err = ext4_convert_unwritten_extents(io_end->inode,
+						io_end->offset, io_end->size);
+			ext4_clear_io_unwritten_flag(io_end);
+		}
+		ext4_release_io_end(io_end);
+	}
+	return err;
+}
+
+ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end)
+{
+	atomic_inc(&io_end->count);
+	return io_end;
+}
+
 /*
  * Print an buffer I/O error compatible with the fs/buffer.c.  This
  * provides compatibility with dmesg scrapers that look for a specific
@@ -277,12 +314,7 @@ static void ext4_end_bio(struct bio *bio, int error)
 			     bi_sector >> (inode->i_blkbits - 9));
 	}
 
-	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
-		ext4_free_io_end(io_end);
-		return;
-	}
-
-	ext4_add_complete_io(io_end);
+	ext4_put_io_end_defer(io_end);
 }
 
 void ext4_io_submit(struct ext4_io_submit *io)
@@ -296,40 +328,37 @@ void ext4_io_submit(struct ext4_io_submit *io)
 		bio_put(io->io_bio);
 	}
 	io->io_bio = NULL;
-	io->io_op = 0;
+}
+
+void ext4_io_submit_init(struct ext4_io_submit *io,
+			 struct writeback_control *wbc)
+{
+	io->io_op = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
+	io->io_bio = NULL;
 	io->io_end = NULL;
 }
 
-static int io_submit_init(struct ext4_io_submit *io,
-			  struct inode *inode,
-			  struct writeback_control *wbc,
-			  struct buffer_head *bh)
+static int io_submit_init_bio(struct ext4_io_submit *io,
+			      struct buffer_head *bh)
 {
-	ext4_io_end_t *io_end;
-	struct page *page = bh->b_page;
 	int nvecs = bio_get_nr_vecs(bh->b_bdev);
 	struct bio *bio;
 
-	io_end = ext4_init_io_end(inode, GFP_NOFS);
-	if (!io_end)
-		return -ENOMEM;
 	bio = bio_alloc(GFP_NOIO, min(nvecs, BIO_MAX_PAGES));
 	bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_bdev = bh->b_bdev;
-	bio->bi_private = io->io_end = io_end;
 	bio->bi_end_io = ext4_end_bio;
-
-	io_end->offset = (page->index << PAGE_CACHE_SHIFT) + bh_offset(bh);
-
+	bio->bi_private = ext4_get_io_end(io->io_end);
+	if (!io->io_end->size)
+		io->io_end->offset = (bh->b_page->index << PAGE_CACHE_SHIFT)
+				     + bh_offset(bh);
 	io->io_bio = bio;
-	io->io_op = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
 	io->io_next_block = bh->b_blocknr;
 	return 0;
 }
 
 static int io_submit_add_bh(struct ext4_io_submit *io,
 			    struct inode *inode,
-			    struct writeback_control *wbc,
 			    struct buffer_head *bh)
 {
 	ext4_io_end_t *io_end;
@@ -340,18 +369,18 @@ submit_and_retry:
 		ext4_io_submit(io);
 	}
 	if (io->io_bio == NULL) {
-		ret = io_submit_init(io, inode, wbc, bh);
+		ret = io_submit_init_bio(io, bh);
 		if (ret)
 			return ret;
 	}
+	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
+	if (ret != bh->b_size)
+		goto submit_and_retry;
 	io_end = io->io_end;
 	if (buffer_uninit(bh))
 		ext4_set_io_unwritten_flag(inode, io_end);
-	io->io_end->size += bh->b_size;
+	io_end->size += bh->b_size;
 	io->io_next_block++;
-	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
-	if (ret != bh->b_size)
-		goto submit_and_retry;
 	return 0;
 }
 
@@ -423,7 +452,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	do {
 		if (!buffer_async_write(bh))
 			continue;
-		ret = io_submit_add_bh(io, inode, wbc, bh);
+		ret = io_submit_add_bh(io, inode, bh);
 		if (ret) {
 			/*
 			 * We only get here on ENOMEM.  Not much else
-- 
1.7.1



* [PATCH 03/29] ext4: Clear buffer_uninit flag when submitting IO
  2013-04-08 21:32 [PATCH 00/29 v1] Fixes and improvements in ext4 writeback path Jan Kara
  2013-04-08 21:32 ` [PATCH 01/29] ext4: Make ext4_bio_write_page() use BH_Async_Write flags instead of page pointers from ext4_io_end Jan Kara
  2013-04-08 21:32 ` [PATCH 02/29] ext4: Use io_end for multiple bios Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-04-11 14:08   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 04/29] jbd2: Reduce journal_head size Jan Kara
                   ` (25 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Currently nobody clears the buffer_uninit flag. This results in writeback
needlessly marking the io_end as needing extent conversion and in scanning
the extent tree for extents to convert. So clear the buffer_uninit flag once
the buffer is submitted for IO and the flag has been transformed into the
EXT4_IO_END_UNWRITTEN flag.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/page-io.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index da8bddf..efdf0a5 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -377,7 +377,7 @@ submit_and_retry:
 	if (ret != bh->b_size)
 		goto submit_and_retry;
 	io_end = io->io_end;
-	if (buffer_uninit(bh))
+	if (test_clear_buffer_uninit(bh))
 		ext4_set_io_unwritten_flag(inode, io_end);
 	io_end->size += bh->b_size;
 	io->io_next_block++;
-- 
1.7.1



* [PATCH 04/29] jbd2: Reduce journal_head size
  2013-04-08 21:32 [PATCH 00/29 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (2 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 03/29] ext4: Clear buffer_uninit flag when submitting IO Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-04-11 14:10   ` Zheng Liu
  2013-04-12  4:04   ` Theodore Ts'o
  2013-04-08 21:32 ` [PATCH 05/29] jbd2: Don't create journal_head for temporary journal buffers Jan Kara
                   ` (24 subsequent siblings)
  28 siblings, 2 replies; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Remove the unused t_cow_tid field (ext4 copy-on-write support doesn't seem
to be happening) and change b_modified and b_jlist to bitfields, thus saving
8 bytes in the structure.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/journal-head.h |   11 ++---------
 1 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h
index c18b46f..13a3da2 100644
--- a/include/linux/journal-head.h
+++ b/include/linux/journal-head.h
@@ -31,21 +31,14 @@ struct journal_head {
 	/*
 	 * Journalling list for this buffer [jbd_lock_bh_state()]
 	 */
-	unsigned b_jlist;
+	unsigned b_jlist:4;
 
 	/*
 	 * This flag signals the buffer has been modified by
 	 * the currently running transaction
 	 * [jbd_lock_bh_state()]
 	 */
-	unsigned b_modified;


* [PATCH 05/29] jbd2: Don't create journal_head for temporary journal buffers
  2013-04-08 21:32 [PATCH 00/29 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (3 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 04/29] jbd2: Reduce journal_head size Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-04-12  8:01   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 06/29] jbd2: Remove journal_head from descriptor buffers Jan Kara
                   ` (23 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

When writing metadata to the journal, we create temporary buffer heads for
that task. We also attach journal heads to these buffer heads, but the only
purpose of the journal heads is to keep the buffers linked in the
transaction's BJ_IO list. We remove the need for journal heads by reusing
the buffer head's b_assoc_buffers list for that purpose. Also, since the
BJ_IO list is just a temporary list for transaction commit, we use a private
list in jbd2_journal_commit_transaction() instead, thus removing the BJ_IO
list from the transaction completely.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c  |    1 -
 fs/jbd2/commit.c      |   65 ++++++++++++++++--------------------------------
 fs/jbd2/journal.c     |   36 ++++++++++----------------
 fs/jbd2/transaction.c |   14 +++-------
 include/linux/jbd2.h  |   32 ++++++++++++------------
 5 files changed, 56 insertions(+), 92 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index c78841e..2735fef 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -690,7 +690,6 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact
 	J_ASSERT(transaction->t_state == T_FINISHED);
 	J_ASSERT(transaction->t_buffers == NULL);
 	J_ASSERT(transaction->t_forget == NULL);
-	J_ASSERT(transaction->t_iobuf_list == NULL);
 	J_ASSERT(transaction->t_shadow_list == NULL);
 	J_ASSERT(transaction->t_log_list == NULL);
 	J_ASSERT(transaction->t_checkpoint_list == NULL);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 750c701..c503df6 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -368,7 +368,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 {
 	struct transaction_stats_s stats;
 	transaction_t *commit_transaction;
-	struct journal_head *jh, *new_jh, *descriptor;
+	struct journal_head *jh, *descriptor;
 	struct buffer_head **wbuf = journal->j_wbuf;
 	int bufs;
 	int flags;
@@ -392,6 +392,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	tid_t first_tid;
 	int update_tail;
 	int csum_size = 0;
+	LIST_HEAD(io_bufs);
 
 	if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
 		csum_size = sizeof(struct jbd2_journal_block_tail);
@@ -658,29 +659,22 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 
 		/* Bump b_count to prevent truncate from stumbling over
                    the shadowed buffer!  @@@ This can go if we ever get
-                   rid of the BJ_IO/BJ_Shadow pairing of buffers. */
+                   rid of the shadow pairing of buffers. */
 		atomic_inc(&jh2bh(jh)->b_count);
 
-		/* Make a temporary IO buffer with which to write it out
-                   (this will requeue both the metadata buffer and the
-                   temporary IO buffer). new_bh goes on BJ_IO*/
-
-		set_bit(BH_JWrite, &jh2bh(jh)->b_state);
 		/*
-		 * akpm: jbd2_journal_write_metadata_buffer() sets
-		 * new_bh->b_transaction to commit_transaction.
-		 * We need to clean this up before we release new_bh
-		 * (which is of type BJ_IO)
+		 * Make a temporary IO buffer with which to write it out
+		 * (this will requeue the metadata buffer to BJ_Shadow).
 		 */
+		set_bit(BH_JWrite, &jh2bh(jh)->b_state);
 		JBUFFER_TRACE(jh, "ph3: write metadata");
 		flags = jbd2_journal_write_metadata_buffer(commit_transaction,
-						      jh, &new_jh, blocknr);
+						jh, &wbuf[bufs], blocknr);
 		if (flags < 0) {
 			jbd2_journal_abort(journal, flags);
 			continue;
 		}
-		set_bit(BH_JWrite, &jh2bh(new_jh)->b_state);
-		wbuf[bufs++] = jh2bh(new_jh);
+		jbd2_file_log_bh(&io_bufs, wbuf[bufs]);
 
 		/* Record the new block's tag in the current descriptor
                    buffer */
@@ -694,10 +688,11 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		tag = (journal_block_tag_t *) tagp;
 		write_tag_block(tag_bytes, tag, jh2bh(jh)->b_blocknr);
 		tag->t_flags = cpu_to_be16(tag_flag);
-		jbd2_block_tag_csum_set(journal, tag, jh2bh(new_jh),
+		jbd2_block_tag_csum_set(journal, tag, wbuf[bufs],
 					commit_transaction->t_tid);
 		tagp += tag_bytes;
 		space_left -= tag_bytes;
+		bufs++;
 
 		if (first_tag) {
 			memcpy (tagp, journal->j_uuid, 16);
@@ -809,7 +804,7 @@ start_journal_io:
            the log.  Before we can commit it, wait for the IO so far to
            complete.  Control buffers being written are on the
            transaction's t_log_list queue, and metadata buffers are on
-           the t_iobuf_list queue.
+           the io_bufs list.
 
 	   Wait for the buffers in reverse order.  That way we are
 	   less likely to be woken up until all IOs have completed, and
@@ -818,46 +813,31 @@ start_journal_io:
 
 	jbd_debug(3, "JBD2: commit phase 3\n");
 
-	/*
-	 * akpm: these are BJ_IO, and j_list_lock is not needed.
-	 * See __journal_try_to_free_buffer.
-	 */
-wait_for_iobuf:
-	while (commit_transaction->t_iobuf_list != NULL) {
-		struct buffer_head *bh;
+	while (!list_empty(&io_bufs)) {
+		struct buffer_head *bh = list_entry(io_bufs.prev,
+						    struct buffer_head,
+						    b_assoc_buffers);
 
-		jh = commit_transaction->t_iobuf_list->b_tprev;
-		bh = jh2bh(jh);
-		if (buffer_locked(bh)) {
-			wait_on_buffer(bh);
-			goto wait_for_iobuf;
-		}
-		if (cond_resched())
-			goto wait_for_iobuf;
+		wait_on_buffer(bh);
+		cond_resched();
 
 		if (unlikely(!buffer_uptodate(bh)))
 			err = -EIO;
-
-		clear_buffer_jwrite(bh);
-
-		JBUFFER_TRACE(jh, "ph4: unfile after journal write");
-		jbd2_journal_unfile_buffer(journal, jh);
+		jbd2_unfile_log_bh(bh);
 
 		/*
-		 * ->t_iobuf_list should contain only dummy buffer_heads
-		 * which were created by jbd2_journal_write_metadata_buffer().
+		 * The list contains temporary buffer heads created by
+		 * jbd2_journal_write_metadata_buffer().
 		 */
 		BUFFER_TRACE(bh, "dumping temporary bh");
-		jbd2_journal_put_journal_head(jh);
 		__brelse(bh);
 		J_ASSERT_BH(bh, atomic_read(&bh->b_count) == 0);
 		free_buffer_head(bh);
 
-		/* We also have to unlock and free the corresponding
-                   shadowed buffer */
+		/* We also have to refile the corresponding shadowed buffer */
 		jh = commit_transaction->t_shadow_list->b_tprev;
 		bh = jh2bh(jh);
-		clear_bit(BH_JWrite, &bh->b_state);
+		clear_buffer_jwrite(bh);
 		J_ASSERT_BH(bh, buffer_jbddirty(bh));
 
 		/* The metadata is now released for reuse, but we need
@@ -952,7 +932,6 @@ wait_for_iobuf:
 	J_ASSERT(list_empty(&commit_transaction->t_inode_list));
 	J_ASSERT(commit_transaction->t_buffers == NULL);
 	J_ASSERT(commit_transaction->t_checkpoint_list == NULL);
-	J_ASSERT(commit_transaction->t_iobuf_list == NULL);
 	J_ASSERT(commit_transaction->t_shadow_list == NULL);
 	J_ASSERT(commit_transaction->t_log_list == NULL);
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index ed10991..eb6272b 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -310,14 +310,12 @@ static void journal_kill_thread(journal_t *journal)
  *
  * If the source buffer has already been modified by a new transaction
  * since we took the last commit snapshot, we use the frozen copy of
- * that data for IO.  If we end up using the existing buffer_head's data
- * for the write, then we *have* to lock the buffer to prevent anyone
- * else from using and possibly modifying it while the IO is in
- * progress.
+ * that data for IO. If we end up using the existing buffer_head's data
+ * for the write, then we have to make sure nobody modifies it while the
+ * IO is in progress. do_get_write_access() handles this.
  *
- * The function returns a pointer to the buffer_heads to be used for IO.
- *
- * We assume that the journal has already been locked in this function.
+ * The function returns a pointer to the buffer_head to be used for IO.
+ *
  *
  * Return value:
  *  <0: Error
@@ -330,15 +328,14 @@ static void journal_kill_thread(journal_t *journal)
 
 int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 				  struct journal_head  *jh_in,
-				  struct journal_head **jh_out,
-				  unsigned long long blocknr)
+				  struct buffer_head **bh_out,
+				  sector_t blocknr)
 {
 	int need_copy_out = 0;
 	int done_copy_out = 0;
 	int do_escape = 0;
 	char *mapped_data;
 	struct buffer_head *new_bh;
-	struct journal_head *new_jh;
 	struct page *new_page;
 	unsigned int new_offset;
 	struct buffer_head *bh_in = jh2bh(jh_in);
@@ -370,14 +367,13 @@ retry_alloc:
 	new_bh->b_state = 0;
 	init_buffer(new_bh, NULL, NULL);
 	atomic_set(&new_bh->b_count, 1);
-	new_jh = jbd2_journal_add_journal_head(new_bh);	/* This sleeps */
 
+	jbd_lock_bh_state(bh_in);
+repeat:
 	/*
 	 * If a new transaction has already done a buffer copy-out, then
 	 * we use that version of the data for the commit.
 	 */
-	jbd_lock_bh_state(bh_in);
-repeat:
 	if (jh_in->b_frozen_data) {
 		done_copy_out = 1;
 		new_page = virt_to_page(jh_in->b_frozen_data);
@@ -417,7 +413,7 @@ repeat:
 		jbd_unlock_bh_state(bh_in);
 		tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
 		if (!tmp) {
-			jbd2_journal_put_journal_head(new_jh);
+			brelse(new_bh);
 			return -ENOMEM;
 		}
 		jbd_lock_bh_state(bh_in);
@@ -428,7 +424,7 @@ repeat:
 
 		jh_in->b_frozen_data = tmp;
 		mapped_data = kmap_atomic(new_page);
-		memcpy(tmp, mapped_data + new_offset, jh2bh(jh_in)->b_size);
+		memcpy(tmp, mapped_data + new_offset, bh_in->b_size);
 		kunmap_atomic(mapped_data);
 
 		new_page = virt_to_page(tmp);
@@ -454,14 +450,13 @@ repeat:
 	}
 
 	set_bh_page(new_bh, new_page, new_offset);
-	new_jh->b_transaction = NULL;
-	new_bh->b_size = jh2bh(jh_in)->b_size;
-	new_bh->b_bdev = transaction->t_journal->j_dev;
+	new_bh->b_size = bh_in->b_size;
+	new_bh->b_bdev = journal->j_dev;
 	new_bh->b_blocknr = blocknr;
 	set_buffer_mapped(new_bh);
 	set_buffer_dirty(new_bh);
 
-	*jh_out = new_jh;
+	*bh_out = new_bh;
 
 	/*
 	 * The to-be-written buffer needs to get moved to the io queue,
@@ -474,9 +469,6 @@ repeat:
 	spin_unlock(&journal->j_list_lock);
 	jbd_unlock_bh_state(bh_in);
 
-	JBUFFER_TRACE(new_jh, "file as BJ_IO");
-	jbd2_journal_file_buffer(new_jh, transaction, BJ_IO);
-
 	return do_escape | (done_copy_out << 1);
 }
 
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index d6ee5ae..3be34c7 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1589,10 +1589,10 @@ __blist_del_buffer(struct journal_head **list, struct journal_head *jh)
  * Remove a buffer from the appropriate transaction list.
  *
  * Note that this function can *change* the value of
- * bh->b_transaction->t_buffers, t_forget, t_iobuf_list, t_shadow_list,
- * t_log_list or t_reserved_list.  If the caller is holding onto a copy of one
- * of these pointers, it could go bad.  Generally the caller needs to re-read
- * the pointer from the transaction_t.
+ * bh->b_transaction->t_buffers, t_forget, t_shadow_list, t_log_list or
+ * t_reserved_list.  If the caller is holding onto a copy of one of these
+ * pointers, it could go bad.  Generally the caller needs to re-read the
+ * pointer from the transaction_t.
  *
  * Called under j_list_lock.
  */
@@ -1622,9 +1622,6 @@ static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
 	case BJ_Forget:
 		list = &transaction->t_forget;
 		break;
-	case BJ_IO:
-		list = &transaction->t_iobuf_list;
-		break;
 	case BJ_Shadow:
 		list = &transaction->t_shadow_list;
 		break;
@@ -2126,9 +2123,6 @@ void __jbd2_journal_file_buffer(struct journal_head *jh,
 	case BJ_Forget:
 		list = &transaction->t_forget;
 		break;
-	case BJ_IO:
-		list = &transaction->t_iobuf_list;
-		break;
 	case BJ_Shadow:
 		list = &transaction->t_shadow_list;
 		break;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 50e5a5e..a670595 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -523,12 +523,6 @@ struct transaction_s
 	struct journal_head	*t_checkpoint_io_list;
 
 	/*
-	 * Doubly-linked circular list of temporary buffers currently undergoing
-	 * IO in the log [j_list_lock]
-	 */
-	struct journal_head	*t_iobuf_list;
-
-	/*
 	 * Doubly-linked circular list of metadata buffers being shadowed by log
 	 * IO.  The IO buffers on the iobuf list and the shadow buffers on this
 	 * list match each other one for one at all times. [j_list_lock]
@@ -990,6 +984,14 @@ extern void __jbd2_journal_file_buffer(struct journal_head *, transaction_t *, i
 extern void __journal_free_buffer(struct journal_head *bh);
 extern void jbd2_journal_file_buffer(struct journal_head *, transaction_t *, int);
 extern void __journal_clean_data_list(transaction_t *transaction);
+static inline void jbd2_file_log_bh(struct list_head *head, struct buffer_head *bh)
+{
+	list_add_tail(&bh->b_assoc_buffers, head);
+}
+static inline void jbd2_unfile_log_bh(struct buffer_head *bh)
+{
+	list_del_init(&bh->b_assoc_buffers);
+}
 
 /* Log buffer allocation */
 extern struct journal_head * jbd2_journal_get_descriptor_buffer(journal_t *);
@@ -1038,11 +1040,10 @@ extern void jbd2_buffer_abort_trigger(struct journal_head *jh,
 				      struct jbd2_buffer_trigger_type *triggers);
 
 /* Buffer IO */
-extern int
-jbd2_journal_write_metadata_buffer(transaction_t	  *transaction,
-			      struct journal_head  *jh_in,
-			      struct journal_head **jh_out,
-			      unsigned long long   blocknr);
+extern int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
+					      struct journal_head *jh_in,
+					      struct buffer_head **bh_out,
+					      sector_t blocknr);
 
 /* Transaction locking */
 extern void		__wait_on_journal (journal_t *);
@@ -1284,11 +1285,10 @@ static inline int jbd_space_needed(journal_t *journal)
 #define BJ_None		0	/* Not journaled */
 #define BJ_Metadata	1	/* Normal journaled metadata */
 #define BJ_Forget	2	/* Buffer superseded by this transaction */
-#define BJ_IO		3	/* Buffer is for temporary IO use */
-#define BJ_Shadow	4	/* Buffer contents being shadowed to the log */
-#define BJ_LogCtl	5	/* Buffer contains log descriptors */
-#define BJ_Reserved	6	/* Buffer is reserved for access by journal */
-#define BJ_Types	7
+#define BJ_Shadow	3	/* Buffer contents being shadowed to the log */
+#define BJ_LogCtl	4	/* Buffer contains log descriptors */
+#define BJ_Reserved	5	/* Buffer is reserved for access by journal */
+#define BJ_Types	6
 
 extern int jbd_blocks_per_page(struct inode *inode);
 
-- 
1.7.1



* [PATCH 06/29] jbd2: Remove journal_head from descriptor buffers
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (4 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 05/29] jbd2: Don't create journal_head for temporary journal buffers Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-04-12  8:10   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 07/29] jbd2: Refine waiting for shadow buffers Jan Kara
                   ` (22 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Similarly to metadata buffers, log descriptor buffers don't really need
a journal head either. So strip it and remove the BJ_LogCtl list.
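One detail this conversion touches repeatedly is the checksum-tail
placement in jbd2_descr_block_csum_set(): the tail occupies the last
bytes of the block, its checksum field is zeroed, and then the checksum
of the whole block is stored into it. A userspace sketch under stated
assumptions: toy_csum below is a stand-in for the crc32c that
jbd2_chksum() really computes, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct toy_block_tail {			/* stands in for jbd2_journal_block_tail */
	uint32_t t_checksum;
};

/* toy checksum, NOT crc32c -- just keeps the sketch self-contained */
static uint32_t toy_csum(const unsigned char *data, size_t len)
{
	uint32_t c = 0;

	for (size_t i = 0; i < len; i++)
		c = c * 31 + data[i];
	return c;
}

/*
 * mirrors jbd2_descr_block_csum_set(): zero the field, checksum the
 * whole block, then store the result in the tail
 */
static void descr_csum_set(unsigned char *b_data, size_t blocksize)
{
	struct toy_block_tail *tail = (struct toy_block_tail *)
		(b_data + blocksize - sizeof(*tail));

	tail->t_checksum = 0;
	tail->t_checksum = toy_csum(b_data, blocksize);
}

/* verify side: recompute with the field zeroed (blocksize <= 4096 here) */
static int descr_csum_verify(const unsigned char *b_data, size_t blocksize)
{
	unsigned char copy[4096];
	struct toy_block_tail *tail;
	uint32_t stored;

	memcpy(copy, b_data, blocksize);
	tail = (struct toy_block_tail *)(copy + blocksize - sizeof(*tail));
	stored = tail->t_checksum;
	tail->t_checksum = 0;
	return stored == toy_csum(copy, blocksize);
}
```

Zeroing the field before checksumming is what lets the verifier recompute
over the same bytes the writer saw; the pattern is identical for the
descriptor, commit, and revoke block tails in the series.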

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c  |    1 -
 fs/jbd2/commit.c      |   78 +++++++++++++++++++-----------------------------
 fs/jbd2/journal.c     |    4 +-
 fs/jbd2/revoke.c      |   49 +++++++++++++++---------------
 fs/jbd2/transaction.c |    6 ----
 include/linux/jbd2.h  |   19 ++++-------
 6 files changed, 64 insertions(+), 93 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 2735fef..65ec076 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -691,7 +691,6 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact
 	J_ASSERT(transaction->t_buffers == NULL);
 	J_ASSERT(transaction->t_forget == NULL);
 	J_ASSERT(transaction->t_shadow_list == NULL);
-	J_ASSERT(transaction->t_log_list == NULL);
 	J_ASSERT(transaction->t_checkpoint_list == NULL);
 	J_ASSERT(transaction->t_checkpoint_io_list == NULL);
 	J_ASSERT(atomic_read(&transaction->t_updates) == 0);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index c503df6..1a03762 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -85,8 +85,7 @@ nope:
 	__brelse(bh);
 }
 
-static void jbd2_commit_block_csum_set(journal_t *j,
-				       struct journal_head *descriptor)
+static void jbd2_commit_block_csum_set(journal_t *j, struct buffer_head *bh)
 {
 	struct commit_header *h;
 	__u32 csum;
@@ -94,12 +93,11 @@ static void jbd2_commit_block_csum_set(journal_t *j,
 	if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
 		return;
 
-	h = (struct commit_header *)(jh2bh(descriptor)->b_data);
+	h = (struct commit_header *)(bh->b_data);
 	h->h_chksum_type = 0;
 	h->h_chksum_size = 0;
 	h->h_chksum[0] = 0;
-	csum = jbd2_chksum(j, j->j_csum_seed, jh2bh(descriptor)->b_data,
-			   j->j_blocksize);
+	csum = jbd2_chksum(j, j->j_csum_seed, bh->b_data, j->j_blocksize);
 	h->h_chksum[0] = cpu_to_be32(csum);
 }
 
@@ -116,7 +114,6 @@ static int journal_submit_commit_record(journal_t *journal,
 					struct buffer_head **cbh,
 					__u32 crc32_sum)
 {
-	struct journal_head *descriptor;
 	struct commit_header *tmp;
 	struct buffer_head *bh;
 	int ret;
@@ -127,12 +124,10 @@ static int journal_submit_commit_record(journal_t *journal,
 	if (is_journal_aborted(journal))
 		return 0;
 
-	descriptor = jbd2_journal_get_descriptor_buffer(journal);
-	if (!descriptor)
+	bh = jbd2_journal_get_descriptor_buffer(journal);
+	if (!bh)
 		return 1;
 
-	bh = jh2bh(descriptor);
-
 	tmp = (struct commit_header *)bh->b_data;
 	tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
 	tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
@@ -146,9 +141,9 @@ static int journal_submit_commit_record(journal_t *journal,
 		tmp->h_chksum_size 	= JBD2_CRC32_CHKSUM_SIZE;
 		tmp->h_chksum[0] 	= cpu_to_be32(crc32_sum);
 	}
-	jbd2_commit_block_csum_set(journal, descriptor);
+	jbd2_commit_block_csum_set(journal, bh);
 
-	JBUFFER_TRACE(descriptor, "submit commit block");
+	BUFFER_TRACE(bh, "submit commit block");
 	lock_buffer(bh);
 	clear_buffer_dirty(bh);
 	set_buffer_uptodate(bh);
@@ -180,7 +175,6 @@ static int journal_wait_on_commit_record(journal_t *journal,
 	if (unlikely(!buffer_uptodate(bh)))
 		ret = -EIO;
 	put_bh(bh);            /* One for getblk() */
-	jbd2_journal_put_journal_head(bh2jh(bh));
 
 	return ret;
 }
@@ -321,7 +315,7 @@ static void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
 }
 
 static void jbd2_descr_block_csum_set(journal_t *j,
-				      struct journal_head *descriptor)
+				      struct buffer_head *bh)
 {
 	struct jbd2_journal_block_tail *tail;
 	__u32 csum;
@@ -329,12 +323,10 @@ static void jbd2_descr_block_csum_set(journal_t *j,
 	if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
 		return;
 
-	tail = (struct jbd2_journal_block_tail *)
-			(jh2bh(descriptor)->b_data + j->j_blocksize -
+	tail = (struct jbd2_journal_block_tail *)(bh->b_data + j->j_blocksize -
 			sizeof(struct jbd2_journal_block_tail));
 	tail->t_checksum = 0;
-	csum = jbd2_chksum(j, j->j_csum_seed, jh2bh(descriptor)->b_data,
-			   j->j_blocksize);
+	csum = jbd2_chksum(j, j->j_csum_seed, bh->b_data, j->j_blocksize);
 	tail->t_checksum = cpu_to_be32(csum);
 }
 
@@ -368,7 +360,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 {
 	struct transaction_stats_s stats;
 	transaction_t *commit_transaction;
-	struct journal_head *jh, *descriptor;
+	struct journal_head *jh;
+	struct buffer_head *descriptor;
 	struct buffer_head **wbuf = journal->j_wbuf;
 	int bufs;
 	int flags;
@@ -393,6 +386,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	int update_tail;
 	int csum_size = 0;
 	LIST_HEAD(io_bufs);
+	LIST_HEAD(log_bufs);
 
 	if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
 		csum_size = sizeof(struct jbd2_journal_block_tail);
@@ -546,7 +540,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 
 	blk_start_plug(&plug);
 	jbd2_journal_write_revoke_records(journal, commit_transaction,
-					  WRITE_SYNC);
+					  &log_bufs, WRITE_SYNC);
 	blk_finish_plug(&plug);
 
 	jbd_debug(3, "JBD2: commit phase 2\n");
@@ -572,8 +566,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		 atomic_read(&commit_transaction->t_outstanding_credits));
 
 	err = 0;
-	descriptor = NULL;
 	bufs = 0;
+	descriptor = NULL;
 	blk_start_plug(&plug);
 	while (commit_transaction->t_buffers) {
 
@@ -605,8 +599,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		   record the metadata buffer. */
 
 		if (!descriptor) {
-			struct buffer_head *bh;
-
 			J_ASSERT (bufs == 0);
 
 			jbd_debug(4, "JBD2: get descriptor\n");
@@ -617,26 +609,26 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 				continue;
 			}
 
-			bh = jh2bh(descriptor);
 			jbd_debug(4, "JBD2: got buffer %llu (%p)\n",
-				(unsigned long long)bh->b_blocknr, bh->b_data);
-			header = (journal_header_t *)&bh->b_data[0];
+				(unsigned long long)descriptor->b_blocknr,
+				descriptor->b_data);
+			header = (journal_header_t *)descriptor->b_data;
 			header->h_magic     = cpu_to_be32(JBD2_MAGIC_NUMBER);
 			header->h_blocktype = cpu_to_be32(JBD2_DESCRIPTOR_BLOCK);
 			header->h_sequence  = cpu_to_be32(commit_transaction->t_tid);
 
-			tagp = &bh->b_data[sizeof(journal_header_t)];
-			space_left = bh->b_size - sizeof(journal_header_t);
+			tagp = &descriptor->b_data[sizeof(journal_header_t)];
+			space_left = descriptor->b_size -
+						sizeof(journal_header_t);
 			first_tag = 1;
-			set_buffer_jwrite(bh);
-			set_buffer_dirty(bh);
-			wbuf[bufs++] = bh;
+			set_buffer_jwrite(descriptor);
+			set_buffer_dirty(descriptor);
+			wbuf[bufs++] = descriptor;
 
 			/* Record it so that we can wait for IO
                            completion later */
-			BUFFER_TRACE(bh, "ph3: file as descriptor");
-			jbd2_journal_file_buffer(descriptor, commit_transaction,
-					BJ_LogCtl);
+			BUFFER_TRACE(descriptor, "ph3: file as descriptor");
+			jbd2_file_log_bh(&log_bufs, descriptor);
 		}
 
 		/* Where is the buffer to be written? */
@@ -863,26 +855,19 @@ start_journal_io:
 	jbd_debug(3, "JBD2: commit phase 4\n");
 
 	/* Here we wait for the revoke record and descriptor record buffers */
- wait_for_ctlbuf:
-	while (commit_transaction->t_log_list != NULL) {
+	while (!list_empty(&log_bufs)) {
 		struct buffer_head *bh;
 
-		jh = commit_transaction->t_log_list->b_tprev;
-		bh = jh2bh(jh);
-		if (buffer_locked(bh)) {
-			wait_on_buffer(bh);
-			goto wait_for_ctlbuf;
-		}
-		if (cond_resched())
-			goto wait_for_ctlbuf;
+		bh = list_entry(log_bufs.prev, struct buffer_head, b_assoc_buffers);
+		wait_on_buffer(bh);
+		cond_resched();
 
 		if (unlikely(!buffer_uptodate(bh)))
 			err = -EIO;
 
 		BUFFER_TRACE(bh, "ph5: control buffer writeout done: unfile");
 		clear_buffer_jwrite(bh);
-		jbd2_journal_unfile_buffer(journal, jh);
-		jbd2_journal_put_journal_head(jh);
+		jbd2_unfile_log_bh(bh);
 		__brelse(bh);		/* One for getblk */
 		/* AKPM: bforget here */
 	}
@@ -933,7 +918,6 @@ start_journal_io:
 	J_ASSERT(commit_transaction->t_buffers == NULL);
 	J_ASSERT(commit_transaction->t_checkpoint_list == NULL);
 	J_ASSERT(commit_transaction->t_shadow_list == NULL);
-	J_ASSERT(commit_transaction->t_log_list == NULL);
 
 restart_loop:
 	/*
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index eb6272b..e03aae0 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -761,7 +761,7 @@ int jbd2_journal_bmap(journal_t *journal, unsigned long blocknr,
  * But we don't bother doing that, so there will be coherency problems with
  * mmaps of blockdevs which hold live JBD-controlled filesystems.
  */
-struct journal_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
+struct buffer_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
 {
 	struct buffer_head *bh;
 	unsigned long long blocknr;
@@ -780,7 +780,7 @@ struct journal_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
 	set_buffer_uptodate(bh);
 	unlock_buffer(bh);
 	BUFFER_TRACE(bh, "return this buffer");
-	return jbd2_journal_add_journal_head(bh);
+	return bh;
 }
 
 /*
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index f30b80b..198c9c1 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -122,9 +122,10 @@ struct jbd2_revoke_table_s
 
 #ifdef __KERNEL__
 static void write_one_revoke_record(journal_t *, transaction_t *,
-				    struct journal_head **, int *,
+				    struct list_head *,
+				    struct buffer_head **, int *,
 				    struct jbd2_revoke_record_s *, int);
-static void flush_descriptor(journal_t *, struct journal_head *, int, int);
+static void flush_descriptor(journal_t *, struct buffer_head *, int, int);
 #endif
 
 /* Utility functions to maintain the revoke table */
@@ -531,9 +532,10 @@ void jbd2_journal_switch_revoke_table(journal_t *journal)
  */
 void jbd2_journal_write_revoke_records(journal_t *journal,
 				       transaction_t *transaction,
+				       struct list_head *log_bufs,
 				       int write_op)
 {
-	struct journal_head *descriptor;
+	struct buffer_head *descriptor;
 	struct jbd2_revoke_record_s *record;
 	struct jbd2_revoke_table_s *revoke;
 	struct list_head *hash_list;
@@ -553,7 +555,7 @@ void jbd2_journal_write_revoke_records(journal_t *journal,
 		while (!list_empty(hash_list)) {
 			record = (struct jbd2_revoke_record_s *)
 				hash_list->next;
-			write_one_revoke_record(journal, transaction,
+			write_one_revoke_record(journal, transaction, log_bufs,
 						&descriptor, &offset,
 						record, write_op);
 			count++;
@@ -573,13 +575,14 @@ void jbd2_journal_write_revoke_records(journal_t *journal,
 
 static void write_one_revoke_record(journal_t *journal,
 				    transaction_t *transaction,
-				    struct journal_head **descriptorp,
+				    struct list_head *log_bufs,
+				    struct buffer_head **descriptorp,
 				    int *offsetp,
 				    struct jbd2_revoke_record_s *record,
 				    int write_op)
 {
 	int csum_size = 0;
-	struct journal_head *descriptor;
+	struct buffer_head *descriptor;
 	int offset;
 	journal_header_t *header;
 
@@ -609,26 +612,26 @@ static void write_one_revoke_record(journal_t *journal,
 		descriptor = jbd2_journal_get_descriptor_buffer(journal);
 		if (!descriptor)
 			return;
-		header = (journal_header_t *) &jh2bh(descriptor)->b_data[0];
+		header = (journal_header_t *)descriptor->b_data;
 		header->h_magic     = cpu_to_be32(JBD2_MAGIC_NUMBER);
 		header->h_blocktype = cpu_to_be32(JBD2_REVOKE_BLOCK);
 		header->h_sequence  = cpu_to_be32(transaction->t_tid);
 
 		/* Record it so that we can wait for IO completion later */
-		JBUFFER_TRACE(descriptor, "file as BJ_LogCtl");
-		jbd2_journal_file_buffer(descriptor, transaction, BJ_LogCtl);
+		BUFFER_TRACE(descriptor, "file in log_bufs");
+		jbd2_file_log_bh(log_bufs, descriptor);
 
 		offset = sizeof(jbd2_journal_revoke_header_t);
 		*descriptorp = descriptor;
 	}
 
 	if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT)) {
-		* ((__be64 *)(&jh2bh(descriptor)->b_data[offset])) =
+		* ((__be64 *)(&descriptor->b_data[offset])) =
 			cpu_to_be64(record->blocknr);
 		offset += 8;
 
 	} else {
-		* ((__be32 *)(&jh2bh(descriptor)->b_data[offset])) =
+		* ((__be32 *)(&descriptor->b_data[offset])) =
 			cpu_to_be32(record->blocknr);
 		offset += 4;
 	}
@@ -636,8 +639,7 @@ static void write_one_revoke_record(journal_t *journal,
 	*offsetp = offset;
 }
 
-static void jbd2_revoke_csum_set(journal_t *j,
-				 struct journal_head *descriptor)
+static void jbd2_revoke_csum_set(journal_t *j, struct buffer_head *bh)
 {
 	struct jbd2_journal_revoke_tail *tail;
 	__u32 csum;
@@ -645,12 +647,10 @@ static void jbd2_revoke_csum_set(journal_t *j,
 	if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
 		return;
 
-	tail = (struct jbd2_journal_revoke_tail *)
-			(jh2bh(descriptor)->b_data + j->j_blocksize -
+	tail = (struct jbd2_journal_revoke_tail *)(bh->b_data + j->j_blocksize -
 			sizeof(struct jbd2_journal_revoke_tail));
 	tail->r_checksum = 0;
-	csum = jbd2_chksum(j, j->j_csum_seed, jh2bh(descriptor)->b_data,
-			   j->j_blocksize);
+	csum = jbd2_chksum(j, j->j_csum_seed, bh->b_data, j->j_blocksize);
 	tail->r_checksum = cpu_to_be32(csum);
 }
 
@@ -662,25 +662,24 @@ static void jbd2_revoke_csum_set(journal_t *j,
  */
 
 static void flush_descriptor(journal_t *journal,
-			     struct journal_head *descriptor,
+			     struct buffer_head *descriptor,
 			     int offset, int write_op)
 {
 	jbd2_journal_revoke_header_t *header;
-	struct buffer_head *bh = jh2bh(descriptor);
 
 	if (is_journal_aborted(journal)) {
-		put_bh(bh);
+		put_bh(descriptor);
 		return;
 	}
 
-	header = (jbd2_journal_revoke_header_t *) jh2bh(descriptor)->b_data;
+	header = (jbd2_journal_revoke_header_t *)descriptor->b_data;
 	header->r_count = cpu_to_be32(offset);
 	jbd2_revoke_csum_set(journal, descriptor);
 
-	set_buffer_jwrite(bh);
-	BUFFER_TRACE(bh, "write");
-	set_buffer_dirty(bh);
-	write_dirty_buffer(bh, write_op);
+	set_buffer_jwrite(descriptor);
+	BUFFER_TRACE(descriptor, "write");
+	set_buffer_dirty(descriptor);
+	write_dirty_buffer(descriptor, write_op);
 }
 #endif
 
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 3be34c7..bc35899 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1625,9 +1625,6 @@ static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
 	case BJ_Shadow:
 		list = &transaction->t_shadow_list;
 		break;
-	case BJ_LogCtl:
-		list = &transaction->t_log_list;
-		break;
 	case BJ_Reserved:
 		list = &transaction->t_reserved_list;
 		break;
@@ -2126,9 +2123,6 @@ void __jbd2_journal_file_buffer(struct journal_head *jh,
 	case BJ_Shadow:
 		list = &transaction->t_shadow_list;
 		break;
-	case BJ_LogCtl:
-		list = &transaction->t_log_list;
-		break;
 	case BJ_Reserved:
 		list = &transaction->t_reserved_list;
 		break;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a670595..4584518 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -530,12 +530,6 @@ struct transaction_s
 	struct journal_head	*t_shadow_list;
 
 	/*
-	 * Doubly-linked circular list of control buffers being written to the
-	 * log. [j_list_lock]
-	 */
-	struct journal_head	*t_log_list;
-
-	/*
 	 * List of inodes whose data we've modified in data=ordered mode.
 	 * [j_list_lock]
 	 */
@@ -994,7 +988,7 @@ static inline void jbd2_unfile_log_bh(struct buffer_head *bh)
 }
 
 /* Log buffer allocation */
-extern struct journal_head * jbd2_journal_get_descriptor_buffer(journal_t *);
+struct buffer_head *jbd2_journal_get_descriptor_buffer(journal_t *journal);
 int jbd2_journal_next_log_block(journal_t *, unsigned long long *);
 int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid,
 			      unsigned long *block);
@@ -1178,8 +1172,10 @@ extern int	   jbd2_journal_init_revoke_caches(void);
 extern void	   jbd2_journal_destroy_revoke(journal_t *);
 extern int	   jbd2_journal_revoke (handle_t *, unsigned long long, struct buffer_head *);
 extern int	   jbd2_journal_cancel_revoke(handle_t *, struct journal_head *);
-extern void	   jbd2_journal_write_revoke_records(journal_t *,
-						     transaction_t *, int);
+extern void	   jbd2_journal_write_revoke_records(journal_t *journal,
+						     transaction_t *transaction,
+						     struct list_head *log_bufs,
+						     int write_op);
 
 /* Recovery revoke support */
 extern int	jbd2_journal_set_revoke(journal_t *, unsigned long long, tid_t);
@@ -1286,9 +1282,8 @@ static inline int jbd_space_needed(journal_t *journal)
 #define BJ_Metadata	1	/* Normal journaled metadata */
 #define BJ_Forget	2	/* Buffer superseded by this transaction */
 #define BJ_Shadow	3	/* Buffer contents being shadowed to the log */
-#define BJ_LogCtl	4	/* Buffer contains log descriptors */
-#define BJ_Reserved	5	/* Buffer is reserved for access by journal */
-#define BJ_Types	6
+#define BJ_Reserved	4	/* Buffer is reserved for access by journal */
+#define BJ_Types	5
 
 extern int jbd_blocks_per_page(struct inode *inode);
 
-- 
1.7.1



* [PATCH 07/29] jbd2: Refine waiting for shadow buffers
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (5 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 06/29] jbd2: Remove journal_head from descriptor buffers Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-03 14:16   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 08/29] jbd2: Remove outdated comment Jan Kara
                   ` (21 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Currently, when we add a buffer to a transaction, we wait until the
buffer is removed from the BJ_Shadow list (so that we prevent any
changes to a buffer that is just being written to the journal). This can
take unnecessarily long, as a lot happens between the time the buffer is
submitted to the journal and the time when we remove it from the
BJ_Shadow list (e.g. we wait for all data buffers in the transaction, we
issue a cache flush, etc.). This also makes do_get_write_access() depend
on transaction commit (namely on waiting for data IO to complete), which
we want to avoid when implementing transaction reservation.

So modify the commit code to set the new BH_Shadow flag when a temporary
shadowing buffer is created, and clear that flag once IO on that buffer
is complete. This allows do_get_write_access() to wait only for the
BH_Shadow bit and thus removes the dependency on data IO completion.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c           |   20 ++++++++++----------
 fs/jbd2/journal.c          |    2 ++
 fs/jbd2/transaction.c      |   44 +++++++++++++++++++-------------------------
 include/linux/jbd.h        |   25 +++++++++++++++++++++++++
 include/linux/jbd2.h       |   28 ++++++++++++++++++++++++++++
 include/linux/jbd_common.h |   26 --------------------------
 6 files changed, 84 insertions(+), 61 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 1a03762..4863f5b 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -30,15 +30,22 @@
 #include <trace/events/jbd2.h>
 
 /*
- * Default IO end handler for temporary BJ_IO buffer_heads.
+ * IO end handler for temporary buffer_heads handling writes to the journal.
  */
 static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
+	struct buffer_head *orig_bh = bh->b_private;
+
 	BUFFER_TRACE(bh, "");
 	if (uptodate)
 		set_buffer_uptodate(bh);
 	else
 		clear_buffer_uptodate(bh);
+	if (orig_bh) {
+		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
+		smp_mb__after_clear_bit();
+		wake_up_bit(&orig_bh->b_state, BH_Shadow);
+	}
 	unlock_buffer(bh);
 }
 
@@ -818,7 +825,7 @@ start_journal_io:
 		jbd2_unfile_log_bh(bh);
 
 		/*
-		 * The list contains temporary buffer heads created by
+		 * The list contains temporary buffer heads created by
 		 * jbd2_journal_write_metadata_buffer().
 		 */
 		BUFFER_TRACE(bh, "dumping temporary bh");
@@ -831,6 +838,7 @@ start_journal_io:
 		bh = jh2bh(jh);
 		clear_buffer_jwrite(bh);
 		J_ASSERT_BH(bh, buffer_jbddirty(bh));
+		J_ASSERT_BH(bh, !buffer_shadow(bh));
 
 		/* The metadata is now released for reuse, but we need
                    to remember it against this transaction so that when
@@ -838,14 +846,6 @@ start_journal_io:
                    required. */
 		JBUFFER_TRACE(jh, "file as BJ_Forget");
 		jbd2_journal_file_buffer(jh, commit_transaction, BJ_Forget);
-		/*
-		 * Wake up any transactions which were waiting for this IO to
-		 * complete. The barrier must be here so that changes by
-		 * jbd2_journal_file_buffer() take effect before wake_up_bit()
-		 * does the waitqueue check.
-		 */
-		smp_mb();
-		wake_up_bit(&bh->b_state, BH_Unshadow);
 		JBUFFER_TRACE(jh, "brelse shadowed buffer");
 		__brelse(bh);
 	}
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e03aae0..e9a9cdb 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -453,6 +453,7 @@ repeat:
 	new_bh->b_size = bh_in->b_size;
 	new_bh->b_bdev = journal->j_dev;
 	new_bh->b_blocknr = blocknr;
+	new_bh->b_private = bh_in;
 	set_buffer_mapped(new_bh);
 	set_buffer_dirty(new_bh);
 
@@ -467,6 +468,7 @@ repeat:
 	spin_lock(&journal->j_list_lock);
 	__jbd2_journal_file_buffer(jh_in, transaction, BJ_Shadow);
 	spin_unlock(&journal->j_list_lock);
+	set_buffer_shadow(bh_in);
 	jbd_unlock_bh_state(bh_in);
 
 	return do_escape | (done_copy_out << 1);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index bc35899..81df09c 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -620,6 +620,12 @@ static void warn_dirty_buffer(struct buffer_head *bh)
 	       bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr);
 }
 
+static int sleep_on_shadow_bh(void *word)
+{
+	io_schedule();
+	return 0;
+}
+
 /*
  * If the buffer is already part of the current transaction, then there
  * is nothing we need to do.  If it is already part of a prior
@@ -747,41 +753,29 @@ repeat:
 		 * journaled.  If the primary copy is already going to
 		 * disk then we cannot do copy-out here. */
 
-		if (jh->b_jlist == BJ_Shadow) {
-			DEFINE_WAIT_BIT(wait, &bh->b_state, BH_Unshadow);
-			wait_queue_head_t *wqh;
-
-			wqh = bit_waitqueue(&bh->b_state, BH_Unshadow);
-
+		if (buffer_shadow(bh)) {
 			JBUFFER_TRACE(jh, "on shadow: sleep");
 			jbd_unlock_bh_state(bh);
-			/* commit wakes up all shadow buffers after IO */
-			for ( ; ; ) {
-				prepare_to_wait(wqh, &wait.wait,
-						TASK_UNINTERRUPTIBLE);
-				if (jh->b_jlist != BJ_Shadow)
-					break;
-				schedule();
-			}
-			finish_wait(wqh, &wait.wait);
+			wait_on_bit(&bh->b_state, BH_Shadow,
+				    sleep_on_shadow_bh, TASK_UNINTERRUPTIBLE);
 			goto repeat;
 		}
 
-		/* Only do the copy if the currently-owning transaction
-		 * still needs it.  If it is on the Forget list, the
-		 * committing transaction is past that stage.  The
-		 * buffer had better remain locked during the kmalloc,
-		 * but that should be true --- we hold the journal lock
-		 * still and the buffer is already on the BUF_JOURNAL
-		 * list so won't be flushed.
+		/*
+		 * Only do the copy if the currently-owning transaction still
+		 * needs it. If buffer isn't on BJ_Metadata list, the
+		 * committing transaction is past that stage (here we use the
+		 * fact that BH_Shadow is set under bh_state lock together with
+		 * refiling to BJ_Shadow list and at this point we know the
+		 * buffer doesn't have BH_Shadow set).
 		 *
 		 * Subtle point, though: if this is a get_undo_access,
 		 * then we will be relying on the frozen_data to contain
 		 * the new value of the committed_data record after the
 		 * transaction, so we HAVE to force the frozen_data copy
-		 * in that case. */
-
-		if (jh->b_jlist != BJ_Forget || force_copy) {
+		 * in that case.
+		 */
+		if (jh->b_jlist == BJ_Metadata || force_copy) {
 			JBUFFER_TRACE(jh, "generate frozen data");
 			if (!frozen_buffer) {
 				JBUFFER_TRACE(jh, "allocate memory for buffer");
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index c8f3297..1c36b0c 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -244,6 +244,31 @@ typedef struct journal_superblock_s
 
 #include <linux/fs.h>
 #include <linux/sched.h>
+
+enum jbd_state_bits {
+	BH_JBD			/* Has an attached ext3 journal_head */
+	  = BH_PrivateStart,
+	BH_JWrite,		/* Being written to log (@@@ DEBUGGING) */
+	BH_Freed,		/* Has been freed (truncated) */
+	BH_Revoked,		/* Has been revoked from the log */
+	BH_RevokeValid,		/* Revoked flag is valid */
+	BH_JBDDirty,		/* Is dirty but journaled */
+	BH_State,		/* Pins most journal_head state */
+	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
+	BH_Unshadow,		/* Dummy bit, for BJ_Shadow wakeup filtering */
+	BH_JBDPrivateStart,	/* First bit available for private use by FS */
+};
+
+BUFFER_FNS(JBD, jbd)
+BUFFER_FNS(JWrite, jwrite)
+BUFFER_FNS(JBDDirty, jbddirty)
+TAS_BUFFER_FNS(JBDDirty, jbddirty)
+BUFFER_FNS(Revoked, revoked)
+TAS_BUFFER_FNS(Revoked, revoked)
+BUFFER_FNS(RevokeValid, revokevalid)
+TAS_BUFFER_FNS(RevokeValid, revokevalid)
+BUFFER_FNS(Freed, freed)
+
 #include <linux/jbd_common.h>
 
 #define J_ASSERT(assert)	BUG_ON(!(assert))
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 4584518..be5115f 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -302,6 +302,34 @@ typedef struct journal_superblock_s
 
 #include <linux/fs.h>
 #include <linux/sched.h>
+
+enum jbd_state_bits {
+	BH_JBD			/* Has an attached ext3 journal_head */
+	  = BH_PrivateStart,
+	BH_JWrite,		/* Being written to log (@@@ DEBUGGING) */
+	BH_Freed,		/* Has been freed (truncated) */
+	BH_Revoked,		/* Has been revoked from the log */
+	BH_RevokeValid,		/* Revoked flag is valid */
+	BH_JBDDirty,		/* Is dirty but journaled */
+	BH_State,		/* Pins most journal_head state */
+	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
+	BH_Shadow,		/* IO on shadow buffer is running */
+	BH_Verified,		/* Metadata block has been verified ok */
+	BH_JBDPrivateStart,	/* First bit available for private use by FS */
+};
+
+BUFFER_FNS(JBD, jbd)
+BUFFER_FNS(JWrite, jwrite)
+BUFFER_FNS(JBDDirty, jbddirty)
+TAS_BUFFER_FNS(JBDDirty, jbddirty)
+BUFFER_FNS(Revoked, revoked)
+TAS_BUFFER_FNS(Revoked, revoked)
+BUFFER_FNS(RevokeValid, revokevalid)
+TAS_BUFFER_FNS(RevokeValid, revokevalid)
+BUFFER_FNS(Freed, freed)
+BUFFER_FNS(Shadow, shadow)
+BUFFER_FNS(Verified, verified)
+
 #include <linux/jbd_common.h>
 
 #define J_ASSERT(assert)	BUG_ON(!(assert))
diff --git a/include/linux/jbd_common.h b/include/linux/jbd_common.h
index 6133679..b1f7089 100644
--- a/include/linux/jbd_common.h
+++ b/include/linux/jbd_common.h
@@ -1,32 +1,6 @@
 #ifndef _LINUX_JBD_STATE_H
 #define _LINUX_JBD_STATE_H
 
-enum jbd_state_bits {
-	BH_JBD			/* Has an attached ext3 journal_head */
-	  = BH_PrivateStart,
-	BH_JWrite,		/* Being written to log (@@@ DEBUGGING) */
-	BH_Freed,		/* Has been freed (truncated) */
-	BH_Revoked,		/* Has been revoked from the log */
-	BH_RevokeValid,		/* Revoked flag is valid */
-	BH_JBDDirty,		/* Is dirty but journaled */
-	BH_State,		/* Pins most journal_head state */
-	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
-	BH_Unshadow,		/* Dummy bit, for BJ_Shadow wakeup filtering */
-	BH_Verified,		/* Metadata block has been verified ok */
-	BH_JBDPrivateStart,	/* First bit available for private use by FS */
-};
-
-BUFFER_FNS(JBD, jbd)
-BUFFER_FNS(JWrite, jwrite)
-BUFFER_FNS(JBDDirty, jbddirty)
-TAS_BUFFER_FNS(JBDDirty, jbddirty)
-BUFFER_FNS(Revoked, revoked)
-TAS_BUFFER_FNS(Revoked, revoked)
-BUFFER_FNS(RevokeValid, revokevalid)
-TAS_BUFFER_FNS(RevokeValid, revokevalid)
-BUFFER_FNS(Freed, freed)
-BUFFER_FNS(Verified, verified)

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 08/29] jbd2: Remove outdated comment
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (6 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 07/29] jbd2: Refine waiting for shadow buffers Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-03 14:20   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 09/29] jbd2: Cleanup needed free block estimates when starting a transaction Jan Kara
                   ` (20 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

The comment about credit estimates isn't true anymore. We do what the
comment describes now.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c |   10 ----------
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 81df09c..74cfbd3 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -283,16 +283,6 @@ repeat:
 	 * reduce the free space arbitrarily.  Be careful to account for
 	 * those buffers when checkpointing.
 	 */

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 09/29] jbd2: Cleanup needed free block estimates when starting a transaction
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (7 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 08/29] jbd2: Remove outdated comment Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-05  8:17   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 10/29] jbd2: Fix race in t_outstanding_credits update in jbd2_journal_extend() Jan Kara
                   ` (19 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

__jbd2_log_space_left() and jbd_space_needed() were kind of odd.
jbd_space_needed() accounted also credits needed for currently committing
transaction while it didn't account for credits needed for control blocks.
__jbd2_log_space_left() then accounted for control blocks as a fraction of free
space. Since results of these two functions are always only compared against
each other, this works correctly but is somewhat strange. Move the estimates so
that jbd_space_needed() returns number of blocks needed for a transaction
including control blocks and __jbd2_log_space_left() returns free space in the
journal (with the committing transaction already subtracted). Rename functions
to jbd2_log_space_left() and jbd2_space_needed() while we are changing them.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c  |    8 ++++----
 fs/jbd2/journal.c     |   29 -----------------------------
 fs/jbd2/transaction.c |    9 +++++----
 include/linux/jbd2.h  |   32 ++++++++++++++++++++++++++------
 4 files changed, 35 insertions(+), 43 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 65ec076..a572383 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -120,8 +120,8 @@ void __jbd2_log_wait_for_space(journal_t *journal)
 	int nblocks, space_left;
 	/* assert_spin_locked(&journal->j_state_lock); */
 
-	nblocks = jbd_space_needed(journal);
-	while (__jbd2_log_space_left(journal) < nblocks) {
+	nblocks = jbd2_space_needed(journal);
+	while (jbd2_log_space_left(journal) < nblocks) {
 		if (journal->j_flags & JBD2_ABORT)
 			return;
 		write_unlock(&journal->j_state_lock);
@@ -140,8 +140,8 @@ void __jbd2_log_wait_for_space(journal_t *journal)
 		 */
 		write_lock(&journal->j_state_lock);
 		spin_lock(&journal->j_list_lock);
-		nblocks = jbd_space_needed(journal);
-		space_left = __jbd2_log_space_left(journal);
+		nblocks = jbd2_space_needed(journal);
+		space_left = jbd2_log_space_left(journal);
 		if (space_left < nblocks) {
 			int chkpt = journal->j_checkpoint_transactions != NULL;
 			tid_t tid = 0;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e9a9cdb..e6f14e0 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -480,35 +480,6 @@ repeat:
  */
 
 /*
- * __jbd2_log_space_left: Return the number of free blocks left in the journal.
- *
- * Called with the journal already locked.
- *
- * Called under j_state_lock
- */
-
-int __jbd2_log_space_left(journal_t *journal)
-{
-	int left = journal->j_free;
-
-	/* assert_spin_locked(&journal->j_state_lock); */
-
-	/*
-	 * Be pessimistic here about the number of those free blocks which
-	 * might be required for log descriptor control blocks.
-	 */
-
-#define MIN_LOG_RESERVED_BLOCKS 32 /* Allow for rounding errors */
-
-	left -= MIN_LOG_RESERVED_BLOCKS;
-
-	if (left <= 0)
-		return 0;
-	left -= (left >> 3);
-	return left;
-}
-
-/*
  * Called with j_state_lock locked for writing.
  * Returns true if a transaction commit was started.
  */
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 74cfbd3..aee40c9 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -283,12 +283,12 @@ repeat:
 	 * reduce the free space arbitrarily.  Be careful to account for
 	 * those buffers when checkpointing.
 	 */
-	if (__jbd2_log_space_left(journal) < jbd_space_needed(journal)) {
+	if (jbd2_log_space_left(journal) < jbd2_space_needed(journal)) {
 		jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle);
 		atomic_sub(nblocks, &transaction->t_outstanding_credits);
 		read_unlock(&journal->j_state_lock);
 		write_lock(&journal->j_state_lock);
-		if (__jbd2_log_space_left(journal) < jbd_space_needed(journal))
+		if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
 			__jbd2_log_wait_for_space(journal);
 		write_unlock(&journal->j_state_lock);
 		goto repeat;
@@ -306,7 +306,7 @@ repeat:
 	jbd_debug(4, "Handle %p given %d credits (total %d, free %d)\n",
 		  handle, nblocks,
 		  atomic_read(&transaction->t_outstanding_credits),
-		  __jbd2_log_space_left(journal));
+		  jbd2_log_space_left(journal));
 	read_unlock(&journal->j_state_lock);
 
 	lock_map_acquire(&handle->h_lockdep_map);
@@ -442,7 +442,8 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 		goto unlock;
 	}
 
-	if (wanted > __jbd2_log_space_left(journal)) {
+	if (wanted + (wanted >> JBD2_CONTROL_BLOCKS_SHIFT) >
+	    jbd2_log_space_left(journal)) {
 		jbd_debug(3, "denied handle %p %d blocks: "
 			  "insufficient log space\n", handle, nblocks);
 		goto unlock;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index be5115f..9197d1b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1219,7 +1219,6 @@ extern void	jbd2_clear_buffer_revoked_flags(journal_t *journal);
  * transitions on demand.
  */
 
-int __jbd2_log_space_left(journal_t *); /* Called with journal locked */
 int jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
@@ -1289,16 +1288,37 @@ extern int jbd2_journal_blocks_per_page(struct inode *inode);
 extern size_t journal_tag_bytes(journal_t *journal);
 
 /*
+ * We reserve t_outstanding_credits >> JBD2_CONTROL_BLOCKS_SHIFT for
+ * transaction control blocks.
+ */
+#define JBD2_CONTROL_BLOCKS_SHIFT 5
+
+/*
  * Return the minimum number of blocks which must be free in the journal
  * before a new transaction may be started.  Must be called under j_state_lock.
  */
-static inline int jbd_space_needed(journal_t *journal)
+static inline int jbd2_space_needed(journal_t *journal)
 {
 	int nblocks = journal->j_max_transaction_buffers;
-	if (journal->j_committing_transaction)
-		nblocks += atomic_read(&journal->j_committing_transaction->
-				       t_outstanding_credits);
-	return nblocks;
+	return nblocks + (nblocks >> JBD2_CONTROL_BLOCKS_SHIFT);
+}
+
+/*
+ * Return number of free blocks in the log. Must be called under j_state_lock.
+ */
+static inline unsigned long jbd2_log_space_left(journal_t *journal)
+{
+	/* Allow for rounding errors */
+	unsigned long free = journal->j_free - 32;
+
+	if (journal->j_committing_transaction) {
+		unsigned long committing = atomic_read(&journal->
+			j_committing_transaction->t_outstanding_credits);
+
+		/* Transaction + control blocks */
+		free -= committing + (committing >> JBD2_CONTROL_BLOCKS_SHIFT);
+	}
+	return free;
 }
 
 /*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 10/29] jbd2: Fix race in t_outstanding_credits update in jbd2_journal_extend()
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (8 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 09/29] jbd2: Cleanup needed free block estimates when starting a transaction Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-05  8:37   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 11/29] jbd2: Remove unused waitqueues Jan Kara
                   ` (18 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

jbd2_journal_extend() first checked whether transaction can accept extending
handle with more credits and then added credits to t_outstanding_credits.
This can race with start_this_handle() adding another handle to a transaction
and thus overbooking a transaction. Make jbd2_journal_extend() use
atomic_add_return() to close the race.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index aee40c9..9639e47 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -434,11 +434,13 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 	}
 
 	spin_lock(&transaction->t_handle_lock);
-	wanted = atomic_read(&transaction->t_outstanding_credits) + nblocks;
+	wanted = atomic_add_return(nblocks,
+				   &transaction->t_outstanding_credits);
 
 	if (wanted > journal->j_max_transaction_buffers) {
 		jbd_debug(3, "denied handle %p %d blocks: "
 			  "transaction too large\n", handle, nblocks);
+		atomic_sub(nblocks, &transaction->t_outstanding_credits);
 		goto unlock;
 	}
 
@@ -446,6 +448,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 	    jbd2_log_space_left(journal)) {
 		jbd_debug(3, "denied handle %p %d blocks: "
 			  "insufficient log space\n", handle, nblocks);
+		atomic_sub(nblocks, &transaction->t_outstanding_credits);
 		goto unlock;
 	}
 
@@ -457,7 +460,6 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 
 	handle->h_buffer_credits += nblocks;
 	handle->h_requested_credits += nblocks;
-	atomic_add(nblocks, &transaction->t_outstanding_credits);
 	result = 0;
 
 	jbd_debug(3, "extended handle %p by %d\n", handle, nblocks);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 11/29] jbd2: Remove unused waitqueues
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (9 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 10/29] jbd2: Fix race in t_outstanding_credits update in jbd2_journal_extend() Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-05  8:41   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 12/29] jbd2: Transaction reservation support Jan Kara
                   ` (17 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

j_wait_logspace and j_wait_checkpoint are unused. Remove them.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c |    4 ----
 fs/jbd2/journal.c    |    2 --
 include/linux/jbd2.h |    8 --------
 3 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index a572383..75a15f3 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -625,10 +625,6 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
 
 	__jbd2_journal_drop_transaction(journal, transaction);
 	jbd2_journal_free_transaction(transaction);
-
-	/* Just in case anybody was waiting for more transactions to be
-           checkpointed... */
-	wake_up(&journal->j_wait_logspace);
 	ret = 1;
 out:
 	return ret;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e6f14e0..63e2929 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -998,9 +998,7 @@ static journal_t * journal_init_common (void)
 		return NULL;
 
 	init_waitqueue_head(&journal->j_wait_transaction_locked);
-	init_waitqueue_head(&journal->j_wait_logspace);
 	init_waitqueue_head(&journal->j_wait_done_commit);
-	init_waitqueue_head(&journal->j_wait_checkpoint);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
 	mutex_init(&journal->j_barrier);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 9197d1b..ad4b3bb 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -686,9 +686,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
  *  waiting for checkpointing
  * @j_wait_transaction_locked: Wait queue for waiting for a locked transaction
  *  to start committing, or for a barrier lock to be released
- * @j_wait_logspace: Wait queue for waiting for checkpointing to complete
  * @j_wait_done_commit: Wait queue for waiting for commit to complete
- * @j_wait_checkpoint:  Wait queue to trigger checkpointing
  * @j_wait_commit: Wait queue to trigger commit
  * @j_wait_updates: Wait queue to wait for updates to complete
  * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
@@ -793,15 +791,9 @@ struct journal_s
 	 */
 	wait_queue_head_t	j_wait_transaction_locked;
 
-	/* Wait queue for waiting for checkpointing to complete */
-	wait_queue_head_t	j_wait_logspace;
-
 	/* Wait queue for waiting for commit to complete */
 	wait_queue_head_t	j_wait_done_commit;
 
-	/* Wait queue to trigger checkpointing */
-	wait_queue_head_t	j_wait_checkpoint;

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 12/29] jbd2: Transaction reservation support
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (10 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 11/29] jbd2: Remove unused waitqueues Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-05  9:39   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 13/29] ext4: Provide wrappers for transaction reservation calls Jan Kara
                   ` (16 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

In some cases we cannot start a transaction because of locking constraints and
passing started transaction into those places is not handy either because we
could block transaction commit for too long. Transaction reservation is
designed to solve these issues. It reserves a handle with given number of
credits in the journal and the handle can be later attached to the running
transaction without blocking on commit or checkpointing. Reserved handles do
not block transaction commit in any way, they only reduce maximum size of the
running transaction (because we have to always be prepared to accommodate
request for attaching reserved handle).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c      |    6 +
 fs/jbd2/journal.c     |    2 +
 fs/jbd2/transaction.c |  289 +++++++++++++++++++++++++++++++++++++------------
 include/linux/jbd2.h  |   21 ++++-
 4 files changed, 245 insertions(+), 73 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 4863f5b..59c572e 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -522,6 +522,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	 */
 	jbd2_journal_switch_revoke_table(journal);
 
+	/*
+	 * Reserved credits cannot be claimed anymore, free them
+	 */
+	atomic_sub(atomic_read(&journal->j_reserved_credits),
+		   &commit_transaction->t_outstanding_credits);
+
 	trace_jbd2_commit_flushing(journal, commit_transaction);
 	stats.run.rs_flushing = jiffies;
 	stats.run.rs_locked = jbd2_time_diff(stats.run.rs_locked,
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 63e2929..04c52ac 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1001,6 +1001,7 @@ static journal_t * journal_init_common (void)
 	init_waitqueue_head(&journal->j_wait_done_commit);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
+	init_waitqueue_head(&journal->j_wait_reserved);
 	mutex_init(&journal->j_barrier);
 	mutex_init(&journal->j_checkpoint_mutex);
 	spin_lock_init(&journal->j_revoke_lock);
@@ -1010,6 +1011,7 @@ static journal_t * journal_init_common (void)
 	journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE);
 	journal->j_min_batch_time = 0;
 	journal->j_max_batch_time = 15000; /* 15ms */
+	atomic_set(&journal->j_reserved_credits, 0);
 
 	/* The journal is marked for error until we succeed with recovery! */
 	journal->j_flags = JBD2_ABORT;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 9639e47..036c01c 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -89,7 +89,8 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction)
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
 	atomic_set(&transaction->t_updates, 0);
-	atomic_set(&transaction->t_outstanding_credits, 0);
+	atomic_set(&transaction->t_outstanding_credits,
+		   atomic_read(&journal->j_reserved_credits));
 	atomic_set(&transaction->t_handle_count, 0);
 	INIT_LIST_HEAD(&transaction->t_inode_list);
 	INIT_LIST_HEAD(&transaction->t_private_list);
@@ -141,6 +142,91 @@ static inline void update_t_max_wait(transaction_t *transaction,
 }
 
 /*
+ * Wait until running transaction passes T_LOCKED state. Also starts the commit
+ * if needed. The function expects running transaction to exist and releases
+ * j_state_lock.
+ */
+static void wait_transaction_locked(journal_t *journal)
+	__releases(journal->j_state_lock)
+{
+	DEFINE_WAIT(wait);
+	int need_to_start;
+	tid_t tid = journal->j_running_transaction->t_tid;
+
+	prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
+			TASK_UNINTERRUPTIBLE);
+	need_to_start = !tid_geq(journal->j_commit_request, tid);
+	read_unlock(&journal->j_state_lock);
+	if (need_to_start)
+		jbd2_log_start_commit(journal, tid);
+	schedule();
+	finish_wait(&journal->j_wait_transaction_locked, &wait);
+}
+
+/*
+ * Wait until we can add credits for handle to the running transaction.  Called
+ * with j_state_lock held for reading. Returns 0 if handle joined the running
+ * transaction. Returns 1 if we had to wait, j_state_lock is dropped, and
+ * caller must retry.
+ */
+static int add_transaction_credits(journal_t *journal, handle_t *handle)
+{
+	transaction_t *t = journal->j_running_transaction;
+	int nblocks = handle->h_buffer_credits;
+	int needed;
+
+	/*
+	 * If the current transaction is locked down for commit, wait
+	 * for the lock to be released.
+	 */
+	if (t->t_state == T_LOCKED) {
+		wait_transaction_locked(journal);
+		return 1;
+	}
+
+	/*
+	 * If there is not enough space left in the log to write all
+	 * potential buffers requested by this operation, we need to
+	 * stall pending a log checkpoint to free some more log space.
+	 */
+	needed = atomic_add_return(nblocks, &t->t_outstanding_credits);
+	if (needed > journal->j_max_transaction_buffers) {
+		/*
+		 * If the current transaction is already too large,
+		 * then start to commit it: we can then go back and
+		 * attach this handle to a new transaction.
+		 */
+		jbd_debug(2, "Handle %p starting new commit...\n", handle);
+		atomic_sub(nblocks, &t->t_outstanding_credits);
+		wait_transaction_locked(journal);
+		return 1;
+	}
+
+	/*
+	 * The commit code assumes that it can get enough log space
+	 * without forcing a checkpoint.  This is *critical* for
+	 * correctness: a checkpoint of a buffer which is also
+	 * associated with a committing transaction creates a deadlock,
+	 * so commit simply cannot force through checkpoints.
+	 *
+	 * We must therefore ensure the necessary space in the journal
+	 * *before* starting to dirty potentially checkpointed buffers
+	 * in the new transaction.
+	 */
+	if (jbd2_log_space_left(journal) < jbd2_space_needed(journal)) {
+		jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle);
+		atomic_sub(nblocks, &t->t_outstanding_credits);
+		read_unlock(&journal->j_state_lock);
+		write_lock(&journal->j_state_lock);
+		if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
+			__jbd2_log_wait_for_space(journal);
+		write_unlock(&journal->j_state_lock);
+		return 1;
+	}
+	return 0;
+}
+
+/*
  * start_this_handle: Given a handle, deal with any locking or stalling
  * needed to make sure that there is enough journal space for the handle
  * to begin.  Attach the handle to a transaction and set up the
@@ -151,12 +237,14 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 			     gfp_t gfp_mask)
 {
 	transaction_t	*transaction, *new_transaction = NULL;
-	tid_t		tid;
-	int		needed, need_to_start;
 	int		nblocks = handle->h_buffer_credits;
 	unsigned long ts = jiffies;
 
-	if (nblocks > journal->j_max_transaction_buffers) {
+	/*
+	 * 1/2 of transaction can be reserved so we can practically handle
+	 * only 1/2 of maximum transaction size per operation
+	 */
+	if (nblocks > journal->j_max_transaction_buffers / 2) {
 		printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
 		       current->comm, nblocks,
 		       journal->j_max_transaction_buffers);
@@ -223,75 +311,18 @@ repeat:
 
 	transaction = journal->j_running_transaction;
 
-	/*
-	 * If the current transaction is locked down for commit, wait for the
-	 * lock to be released.
-	 */
-	if (transaction->t_state == T_LOCKED) {
-		DEFINE_WAIT(wait);
-
-		prepare_to_wait(&journal->j_wait_transaction_locked,
-					&wait, TASK_UNINTERRUPTIBLE);
-		read_unlock(&journal->j_state_lock);
-		schedule();
-		finish_wait(&journal->j_wait_transaction_locked, &wait);
-		goto repeat;
-	}
-
-	/*
-	 * If there is not enough space left in the log to write all potential
-	 * buffers requested by this operation, we need to stall pending a log
-	 * checkpoint to free some more log space.
-	 */
-	needed = atomic_add_return(nblocks,
-				   &transaction->t_outstanding_credits);
-
-	if (needed > journal->j_max_transaction_buffers) {
+	if (!handle->h_reserved) {
+		if (add_transaction_credits(journal, handle))
+			goto repeat;
+	} else {
 		/*
-		 * If the current transaction is already too large, then start
-		 * to commit it: we can then go back and attach this handle to
-		 * a new transaction.
+		 * We have a reserved handle, so we are allowed to join the
+		 * T_LOCKED transaction and we don't have to check transaction
+		 * size and journal space.
 		 */
-		DEFINE_WAIT(wait);
-
-		jbd_debug(2, "Handle %p starting new commit...\n", handle);
-		atomic_sub(nblocks, &transaction->t_outstanding_credits);
-		prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
-				TASK_UNINTERRUPTIBLE);
-		tid = transaction->t_tid;
-		need_to_start = !tid_geq(journal->j_commit_request, tid);
-		read_unlock(&journal->j_state_lock);
-		if (need_to_start)
-			jbd2_log_start_commit(journal, tid);
-		schedule();
-		finish_wait(&journal->j_wait_transaction_locked, &wait);
-		goto repeat;
-	}
-
-	/*
-	 * The commit code assumes that it can get enough log space
-	 * without forcing a checkpoint.  This is *critical* for
-	 * correctness: a checkpoint of a buffer which is also
-	 * associated with a committing transaction creates a deadlock,
-	 * so commit simply cannot force through checkpoints.
-	 *
-	 * We must therefore ensure the necessary space in the journal
-	 * *before* starting to dirty potentially checkpointed buffers
-	 * in the new transaction.
-	 *
-	 * The worst part is, any transaction currently committing can
-	 * reduce the free space arbitrarily.  Be careful to account for
-	 * those buffers when checkpointing.
-	 */
-	if (jbd2_log_space_left(journal) < jbd2_space_needed(journal)) {
-		jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle);
-		atomic_sub(nblocks, &transaction->t_outstanding_credits);
-		read_unlock(&journal->j_state_lock);
-		write_lock(&journal->j_state_lock);
-		if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
-			__jbd2_log_wait_for_space(journal);
-		write_unlock(&journal->j_state_lock);
-		goto repeat;
+		atomic_sub(nblocks, &journal->j_reserved_credits);
+		wake_up(&journal->j_wait_reserved);
+		handle->h_reserved = 0;
 	}
 
 	/* OK, account for the buffers that this operation expects to
@@ -390,6 +421,122 @@ handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
 }
 EXPORT_SYMBOL(jbd2_journal_start);
 
+/**
+ * handle_t *jbd2_journal_reserve(journal_t *journal, int nblocks)
+ * @journal: journal to reserve transaction on.
+ * @nblocks: number of blocks we might modify
+ *
+ * This function reserves a transaction with @nblocks blocks in @journal.  The
+ * function waits for enough journal space to be available and possibly also
+ * for some reservations to be converted to real transactions if there are too
+ * many of them. Note that this means that calling this function while
+ * another transaction is started or reserved can cause a deadlock. The returned
+ * handle cannot be used for anything until it is started using
+ * jbd2_journal_start_reserved().
+ */
+handle_t *jbd2_journal_reserve(journal_t *journal, int nblocks,
+			       unsigned int type, unsigned int line_no)
+{
+	handle_t *handle;
+	unsigned long wanted;
+
+	handle = new_handle(nblocks);
+	if (!handle)
+		return ERR_PTR(-ENOMEM);
+	handle->h_journal = journal;
+	handle->h_reserved = 1;
+	handle->h_type = type;
+	handle->h_line_no = line_no;
+
+repeat:
+	/*
+	 * We need j_state_lock early to prevent transaction creation from
+	 * racing with us and using an elevated j_reserved_credits.
+	 */
+	read_lock(&journal->j_state_lock);
+	wanted = atomic_add_return(nblocks, &journal->j_reserved_credits);
+	/* We allow at most half of a transaction to be reserved */
+	if (wanted > journal->j_max_transaction_buffers / 2) {
+		atomic_sub(nblocks, &journal->j_reserved_credits);
+		read_unlock(&journal->j_state_lock);
+		wait_event(journal->j_wait_reserved,
+			   atomic_read(&journal->j_reserved_credits) + nblocks
+			   <= journal->j_max_transaction_buffers / 2);
+		goto repeat;
+	}
+	if (journal->j_running_transaction) {
+		transaction_t *t = journal->j_running_transaction;
+
+		wanted = atomic_add_return(nblocks,
+					   &t->t_outstanding_credits);
+		if (wanted > journal->j_max_transaction_buffers) {
+			atomic_sub(nblocks, &t->t_outstanding_credits);
+			atomic_sub(nblocks, &journal->j_reserved_credits);
+			wait_transaction_locked(journal);
+			goto repeat;
+		}
+	}
+	read_unlock(&journal->j_state_lock);
+
+	return handle;
+}
+EXPORT_SYMBOL(jbd2_journal_reserve);
+
+void jbd2_journal_free_reserved(handle_t *handle)
+{
+	journal_t *journal = handle->h_journal;
+
+	atomic_sub(handle->h_buffer_credits, &journal->j_reserved_credits);
+	wake_up(&journal->j_wait_reserved);
+	jbd2_free_handle(handle);
+}
+EXPORT_SYMBOL(jbd2_journal_free_reserved);
+
+/**
+ * int jbd2_journal_start_reserved(handle_t *handle) - start reserved handle
+ * @handle: handle to start
+ *
+ * Start handle that has been previously reserved with jbd2_journal_reserve().
+ * This attaches @handle to the running transaction (or creates one if no
+ * transaction is running). Unlike jbd2_journal_start() this function cannot
+ * block on journal commit, checkpointing, or similar events. It can block
+ * on memory allocation or a frozen journal though.
+ *
+ * Return 0 on success, non-zero on error - handle is freed in that case.
+ */
+int jbd2_journal_start_reserved(handle_t *handle)
+{
+	journal_t *journal = handle->h_journal;
+	int ret = -EIO;
+
+	if (WARN_ON(!handle->h_reserved)) {
+		/* Someone passed in normal handle? Just stop it. */
+		jbd2_journal_stop(handle);
+		return ret;
+	}
+	/*
+	 * Mixing reserved and unreserved handles is of questionable use.
+	 * So far nobody seems to need it so just error out.
+	 */
+	if (WARN_ON(current->journal_info)) {
+		jbd2_journal_free_reserved(handle);
+		return ret;
+	}
+
+	handle->h_journal = NULL;
+	current->journal_info = handle;
+	/*
+	 * GFP_NOFS is here because callers are likely from writeback or
+	 * similarly constrained call sites
+	 */
+	ret = start_this_handle(journal, handle, GFP_NOFS);
+	if (ret < 0) {
+		current->journal_info = NULL;
+		jbd2_journal_free_reserved(handle);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(jbd2_journal_start_reserved);
 
 /**
  * int jbd2_journal_extend() - extend buffer credits.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index ad4b3bb..b3c1283 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -410,8 +410,12 @@ struct jbd2_revoke_table_s;
 
 struct jbd2_journal_handle
 {
-	/* Which compound transaction is this update a part of? */
-	transaction_t		*h_transaction;
+	union {
+		/* Which compound transaction is this update a part of? */
+		transaction_t	*h_transaction;
+		/* Which journal this handle belongs to - used iff h_reserved is set */
+		journal_t	*h_journal;
+	};
 
 	/* Number of remaining buffers we are allowed to dirty: */
 	int			h_buffer_credits;
@@ -426,6 +430,7 @@ struct jbd2_journal_handle
 	/* Flags [no locking] */
 	unsigned int	h_sync:		1;	/* sync-on-close */
 	unsigned int	h_jdata:	1;	/* force data journaling */
+	unsigned int	h_reserved:	1;	/* handle with reserved credits */
 	unsigned int	h_aborted:	1;	/* fatal error on handle */
 	unsigned int	h_type:		8;	/* for handle statistics */
 	unsigned int	h_line_no:	16;	/* for handle statistics */
@@ -689,6 +694,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
  * @j_wait_done_commit: Wait queue for waiting for commit to complete
  * @j_wait_commit: Wait queue to trigger commit
  * @j_wait_updates: Wait queue to wait for updates to complete
+ * @j_wait_reserved: Wait queue to wait for reserved buffer credits to drop
  * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
  * @j_head: Journal head - identifies the first unused block in the journal
  * @j_tail: Journal tail - identifies the oldest still-used block in the
@@ -702,6 +708,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
  *     journal
  * @j_fs_dev: Device which holds the client fs.  For internal journal this will
  *     be equal to j_dev
+ * @j_reserved_credits: Number of buffers reserved from the running transaction
  * @j_maxlen: Total maximum capacity of the journal region on disk.
  * @j_list_lock: Protects the buffer lists and internal buffer state.
  * @j_inode: Optional inode where we store the journal.  If present, all journal
@@ -800,6 +807,9 @@ struct journal_s
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for reserved buffer credits to drop */
+	wait_queue_head_t	j_wait_reserved;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct mutex		j_checkpoint_mutex;
 
@@ -854,6 +864,9 @@ struct journal_s
 	/* Total maximum capacity of the journal region on disk. */
 	unsigned int		j_maxlen;
 
+	/* Number of buffers reserved from the running transaction */
+	atomic_t		j_reserved_credits;
+
 	/*
 	 * Protects the buffer lists and internal buffer state.
 	 */
@@ -1094,6 +1107,10 @@ extern handle_t *jbd2__journal_start(journal_t *, int nblocks, gfp_t gfp_mask,
 				     unsigned int type, unsigned int line_no);
 extern int	 jbd2_journal_restart(handle_t *, int nblocks);
 extern int	 jbd2__journal_restart(handle_t *, int nblocks, gfp_t gfp_mask);
+extern handle_t *jbd2_journal_reserve(journal_t *, int nblocks,
+				      unsigned int type, unsigned int line_no);
+extern int	 jbd2_journal_start_reserved(handle_t *handle);
+extern void	 jbd2_journal_free_reserved(handle_t *handle);
 extern int	 jbd2_journal_extend (handle_t *, int nblocks);
 extern int	 jbd2_journal_get_write_access(handle_t *, struct buffer_head *);
 extern int	 jbd2_journal_get_create_access (handle_t *, struct buffer_head *);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 13/29] ext4: Provide wrappers for transaction reservation calls
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (11 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 12/29] jbd2: Transaction reservation support Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-05 11:51   ` Zheng Liu
  2013-05-05 11:58   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 14/29] ext4: Stop messing with nr_to_write in ext4_da_writepages() Jan Kara
                   ` (15 subsequent siblings)
  28 siblings, 2 replies; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.c         |   71 +++++++++++++++++++++++++++++++++++++-----
 fs/ext4/ext4_jbd2.h         |   13 ++++++++
 include/trace/events/ext4.h |   20 +++++++++++-
 3 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 7058975..b3e04bf 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -38,28 +38,40 @@ static void ext4_put_nojournal(handle_t *handle)
 /*
  * Wrappers for jbd2_journal_start/end.
  */
-handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
-				  int type, int nblocks)
+static int ext4_journal_check_start(struct super_block *sb)
 {
 	journal_t *journal;
 
-	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
 	if (sb->s_flags & MS_RDONLY)
-		return ERR_PTR(-EROFS);
-
+		return -EROFS;
 	WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
 	journal = EXT4_SB(sb)->s_journal;
-	if (!journal)
-		return ext4_get_nojournal();
 	/*
 	 * Special case here: if the journal has aborted behind our
 	 * backs (eg. EIO in the commit thread), then we still need to
 	 * take the FS itself readonly cleanly.
 	 */
-	if (is_journal_aborted(journal)) {
+	if (journal && is_journal_aborted(journal)) {
 		ext4_abort(sb, "Detected aborted journal");
-		return ERR_PTR(-EROFS);
+		return -EROFS;
 	}
+	return 0;
+}
+
+handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
+				  int type, int nblocks)
+{
+	journal_t *journal;
+	int err;
+
+	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
+	err = ext4_journal_check_start(sb);
+	if (err < 0)
+		return ERR_PTR(err);
+
+	journal = EXT4_SB(sb)->s_journal;
+	if (!journal)
+		return ext4_get_nojournal();
 	return jbd2__journal_start(journal, nblocks, GFP_NOFS, type, line);
 }
 
@@ -84,6 +96,47 @@ int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
 	return err;
 }
 
+handle_t *__ext4_journal_reserve(struct inode *inode, unsigned int line,
+				 int type, int nblocks)
+{
+	struct super_block *sb = inode->i_sb;
+	journal_t *journal;
+	int err;
+
+	trace_ext4_journal_reserve(sb, nblocks, _RET_IP_);
+	err = ext4_journal_check_start(sb);
+	if (err < 0)
+		return ERR_PTR(err);
+
+	journal = EXT4_SB(sb)->s_journal;
+	if (!journal)
+		return (handle_t *)1;	/* Hack to return !NULL */
+	return jbd2_journal_reserve(journal, nblocks, type, line);
+}
+
+handle_t *ext4_journal_start_reserved(handle_t *handle)
+{
+	struct super_block *sb;
+	int err;
+
+	if (!ext4_handle_valid(handle))
+		return ext4_get_nojournal();
+
+	sb = handle->h_journal->j_private;
+	trace_ext4_journal_start_reserved(sb, handle->h_buffer_credits,
+					  _RET_IP_);
+	err = ext4_journal_check_start(sb);
+	if (err < 0) {
+		jbd2_journal_free_reserved(handle);
+		return ERR_PTR(err);
+	}
+
+	err = jbd2_journal_start_reserved(handle);
+	if (err < 0)
+		return ERR_PTR(err);
+	return handle;
+}
+
 void ext4_journal_abort_handle(const char *caller, unsigned int line,
 			       const char *err_fn, struct buffer_head *bh,
 			       handle_t *handle, int err)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 4c216b1..bb17931 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -309,6 +309,19 @@ static inline handle_t *__ext4_journal_start(struct inode *inode,
 #define ext4_journal_stop(handle) \
 	__ext4_journal_stop(__func__, __LINE__, (handle))
 
+#define ext4_journal_reserve(inode, type, nblocks)			\
+	__ext4_journal_reserve((inode), __LINE__, (type), (nblocks))
+
+handle_t *__ext4_journal_reserve(struct inode *inode, unsigned int line,
+				 int type, int nblocks);
+handle_t *ext4_journal_start_reserved(handle_t *handle);
+
+static inline void ext4_journal_free_reserved(handle_t *handle)
+{
+	if (ext4_handle_valid(handle))
+		jbd2_journal_free_reserved(handle);
+}
+
 static inline handle_t *ext4_journal_current_handle(void)
 {
 	return journal_current_handle();
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 4ee4710..a601bb3 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1645,7 +1645,7 @@ TRACE_EVENT(ext4_load_inode,
 		  (unsigned long) __entry->ino)
 );
 
-TRACE_EVENT(ext4_journal_start,
+DECLARE_EVENT_CLASS(ext4_journal_start_class,
 	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
 
 	TP_ARGS(sb, nblocks, IP),
@@ -1667,6 +1667,24 @@ TRACE_EVENT(ext4_journal_start,
 		  __entry->nblocks, (void *)__entry->ip)
 );
 
+DEFINE_EVENT(ext4_journal_start_class, ext4_journal_start,
+	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
+
+	TP_ARGS(sb, nblocks, IP)
+);
+
+DEFINE_EVENT(ext4_journal_start_class, ext4_journal_reserve,
+	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
+
+	TP_ARGS(sb, nblocks, IP)
+);
+
+DEFINE_EVENT(ext4_journal_start_class, ext4_journal_start_reserved,
+	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
+
+	TP_ARGS(sb, nblocks, IP)
+);
+
 DECLARE_EVENT_CLASS(ext4__trim,
 	TP_PROTO(struct super_block *sb,
 		 ext4_group_t group,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 14/29] ext4: Stop messing with nr_to_write in ext4_da_writepages()
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (12 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 13/29] ext4: Provide wrappers for transaction reservation calls Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-05 12:40   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 15/29] ext4: Deprecate max_writeback_mb_bump sysfs attribute Jan Kara
                   ` (14 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

The writeback code has become better at submitting IO and the number of
pages requested to be written is now usually higher than the original
1024. The number is dynamically computed based on observed throughput
and is set to about 0.5 s worth of writeback; e.g. on an ordinary SATA
drive this ends up somewhere around 10000 in my testing. So remove the
now-unnecessary smarts from ext4_da_writepages().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |   96 -------------------------------------------------------
 1 files changed, 0 insertions(+), 96 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ba07412..f4dc4a1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -423,66 +423,6 @@ static int __check_block_validity(struct inode *inode, const char *func,
 	__check_block_validity((inode), __func__, __LINE__, (map))
 
 /*
- * Return the number of contiguous dirty pages in a given inode
- * starting at page frame idx.
- */
-static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
-				    unsigned int max_pages)
-{
-	struct address_space *mapping = inode->i_mapping;
-	pgoff_t	index;
-	struct pagevec pvec;
-	pgoff_t num = 0;
-	int i, nr_pages, done = 0;
-
-	if (max_pages == 0)
-		return 0;
-	pagevec_init(&pvec, 0);
-	while (!done) {
-		index = idx;
-		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-					      PAGECACHE_TAG_DIRTY,
-					      (pgoff_t)PAGEVEC_SIZE);
-		if (nr_pages == 0)
-			break;
-		for (i = 0; i < nr_pages; i++) {
-			struct page *page = pvec.pages[i];
-			struct buffer_head *bh, *head;
-
-			lock_page(page);
-			if (unlikely(page->mapping != mapping) ||
-			    !PageDirty(page) ||
-			    PageWriteback(page) ||
-			    page->index != idx) {
-				done = 1;
-				unlock_page(page);
-				break;
-			}
-			if (page_has_buffers(page)) {
-				bh = head = page_buffers(page);
-				do {
-					if (!buffer_delay(bh) &&
-					    !buffer_unwritten(bh))
-						done = 1;
-					bh = bh->b_this_page;
-				} while (!done && (bh != head));
-			}
-			unlock_page(page);
-			if (done)
-				break;
-			idx++;
-			num++;
-			if (num >= max_pages) {
-				done = 1;
-				break;
-			}
-		}
-		pagevec_release(&pvec);
-	}
-	return num;
-}
-
-/*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
  *
@@ -2334,10 +2274,8 @@ static int ext4_da_writepages(struct address_space *mapping,
 	struct mpage_da_data mpd;
 	struct inode *inode = mapping->host;
 	int pages_written = 0;
-	unsigned int max_pages;
 	int range_cyclic, cycled = 1, io_done = 0;
 	int needed_blocks, ret = 0;
-	long desired_nr_to_write, nr_to_writebump = 0;
 	loff_t range_start = wbc->range_start;
 	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
 	pgoff_t done_index = 0;
@@ -2384,39 +2322,6 @@ static int ext4_da_writepages(struct address_space *mapping,
 		end = wbc->range_end >> PAGE_CACHE_SHIFT;
 	}
 
-	/*
-	 * This works around two forms of stupidity.  The first is in
-	 * the writeback code, which caps the maximum number of pages
-	 * written to be 1024 pages.  This is wrong on multiple
-	 * levels; different architectues have a different page size,
-	 * which changes the maximum amount of data which gets
-	 * written.  Secondly, 4 megabytes is way too small.  XFS
-	 * forces this value to be 16 megabytes by multiplying
-	 * nr_to_write parameter by four, and then relies on its
-	 * allocator to allocate larger extents to make them
-	 * contiguous.  Unfortunately this brings us to the second
-	 * stupidity, which is that ext4's mballoc code only allocates
-	 * at most 2048 blocks.  So we force contiguous writes up to
-	 * the number of dirty blocks in the inode, or
-	 * sbi->max_writeback_mb_bump whichever is smaller.
-	 */
-	max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
-	if (!range_cyclic && range_whole) {
-		if (wbc->nr_to_write == LONG_MAX)
-			desired_nr_to_write = wbc->nr_to_write;
-		else
-			desired_nr_to_write = wbc->nr_to_write * 8;
-	} else
-		desired_nr_to_write = ext4_num_dirty_pages(inode, index,
-							   max_pages);
-	if (desired_nr_to_write > max_pages)
-		desired_nr_to_write = max_pages;
-
-	if (wbc->nr_to_write < desired_nr_to_write) {
-		nr_to_writebump = desired_nr_to_write - wbc->nr_to_write;
-		wbc->nr_to_write = desired_nr_to_write;
-	}

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 15/29] ext4: Deprecate max_writeback_mb_bump sysfs attribute
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (13 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 14/29] ext4: Stop messing with nr_to_write in ext4_da_writepages() Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-05 12:47   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 16/29] ext4: Improve writepage credit estimate for files with indirect blocks Jan Kara
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

This attribute is now unused so deprecate it. We still show the old
default value to keep some compatibility but we don't allow writing to
that attribute anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  |    1 -
 fs/ext4/super.c |   30 ++++++++++++++++++++++++------
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index edf9b9e..3575fdb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1234,7 +1234,6 @@ struct ext4_sb_info {
 	unsigned int s_mb_stats;
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
-	unsigned int s_max_writeback_mb_bump;
 	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
 	unsigned long s_mb_last_group;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 34e8552..09ff724 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2374,7 +2374,10 @@ struct ext4_attr {
 	ssize_t (*show)(struct ext4_attr *, struct ext4_sb_info *, char *);
 	ssize_t (*store)(struct ext4_attr *, struct ext4_sb_info *,
 			 const char *, size_t);
-	int offset;
+	union {
+		int offset;
+		int deprecated_val;
+	} u;
 };
 
 static int parse_strtoul(const char *buf,
@@ -2443,7 +2446,7 @@ static ssize_t inode_readahead_blks_store(struct ext4_attr *a,
 static ssize_t sbi_ui_show(struct ext4_attr *a,
 			   struct ext4_sb_info *sbi, char *buf)
 {
-	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->offset);
+	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->u.offset);
 
 	return snprintf(buf, PAGE_SIZE, "%u\n", *ui);
 }
@@ -2452,7 +2455,7 @@ static ssize_t sbi_ui_store(struct ext4_attr *a,
 			    struct ext4_sb_info *sbi,
 			    const char *buf, size_t count)
 {
-	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->offset);
+	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->u.offset);
 	unsigned long t;
 
 	if (parse_strtoul(buf, 0xffffffff, &t))
@@ -2478,12 +2481,20 @@ static ssize_t trigger_test_error(struct ext4_attr *a,
 	return count;
 }
 
+static ssize_t sbi_deprecated_show(struct ext4_attr *a,
+				   struct ext4_sb_info *sbi, char *buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", a->u.deprecated_val);
+}
+
 #define EXT4_ATTR_OFFSET(_name,_mode,_show,_store,_elname) \
 static struct ext4_attr ext4_attr_##_name = {			\
 	.attr = {.name = __stringify(_name), .mode = _mode },	\
 	.show	= _show,					\
 	.store	= _store,					\
-	.offset = offsetof(struct ext4_sb_info, _elname),	\
+	.u = {							\
+		.offset = offsetof(struct ext4_sb_info, _elname),\
+	},							\
 }
 #define EXT4_ATTR(name, mode, show, store) \
 static struct ext4_attr ext4_attr_##name = __ATTR(name, mode, show, store)
@@ -2494,6 +2505,14 @@ static struct ext4_attr ext4_attr_##name = __ATTR(name, mode, show, store)
 #define EXT4_RW_ATTR_SBI_UI(name, elname)	\
 	EXT4_ATTR_OFFSET(name, 0644, sbi_ui_show, sbi_ui_store, elname)
 #define ATTR_LIST(name) &ext4_attr_##name.attr
+#define EXT4_DEPRECATED_ATTR(_name, _val)	\
+static struct ext4_attr ext4_attr_##_name = {			\
+	.attr = {.name = __stringify(_name), .mode = 0444 },	\
+	.show	= sbi_deprecated_show,				\
+	.u = {							\
+		.deprecated_val = _val,				\
+	},							\
+}
 
 EXT4_RO_ATTR(delayed_allocation_blocks);
 EXT4_RO_ATTR(session_write_kbytes);
@@ -2507,7 +2526,7 @@ EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
 EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
 EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
 EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
-EXT4_RW_ATTR_SBI_UI(max_writeback_mb_bump, s_max_writeback_mb_bump);
+EXT4_DEPRECATED_ATTR(max_writeback_mb_bump, 128);
 EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
 EXT4_ATTR(trigger_fs_error, 0200, NULL, trigger_test_error);
 
@@ -3718,7 +3737,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	}
 
 	sbi->s_stripe = ext4_get_stripe_size(sbi);
-	sbi->s_max_writeback_mb_bump = 128;
 	sbi->s_extent_max_zeroout_kb = 32;
 
 	/* Register extent status tree shrinker */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 16/29] ext4: Improve writepage credit estimate for files with indirect blocks
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (14 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 15/29] ext4: Deprecate max_writeback_mb_bump sysfs attribute Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-07  5:39   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 17/29] ext4: Better estimate credits needed for ext4_da_writepages() Jan Kara
                   ` (12 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

ext4_ind_trans_blocks() used the 'chunk' argument to decide whether the
mapped blocks are logically contiguous. That is wrong since the argument
actually tells whether the blocks are physically contiguous. As the
blocks being mapped are always logically contiguous, and that is all
ext4_ind_trans_blocks() cares about, just remove the 'chunk' argument.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h     |    2 +-
 fs/ext4/indirect.c |   27 +++++++++------------------
 fs/ext4/inode.c    |    2 +-
 3 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3575fdb..d3a54f2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2093,7 +2093,7 @@ extern ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 				const struct iovec *iov, loff_t offset,
 				unsigned long nr_segs);
 extern int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock);
-extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks, int chunk);
+extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks);
 extern void ext4_ind_truncate(struct inode *inode);
 extern int ext4_ind_punch_hole(struct file *file, loff_t offset, loff_t length);
 
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index b505a14..197b202 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -913,27 +913,18 @@ int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock)
 	return (blk_bits / EXT4_ADDR_PER_BLOCK_BITS(inode->i_sb)) + 1;
 }
 
-int ext4_ind_trans_blocks(struct inode *inode, int nrblocks, int chunk)
+/*
+ * Calculate number of indirect blocks touched by mapping @nrblocks logically
+ * contiguous blocks
+ */
+int ext4_ind_trans_blocks(struct inode *inode, int nrblocks)
 {
-	int indirects;
-
-	/* if nrblocks are contiguous */
-	if (chunk) {
-		/*
-		 * With N contiguous data blocks, we need at most
-		 * N/EXT4_ADDR_PER_BLOCK(inode->i_sb) + 1 indirect blocks,
-		 * 2 dindirect blocks, and 1 tindirect block
-		 */
-		return DIV_ROUND_UP(nrblocks,
-				    EXT4_ADDR_PER_BLOCK(inode->i_sb)) + 4;
-	}
 	/*
-	 * if nrblocks are not contiguous, worse case, each block touch
-	 * a indirect block, and each indirect block touch a double indirect
-	 * block, plus a triple indirect block
+	 * With N contiguous data blocks, we need at most
+	 * N/EXT4_ADDR_PER_BLOCK(inode->i_sb) + 1 indirect blocks,
+	 * 2 dindirect blocks, and 1 tindirect block
 	 */
-	indirects = nrblocks * 2 + 1;
-	return indirects;
+	return DIV_ROUND_UP(nrblocks, EXT4_ADDR_PER_BLOCK(inode->i_sb)) + 4;
 }
 
 /*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f4dc4a1..aa26f4c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4399,7 +4399,7 @@ int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
 static int ext4_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
 {
 	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
-		return ext4_ind_trans_blocks(inode, nrblocks, chunk);
+		return ext4_ind_trans_blocks(inode, nrblocks);
 	return ext4_ext_index_trans_blocks(inode, nrblocks, chunk);
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 17/29] ext4: Better estimate credits needed for ext4_da_writepages()
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (15 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 16/29] ext4: Improve writepage credit estimate for files with indirect blocks Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-07  6:33   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 18/29] ext4: Restructure writeback path Jan Kara
                   ` (11 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

We limit the number of blocks written in a single loop of
ext4_da_writepages() to 64 when the inode uses indirect blocks. That is
unnecessary as the credit estimate for mapping a logically contiguous
run of blocks is rather low even for an inode with indirect blocks. So
just lift this limitation and properly calculate the number of
necessary credits.

This better credit estimate will also later allow us to always write at
least a single page in one iteration.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |    3 +-
 fs/ext4/extents.c |   16 ++++++--------
 fs/ext4/inode.c   |   58 ++++++++++++++++++++--------------------------------
 3 files changed, 30 insertions(+), 47 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d3a54f2..a6f7331 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2555,8 +2555,7 @@ struct ext4_extent;
 
 extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
 extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
-extern int ext4_ext_index_trans_blocks(struct inode *inode, int nrblocks,
-				       int chunk);
+extern int ext4_ext_index_trans_blocks(struct inode *inode, int extents);
 extern int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			       struct ext4_map_blocks *map, int flags);
 extern void ext4_ext_truncate(struct inode *);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9dba80b..8064b71 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2269,17 +2269,15 @@ int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int nrblocks,
 }
 
 /*
- * How many index/leaf blocks need to change/allocate to modify nrblocks?
+ * How many index/leaf blocks need to change/allocate to add @extents extents?
  *
- * if nrblocks are fit in a single extent (chunk flag is 1), then
- * in the worse case, each tree level index/leaf need to be changed
- * if the tree split due to insert a new extent, then the old tree
- * index/leaf need to be updated too
+ * If we add a single extent, then in the worst case each tree level
+ * index/leaf needs to be changed in case the tree splits.
  *
- * If the nrblocks are discontiguous, they could cause
- * the whole tree split more than once, but this is really rare.
+ * If more extents are inserted, they could cause the whole tree to split
+ * more than once, but this is really rare.
  */
-int ext4_ext_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
+int ext4_ext_index_trans_blocks(struct inode *inode, int extents)
 {
 	int index;
 	int depth;
@@ -2290,7 +2288,7 @@ int ext4_ext_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
 
 	depth = ext_depth(inode);
 
-	if (chunk)
+	if (extents <= 1)
 		index = depth * 2;
 	else
 		index = depth * 3;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index aa26f4c..9cb4e75 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -137,6 +137,9 @@ static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh);
 static int ext4_discard_partial_page_buffers_no_lock(handle_t *handle,
 		struct inode *inode, struct page *page, loff_t from,
 		loff_t length, int flags);
+static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
+				  int pextents);
+
 
 /*
  * Test whether an inode is a fast symlink.
@@ -2075,28 +2078,18 @@ static int ext4_writepage(struct page *page,
 }
 
 /*
- * This is called via ext4_da_writepages() to
- * calculate the total number of credits to reserve to fit
- * a single extent allocation into a single transaction,
- * ext4_da_writpeages() will loop calling this before
- * the block allocation.
+ * Calculate the total number of credits to reserve for one writepages
+ * iteration. This is called from ext4_da_writepages(). We map an extent of
+ * up to MAX_WRITEPAGES_EXTENT_LEN blocks and then we go on and finish mapping
+ * the last partial page. So in total we can map MAX_WRITEPAGES_EXTENT_LEN +
+ * bpp - 1 blocks in bpp different extents.
  */
-
 static int ext4_da_writepages_trans_blocks(struct inode *inode)
 {
-	int max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
-
-	/*
-	 * With non-extent format the journal credit needed to
-	 * insert nrblocks contiguous block is dependent on
-	 * number of contiguous block. So we will limit
-	 * number of contiguous block to a sane value
-	 */
-	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) &&
-	    (max_blocks > EXT4_MAX_TRANS_DATA))
-		max_blocks = EXT4_MAX_TRANS_DATA;
+	int bpp = ext4_journal_blocks_per_page(inode);
 
-	return ext4_chunk_trans_blocks(inode, max_blocks);
+	return ext4_meta_trans_blocks(inode,
+				MAX_WRITEPAGES_EXTENT_LEN + bpp - 1, bpp);
 }
 
 /*
@@ -4396,11 +4389,12 @@ int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
 	return 0;
 }
 
-static int ext4_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
+static int ext4_index_trans_blocks(struct inode *inode, int lblocks,
+				   int pextents)
 {
 	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
-		return ext4_ind_trans_blocks(inode, nrblocks);
-	return ext4_ext_index_trans_blocks(inode, nrblocks, chunk);
+		return ext4_ind_trans_blocks(inode, lblocks);
+	return ext4_ext_index_trans_blocks(inode, pextents);
 }
 
 /*
@@ -4414,7 +4408,8 @@ static int ext4_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
  *
  * Also account for superblock, inode, quota and xattr blocks
  */
-static int ext4_meta_trans_blocks(struct inode *inode, int nrblocks, int chunk)
+static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
+				  int pextents)
 {
 	ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb);
 	int gdpblocks;
@@ -4422,14 +4417,10 @@ static int ext4_meta_trans_blocks(struct inode *inode, int nrblocks, int chunk)
 	int ret = 0;
 
 	/*
-	 * How many index blocks need to touch to modify nrblocks?
-	 * The "Chunk" flag indicating whether the nrblocks is
-	 * physically contiguous on disk
-	 *
-	 * For Direct IO and fallocate, they calls get_block to allocate
-	 * one single extent at a time, so they could set the "Chunk" flag
+	 * How many index blocks need to touch to map @lblocks logical blocks
+	 * to @pextents physical extents?
 	 */
-	idxblocks = ext4_index_trans_blocks(inode, nrblocks, chunk);
+	idxblocks = ext4_index_trans_blocks(inode, lblocks, pextents);
 
 	ret = idxblocks;
 
@@ -4437,12 +4428,7 @@ static int ext4_meta_trans_blocks(struct inode *inode, int nrblocks, int chunk)
 	 * Now let's see how many group bitmaps and group descriptors need
 	 * to account
 	 */
-	groups = idxblocks;
-	if (chunk)
-		groups += 1;
-	else
-		groups += nrblocks;

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 18/29] ext4: Restructure writeback path
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (16 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 17/29] ext4: Better estimate credits needed for ext4_da_writepages() Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  3:48   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 19/29] ext4: Remove buffer_uninit handling Jan Kara
                   ` (10 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

There are two issues with the current writeback path in ext4. First, when
blocksize < pagesize we do not necessarily map a complete page and thus may
not do any writeback in one iteration. We always map at least some blocks,
so we will eventually finish mapping the page; but if writeback races with
other operations on the file, forward progress is not really guaranteed.
Second, the current code structure makes it hard to associate all the bios
for a range of pages with a single io_end structure so that unwritten
extents can be converted after all the bios have finished. This will become
especially difficult later, when an io_end will be associated with a
reserved transaction handle.

We restructure the writeback path into a relatively simple loop which first
prepares an extent of pages, then maps one or more extents so that no page
is left partially mapped, and once a page is fully mapped submits it for
IO. We keep all the mapping and IO submission information in the
mpage_da_data structure to somewhat reduce stack usage. The resulting code
is somewhat shorter than the old one and hopefully also easier to read.
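
A toy model of the loop described above (plain C; the names, the page model
and the MAX_MAP_LEN limit are purely illustrative, where the real code works
on locked pages, buffer heads and reserved journal credits). Each mapping
call may cover fewer blocks than requested, just as ext4_map_blocks() may,
but the inner loop keeps mapping until the prepared range has no partially
mapped page, which is what guarantees forward progress:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_MAP_LEN 3	/* one mapping call covers at most this many pages */

/* Stand-in for ext4_map_blocks(): maps a prefix of the requested range
 * and may cover fewer pages than asked for. */
static int map_extent(bool *mapped, int start, int len)
{
	int n = len < MAX_MAP_LEN ? len : MAX_MAP_LEN;

	for (int i = 0; i < n; i++)
		mapped[start + i] = true;
	return n;
}

/* The restructured loop in miniature: prepare a range of pages, map one
 * or more extents until no page is left partially mapped, then "submit"
 * the whole range. Returns the number of pages submitted. */
static int writeback_range(bool *mapped, int npages)
{
	int pos = 0, submitted = 0;

	while (pos < npages) {
		int len = npages - pos, done = 0;

		while (done < len)
			done += map_extent(mapped, pos + done, len - done);
		submitted += len;
		pos += len;
	}
	return submitted;
}
```

The point of the sketch is the structure, not the data model: mapping and
submission state live in one place (mpage_da_data in the patch), and IO is
only submitted once a page is fully mapped.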

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h              |   15 -
 fs/ext4/inode.c             |  978 +++++++++++++++++++++----------------------
 include/trace/events/ext4.h |   64 ++-
 3 files changed, 508 insertions(+), 549 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a6f7331..cb1ba1c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -175,21 +175,6 @@ struct ext4_map_blocks {
 };
 
 /*
- * For delayed allocation tracking
- */
-struct mpage_da_data {
-	struct inode *inode;
-	sector_t b_blocknr;		/* start block number of extent */
-	size_t b_size;			/* size of extent */
-	unsigned long b_state;		/* state of the extent */
-	unsigned long first_page, next_page;	/* extent of pages */
-	struct writeback_control *wbc;
-	int io_done;
-	int pages_written;
-	int retval;
-};
-
-/*
  * Flags for ext4_io_end->flags
  */
 #define	EXT4_IO_END_UNWRITTEN	0x0001
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9cb4e75..5c191a3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1324,145 +1324,42 @@ static void ext4_da_page_release_reservation(struct page *page,
  * Delayed allocation stuff
  */
 
-/*
- * mpage_da_submit_io - walks through extent of pages and try to write
- * them with writepage() call back
- *
- * @mpd->inode: inode
- * @mpd->first_page: first page of the extent
- * @mpd->next_page: page after the last page of the extent
- *
- * By the time mpage_da_submit_io() is called we expect all blocks
- * to be allocated. this may be wrong if allocation failed.
- *
- * As pages are already locked by write_cache_pages(), we can't use it
- */
-static int mpage_da_submit_io(struct mpage_da_data *mpd,
-			      struct ext4_map_blocks *map)
-{
-	struct pagevec pvec;
-	unsigned long index, end;
-	int ret = 0, err, nr_pages, i;
-	struct inode *inode = mpd->inode;
-	struct address_space *mapping = inode->i_mapping;
-	loff_t size = i_size_read(inode);
-	unsigned int len, block_start;
-	struct buffer_head *bh, *page_bufs = NULL;
-	sector_t pblock = 0, cur_logical = 0;
-	struct ext4_io_submit io_submit;
-
-	BUG_ON(mpd->next_page <= mpd->first_page);
-	ext4_io_submit_init(&io_submit, mpd->wbc);
-	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
-	if (!io_submit.io_end)
-		return -ENOMEM;
+struct mpage_da_data {
+	struct inode *inode;
+	struct writeback_control *wbc;
+	pgoff_t first_page;	/* The first page to write */
+	pgoff_t next_page;	/* Current page to examine */
+	pgoff_t last_page;	/* Last page to examine */
 	/*
-	 * We need to start from the first_page to the next_page - 1
-	 * to make sure we also write the mapped dirty buffer_heads.
-	 * If we look at mpd->b_blocknr we would only be looking
-	 * at the currently mapped buffer_heads.
+	 * Extent to map - its start can be beyond first_page because preceding
+	 * pages may already be fully mapped. We somewhat abuse m_flags to store
+	 * whether the extent is delalloc or unwritten.
 	 */
-	index = mpd->first_page;
-	end = mpd->next_page - 1;
-
-	pagevec_init(&pvec, 0);
-	while (index <= end) {
-		nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
-		if (nr_pages == 0)
-			break;
-		for (i = 0; i < nr_pages; i++) {
-			int skip_page = 0;
-			struct page *page = pvec.pages[i];
-
-			index = page->index;
-			if (index > end)
-				break;
-
-			if (index == size >> PAGE_CACHE_SHIFT)
-				len = size & ~PAGE_CACHE_MASK;
-			else
-				len = PAGE_CACHE_SIZE;
-			if (map) {
-				cur_logical = index << (PAGE_CACHE_SHIFT -
-							inode->i_blkbits);
-				pblock = map->m_pblk + (cur_logical -
-							map->m_lblk);
-			}
-			index++;
-
-			BUG_ON(!PageLocked(page));
-			BUG_ON(PageWriteback(page));
-
-			bh = page_bufs = page_buffers(page);
-			block_start = 0;
-			do {
-				if (map && (cur_logical >= map->m_lblk) &&
-				    (cur_logical <= (map->m_lblk +
-						     (map->m_len - 1)))) {
-					if (buffer_delay(bh)) {
-						clear_buffer_delay(bh);
-						bh->b_blocknr = pblock;
-					}
-					if (buffer_unwritten(bh) ||
-					    buffer_mapped(bh))
-						BUG_ON(bh->b_blocknr != pblock);
-					if (map->m_flags & EXT4_MAP_UNINIT)
-						set_buffer_uninit(bh);
-					clear_buffer_unwritten(bh);
-				}
-
-				/*
-				 * skip page if block allocation undone and
-				 * block is dirty
-				 */
-				if (ext4_bh_delay_or_unwritten(NULL, bh))
-					skip_page = 1;
-				bh = bh->b_this_page;
-				block_start += bh->b_size;
-				cur_logical++;
-				pblock++;
-			} while (bh != page_bufs);
-
-			if (skip_page) {
-				unlock_page(page);
-				continue;
-			}
-
-			clear_page_dirty_for_io(page);
-			err = ext4_bio_write_page(&io_submit, page, len,
-						  mpd->wbc);
-			if (!err)
-				mpd->pages_written++;
-			/*
-			 * In error case, we have to continue because
-			 * remaining pages are still locked
-			 */
-			if (ret == 0)
-				ret = err;
-		}
-		pagevec_release(&pvec);
-	}
-	ext4_io_submit(&io_submit);
-	/* Drop io_end reference we got from init */
-	ext4_put_io_end_defer(io_submit.io_end);
-	return ret;
-}
+	struct ext4_map_blocks map;
+	struct ext4_io_submit io_submit;	/* IO submission data */
+};
 
-static void ext4_da_block_invalidatepages(struct mpage_da_data *mpd)
+static void mpage_release_unused_pages(struct mpage_da_data *mpd,
+				       bool invalidate)
 {
 	int nr_pages, i;
 	pgoff_t index, end;
 	struct pagevec pvec;
 	struct inode *inode = mpd->inode;
 	struct address_space *mapping = inode->i_mapping;
-	ext4_lblk_t start, last;
+
+	/* This is necessary when next_page == 0. */
+	if (mpd->first_page >= mpd->next_page)
+		return;
 
 	index = mpd->first_page;
 	end   = mpd->next_page - 1;
-
-	start = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
-	last = end << (PAGE_CACHE_SHIFT - inode->i_blkbits);
-	ext4_es_remove_extent(inode, start, last - start + 1);
+	if (invalidate) {
+		ext4_lblk_t start, last;
+		start = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+		last = end << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+		ext4_es_remove_extent(inode, start, last - start + 1);
+	}
 
 	pagevec_init(&pvec, 0);
 	while (index <= end) {
@@ -1475,14 +1372,15 @@ static void ext4_da_block_invalidatepages(struct mpage_da_data *mpd)
 				break;
 			BUG_ON(!PageLocked(page));
 			BUG_ON(PageWriteback(page));
-			block_invalidatepage(page, 0);
-			ClearPageUptodate(page);
+			if (invalidate) {
+				block_invalidatepage(page, 0);
+				ClearPageUptodate(page);
+			}
 			unlock_page(page);
 		}
 		index = pvec.pages[nr_pages - 1]->index + 1;
 		pagevec_release(&pvec);
 	}
-	return;
 }
 
 static void ext4_print_free_blocks(struct inode *inode)
@@ -1508,206 +1406,6 @@ static void ext4_print_free_blocks(struct inode *inode)
 	return;
 }
 
-/*
- * mpage_da_map_and_submit - go through given space, map them
- *       if necessary, and then submit them for I/O
- *
- * @mpd - bh describing space
- *
- * The function skips space we know is already mapped to disk blocks.
- *
- */
-static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
-{
-	int err, blks, get_blocks_flags;
-	struct ext4_map_blocks map, *mapp = NULL;
-	sector_t next = mpd->b_blocknr;
-	unsigned max_blocks = mpd->b_size >> mpd->inode->i_blkbits;
-	loff_t disksize = EXT4_I(mpd->inode)->i_disksize;
-	handle_t *handle = NULL;
-
-	/*
-	 * If the blocks are mapped already, or we couldn't accumulate
-	 * any blocks, then proceed immediately to the submission stage.
-	 */
-	if ((mpd->b_size == 0) ||
-	    ((mpd->b_state  & (1 << BH_Mapped)) &&
-	     !(mpd->b_state & (1 << BH_Delay)) &&
-	     !(mpd->b_state & (1 << BH_Unwritten))))
-		goto submit_io;
-
-	handle = ext4_journal_current_handle();
-	BUG_ON(!handle);
-
-	/*
-	 * Call ext4_map_blocks() to allocate any delayed allocation
-	 * blocks, or to convert an uninitialized extent to be
-	 * initialized (in the case where we have written into
-	 * one or more preallocated blocks).
-	 *
-	 * We pass in the magic EXT4_GET_BLOCKS_DELALLOC_RESERVE to
-	 * indicate that we are on the delayed allocation path.  This
-	 * affects functions in many different parts of the allocation
-	 * call path.  This flag exists primarily because we don't
-	 * want to change *many* call functions, so ext4_map_blocks()
-	 * will set the EXT4_STATE_DELALLOC_RESERVED flag once the
-	 * inode's allocation semaphore is taken.
-	 *
-	 * If the blocks in questions were delalloc blocks, set
-	 * EXT4_GET_BLOCKS_DELALLOC_RESERVE so the delalloc accounting
-	 * variables are updated after the blocks have been allocated.
-	 */
-	map.m_lblk = next;
-	map.m_len = max_blocks;
-	get_blocks_flags = EXT4_GET_BLOCKS_CREATE;
-	if (ext4_should_dioread_nolock(mpd->inode))
-		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
-	if (mpd->b_state & (1 << BH_Delay))
-		get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
-
-	blks = ext4_map_blocks(handle, mpd->inode, &map, get_blocks_flags);
-	if (blks < 0) {
-		struct super_block *sb = mpd->inode->i_sb;
-
-		err = blks;
-		/*
-		 * If get block returns EAGAIN or ENOSPC and there
-		 * appears to be free blocks we will just let
-		 * mpage_da_submit_io() unlock all of the pages.
-		 */
-		if (err == -EAGAIN)
-			goto submit_io;
-
-		if (err == -ENOSPC && ext4_count_free_clusters(sb)) {
-			mpd->retval = err;
-			goto submit_io;
-		}
-
-		/*
-		 * get block failure will cause us to loop in
-		 * writepages, because a_ops->writepage won't be able
-		 * to make progress. The page will be redirtied by
-		 * writepage and writepages will again try to write
-		 * the same.
-		 */
-		if (!(EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED)) {
-			ext4_msg(sb, KERN_CRIT,
-				 "delayed block allocation failed for inode %lu "
-				 "at logical offset %llu with max blocks %zd "
-				 "with error %d", mpd->inode->i_ino,
-				 (unsigned long long) next,
-				 mpd->b_size >> mpd->inode->i_blkbits, err);
-			ext4_msg(sb, KERN_CRIT,
-				"This should not happen!! Data will be lost");
-			if (err == -ENOSPC)
-				ext4_print_free_blocks(mpd->inode);
-		}
-		/* invalidate all the pages */
-		ext4_da_block_invalidatepages(mpd);
-
-		/* Mark this page range as having been completed */
-		mpd->io_done = 1;
-		return;
-	}
-	BUG_ON(blks == 0);
-
-	mapp = &map;
-	if (map.m_flags & EXT4_MAP_NEW) {
-		struct block_device *bdev = mpd->inode->i_sb->s_bdev;
-		int i;
-
-		for (i = 0; i < map.m_len; i++)
-			unmap_underlying_metadata(bdev, map.m_pblk + i);
-	}
-
-	/*
-	 * Update on-disk size along with block allocation.
-	 */
-	disksize = ((loff_t) next + blks) << mpd->inode->i_blkbits;
-	if (disksize > i_size_read(mpd->inode))
-		disksize = i_size_read(mpd->inode);
-	if (disksize > EXT4_I(mpd->inode)->i_disksize) {
-		ext4_update_i_disksize(mpd->inode, disksize);
-		err = ext4_mark_inode_dirty(handle, mpd->inode);
-		if (err)
-			ext4_error(mpd->inode->i_sb,
-				   "Failed to mark inode %lu dirty",
-				   mpd->inode->i_ino);
-	}
-
-submit_io:
-	mpage_da_submit_io(mpd, mapp);
-	mpd->io_done = 1;
-}
-
-#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | \
-		(1 << BH_Delay) | (1 << BH_Unwritten))
-
-/*
- * mpage_add_bh_to_extent - try to add one more block to extent of blocks
- *
- * @mpd->lbh - extent of blocks
- * @logical - logical number of the block in the file
- * @b_state - b_state of the buffer head added
- *
- * the function is used to collect contig. blocks in same state
- */
-static void mpage_add_bh_to_extent(struct mpage_da_data *mpd, sector_t logical,
-				   unsigned long b_state)
-{
-	sector_t next;
-	int blkbits = mpd->inode->i_blkbits;
-	int nrblocks = mpd->b_size >> blkbits;
-
-	/*
-	 * XXX Don't go larger than mballoc is willing to allocate
-	 * This is a stopgap solution.  We eventually need to fold
-	 * mpage_da_submit_io() into this function and then call
-	 * ext4_map_blocks() multiple times in a loop
-	 */
-	if (nrblocks >= (8*1024*1024 >> blkbits))
-		goto flush_it;
-
-	/* check if the reserved journal credits might overflow */
-	if (!ext4_test_inode_flag(mpd->inode, EXT4_INODE_EXTENTS)) {
-		if (nrblocks >= EXT4_MAX_TRANS_DATA) {
-			/*
-			 * With non-extent format we are limited by the journal
-			 * credit available.  Total credit needed to insert
-			 * nrblocks contiguous blocks is dependent on the
-			 * nrblocks.  So limit nrblocks.
-			 */
-			goto flush_it;
-		}
-	}
-	/*
-	 * First block in the extent
-	 */
-	if (mpd->b_size == 0) {
-		mpd->b_blocknr = logical;
-		mpd->b_size = 1 << blkbits;
-		mpd->b_state = b_state & BH_FLAGS;
-		return;
-	}
-
-	next = mpd->b_blocknr + nrblocks;
-	/*
-	 * Can we merge the block to our big extent?
-	 */
-	if (logical == next && (b_state & BH_FLAGS) == mpd->b_state) {
-		mpd->b_size += 1 << blkbits;
-		return;
-	}
-
-flush_it:
-	/*
-	 * We couldn't merge the block to our extent, so we
-	 * need to flush current  extent and start new one
-	 */
-	mpage_da_map_and_submit(mpd);
-	return;
-}
-
 static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh)
 {
 	return (buffer_delay(bh) || buffer_unwritten(bh)) && buffer_dirty(bh);
@@ -2077,6 +1775,301 @@ static int ext4_writepage(struct page *page,
 	return ret;
 }
 
+#define BH_FLAGS ((1 << BH_Unwritten) | (1 << BH_Delay))
+
+/*
+ * mballoc gives us at most this number of blocks...
+ * XXX: That seems to be only a limitation of ext4_mb_normalize_request().
+ * The rest of mballoc seems to handle chunks up to full group size.
+ */
+#define MAX_WRITEPAGES_EXTENT_LEN 2048
+
+/*
+ * mpage_add_bh_to_extent - try to add bh to extent of blocks to map
+ *
+ * @mpd - extent of blocks
+ * @lblk - logical number of the block in the file
+ * @b_state - b_state of the buffer head added
+ *
+ * The function is used to collect contiguous blocks in the same state.
+ */
+static int mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
+				  unsigned long b_state)
+{
+	struct ext4_map_blocks *map = &mpd->map;
+
+	/* Don't go larger than mballoc is willing to allocate */
+	if (map->m_len >= MAX_WRITEPAGES_EXTENT_LEN)
+		return 0;
+
+	/* First block in the extent? */
+	if (map->m_len == 0) {
+		map->m_lblk = lblk;
+		map->m_len = 1;
+		map->m_flags = b_state & BH_FLAGS;
+		return 1;
+	}
+
+	/* Can we merge the block to our big extent? */
+	if (lblk == map->m_lblk + map->m_len &&
+	    (b_state & BH_FLAGS) == map->m_flags) {
+		map->m_len++;
+		return 1;
+	}
+	return 0;
+}
+
+static bool add_page_bufs_to_extent(struct mpage_da_data *mpd,
+				    struct buffer_head *head,
+				    struct buffer_head *bh,
+				    ext4_lblk_t lblk)
+{
+	do {
+		BUG_ON(buffer_locked(bh));
+
+		if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
+		    (!buffer_delay(bh) && !buffer_unwritten(bh))) {
+			/* Found extent to map? */
+			if (mpd->map.m_len)
+				return false;
+			continue;
+		}
+		if (!mpage_add_bh_to_extent(mpd, lblk, bh->b_state))
+			return false;
+	} while (lblk++, (bh = bh->b_this_page) != head);
+	return true;
+}
+
+static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page)
+{
+	int len;
+	loff_t size = i_size_read(mpd->inode);
+	int err;
+
+	BUG_ON(page->index != mpd->first_page);
+	if (page->index == size >> PAGE_CACHE_SHIFT)
+		len = size & ~PAGE_CACHE_MASK;
+	else
+		len = PAGE_CACHE_SIZE;
+	clear_page_dirty_for_io(page);
+	err = ext4_bio_write_page(&mpd->io_submit, page, len, mpd->wbc);
+	if (!err)
+		mpd->wbc->nr_to_write--;
+	mpd->first_page++;
+
+	return err;
+}
+
+/*
+ * mpage_map_and_submit_buffers - update buffers corresponding to changed
+ *				  extent and submit fully mapped pages for IO
+ *
+ * @mpd - description of extent to map, on return next extent to map
+ *
+ * Scan buffers corresponding to changed extent (we expect corresponding pages
+ * to be already locked) and update buffer state according to new extent state.
+ * We map delalloc buffers to their physical location, clear unwritten bits,
+ * and mark buffers as uninit when we perform writes to uninitialized extents
+ * and do extent conversion after IO is finished. If the last page is not fully
+ * mapped, we update @map to the next extent in the last page that needs
+ * mapping. Otherwise we submit the page for IO.
+ */
+static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
+{
+	struct pagevec pvec;
+	int nr_pages, i;
+	struct inode *inode = mpd->inode;
+	struct buffer_head *head, *bh;
+	int bpp_bits = PAGE_CACHE_SHIFT - inode->i_blkbits;
+	pgoff_t start, end;
+	ext4_lblk_t lblk;
+	sector_t pblock;
+	int err;
+
+	start = mpd->map.m_lblk >> bpp_bits;
+	end = (mpd->map.m_lblk + mpd->map.m_len - 1) >> bpp_bits;
+	lblk = start << bpp_bits;
+	pblock = mpd->map.m_pblk;
+
+	pagevec_init(&pvec, 0);
+	while (start <= end) {
+		nr_pages = pagevec_lookup(&pvec, inode->i_mapping, start,
+					  PAGEVEC_SIZE);
+		if (nr_pages == 0)
+			break;
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (page->index > end)
+				break;
+			/* Up to 'end' pages must be contiguous */
+			BUG_ON(page->index != start);
+			bh = head = page_buffers(page);
+			do {
+				if (lblk < mpd->map.m_lblk)
+					continue;
+				if (lblk >= mpd->map.m_lblk + mpd->map.m_len) {
+					/*
+					 * Buffer after end of mapped extent.
+					 * Find next buffer in the page to map.
+					 */
+					mpd->map.m_len = 0;
+					mpd->map.m_flags = 0;
+					add_page_bufs_to_extent(mpd, head, bh,
+								lblk);
+					pagevec_release(&pvec);
+					return 0;
+				}
+				if (buffer_delay(bh)) {
+					clear_buffer_delay(bh);
+					bh->b_blocknr = pblock++;
+				}
+				if (mpd->map.m_flags & EXT4_MAP_UNINIT)
+					set_buffer_uninit(bh);
+				clear_buffer_unwritten(bh);
+			} while (lblk++, (bh = bh->b_this_page) != head);
+
+			/* Page fully mapped - let IO run! */
+			err = mpage_submit_page(mpd, page);
+			if (err < 0) {
+				pagevec_release(&pvec);
+				return err;
+			}
+			start++;
+		}
+		pagevec_release(&pvec);
+	}
+	/* Extent fully mapped and matches with page boundary. We are done. */
+	mpd->map.m_len = 0;
+	mpd->map.m_flags = 0;
+	return 0;
+}
+
+static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
+{
+	struct inode *inode = mpd->inode;
+	struct ext4_map_blocks *map = &mpd->map;
+	int get_blocks_flags;
+	int err;
+
+	trace_ext4_da_write_pages_extent(inode, map);
+	/*
+	 * Call ext4_map_blocks() to allocate any delayed allocation
+	 * blocks, or to convert an uninitialized extent to be
+	 * initialized (in the case where we have written into
+	 * one or more preallocated blocks).
+	 *
+	 * We pass in the magic EXT4_GET_BLOCKS_DELALLOC_RESERVE if the blocks
+	 * in question are delalloc blocks.  This affects functions in many
+	 * different parts of the allocation call path.  This flag exists
+	 * primarily because we don't want to change *many* call functions, so
+	 * ext4_map_blocks() will set the EXT4_STATE_DELALLOC_RESERVED flag
+	 * once the inode's allocation semaphore is taken.
+	 */
+	get_blocks_flags = EXT4_GET_BLOCKS_CREATE;
+	if (ext4_should_dioread_nolock(inode))
+		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
+	if (map->m_flags & (1 << BH_Delay))
+		get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
+
+	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
+	if (err < 0)
+		return err;
+
+	BUG_ON(map->m_len == 0);
+	if (map->m_flags & EXT4_MAP_NEW) {
+		struct block_device *bdev = inode->i_sb->s_bdev;
+		int i;
+
+		for (i = 0; i < map->m_len; i++)
+			unmap_underlying_metadata(bdev, map->m_pblk + i);
+	}
+	return 0;
+}
+
+/*
+ * mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
+ *				 mpd->len and submit pages underlying it for IO
+ *
+ * @handle - handle for journal operations
+ * @mpd - extent to map
+ *
+ * The function maps the extent starting at mpd->lblk of length mpd->len.
+ * If it is delayed, blocks are allocated; if it is unwritten, we may need
+ * to convert it to initialized or split the described range from a larger
+ * unwritten extent. Note that we need not map all of the described range,
+ * since allocation can return fewer blocks or the range may be covered by
+ * more unwritten extents. We cannot map more because we are limited by
+ * reserved transaction credits. On the other hand, we always make sure the
+ * last touched page is fully mapped so that it can be written out (and
+ * thus forward progress is guaranteed) and submit all mapped pages for IO.
+ */
+static int mpage_map_and_submit_extent(handle_t *handle,
+				       struct mpage_da_data *mpd)
+{
+	struct inode *inode = mpd->inode;
+	struct ext4_map_blocks *map = &mpd->map;
+	int err;
+	loff_t disksize;
+
+	while (map->m_len) {
+		err = mpage_map_one_extent(handle, mpd);
+		if (err < 0) {
+			struct super_block *sb = inode->i_sb;
+
+			/*
+			 * Need to commit transaction to free blocks. Let upper
+			 * layers sort it out.
+			 */
+			if (err == -ENOSPC && ext4_count_free_clusters(sb))
+				return -ENOSPC;
+
+			if (!(EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED)) {
+				ext4_msg(sb, KERN_CRIT,
+					 "Delayed block allocation failed for "
+					 "inode %lu at logical offset %llu with"
+					 " max blocks %u with error %d",
+					 inode->i_ino,
+					 (unsigned long long)map->m_lblk,
+					 (unsigned)map->m_len, err);
+				ext4_msg(sb, KERN_CRIT,
+					 "This should not happen!! Data will "
+					 "be lost\n");
+				if (err == -ENOSPC)
+					ext4_print_free_blocks(inode);
+			}
+			/* invalidate all the pages */
+			mpage_release_unused_pages(mpd, true);
+			return err;
+		}
+		/*
+		 * Update buffer state, submit mapped pages, and get us new
+		 * extent to map
+		 */
+		err = mpage_map_and_submit_buffers(mpd);
+		if (err < 0)
+			return err;
+	}
+
+	/* Update on-disk size after IO is submitted */
+	disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT;
+	if (disksize > i_size_read(inode))
+		disksize = i_size_read(inode);
+	if (disksize > EXT4_I(inode)->i_disksize) {
+		int err2;
+
+		ext4_update_i_disksize(inode, disksize);
+		err2 = ext4_mark_inode_dirty(handle, inode);
+		if (err2)
+			ext4_error(inode->i_sb,
+				   "Failed to mark inode %lu dirty",
+				   inode->i_ino);
+		if (!err)
+			err = err2;
+	}
+	return err;
+}
+
 /*
  * Calculate the total number of credits to reserve for one writepages
  * iteration. This is called from ext4_da_writepages(). We map an extent of
@@ -2093,44 +2086,44 @@ static int ext4_da_writepages_trans_blocks(struct inode *inode)
 }
 
 /*
- * write_cache_pages_da - walk the list of dirty pages of the given
- * address space and accumulate pages that need writing, and call
- * mpage_da_map_and_submit to map a single contiguous memory region
- * and then write them.
+ * mpage_prepare_extent_to_map - find & lock contiguous range of dirty pages
+ * 				 and underlying extent to map
+ *
+ * @mpd - where to look for pages
+ *
+ * Walk dirty pages in the mapping while they are contiguous and lock them.
+ * While pages are fully mapped, submit them for IO. When we find a page
+ * which isn't mapped, we start accumulating an extent of buffers underlying
+ * these pages that needs mapping (formed by either delayed or unwritten
+ * buffers). The extent found is returned in the @mpd structure (starting at
+ * mpd->lblk with length mpd->len blocks).
  */
-static int write_cache_pages_da(handle_t *handle,
-				struct address_space *mapping,
-				struct writeback_control *wbc,
-				struct mpage_da_data *mpd,
-				pgoff_t *done_index)
+static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 {
-	struct buffer_head	*bh, *head;
-	struct inode		*inode = mapping->host;
-	struct pagevec		pvec;
-	unsigned int		nr_pages;
-	sector_t		logical;
-	pgoff_t			index, end;
-	long			nr_to_write = wbc->nr_to_write;
-	int			i, tag, ret = 0;
-
-	memset(mpd, 0, sizeof(struct mpage_da_data));
-	mpd->wbc = wbc;
-	mpd->inode = inode;
-	pagevec_init(&pvec, 0);
-	index = wbc->range_start >> PAGE_CACHE_SHIFT;
-	end = wbc->range_end >> PAGE_CACHE_SHIFT;
+	struct address_space *mapping = mpd->inode->i_mapping;
+	struct pagevec pvec;
+	unsigned int nr_pages;
+	pgoff_t index = mpd->first_page;
+	pgoff_t end = mpd->last_page;
+	bool first_page_found = false;
+	int tag;
+	int i, err = 0;
+	int blkbits = mpd->inode->i_blkbits;
+	ext4_lblk_t lblk;
+	struct buffer_head *head;
 
-	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
+	if (mpd->wbc->sync_mode == WB_SYNC_ALL || mpd->wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
 
-	*done_index = index;
+	mpd->map.m_len = 0;
+	mpd->next_page = index;
 	while (index <= end) {
 		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
-			return 0;
+			goto out;
 
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
@@ -2145,31 +2138,21 @@ static int write_cache_pages_da(handle_t *handle,
 			if (page->index > end)
 				goto out;
 
-			*done_index = page->index + 1;
-
-			/*
-			 * If we can't merge this page, and we have
-			 * accumulated an contiguous region, write it
-			 */
-			if ((mpd->next_page != page->index) &&
-			    (mpd->next_page != mpd->first_page)) {
-				mpage_da_map_and_submit(mpd);
-				goto ret_extent_tail;
-			}
+			/* If we can't merge this page, we are done. */
+			if (first_page_found && mpd->next_page != page->index)
+				goto out;
 
 			lock_page(page);
-
 			/*
-			 * If the page is no longer dirty, or its
-			 * mapping no longer corresponds to inode we
-			 * are writing (which means it has been
-			 * truncated or invalidated), or the page is
-			 * already under writeback and we are not
-			 * doing a data integrity writeback, skip the page
+			 * If the page is no longer dirty, or its mapping no
+			 * longer corresponds to inode we are writing (which
+			 * means it has been truncated or invalidated), or the
+			 * page is already under writeback and we are not doing
+			 * a data integrity writeback, skip the page
 			 */
 			if (!PageDirty(page) ||
 			    (PageWriteback(page) &&
-			     (wbc->sync_mode == WB_SYNC_NONE)) ||
+			     (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
 			    unlikely(page->mapping != mapping)) {
 				unlock_page(page);
 				continue;
@@ -2178,101 +2161,60 @@ static int write_cache_pages_da(handle_t *handle,
 			wait_on_page_writeback(page);
 			BUG_ON(PageWriteback(page));
 
-			/*
-			 * If we have inline data and arrive here, it means that
-			 * we will soon create the block for the 1st page, so
-			 * we'd better clear the inline data here.
-			 */
-			if (ext4_has_inline_data(inode)) {
-				BUG_ON(ext4_test_inode_state(inode,
-						EXT4_STATE_MAY_INLINE_DATA));
-				ext4_destroy_inline_data(handle, inode);
-			}
-
-			if (mpd->next_page != page->index)
+			if (!first_page_found) {
 				mpd->first_page = page->index;
+				first_page_found = true;
+			}
 			mpd->next_page = page->index + 1;
-			logical = (sector_t) page->index <<
-				(PAGE_CACHE_SHIFT - inode->i_blkbits);
+			lblk = ((ext4_lblk_t)page->index) <<
+				(PAGE_CACHE_SHIFT - blkbits);
 
 			/* Add all dirty buffers to mpd */
 			head = page_buffers(page);
-			bh = head;
-			do {
-				BUG_ON(buffer_locked(bh));
-				/*
-				 * We need to try to allocate unmapped blocks
-				 * in the same page.  Otherwise we won't make
-				 * progress with the page in ext4_writepage
-				 */
-				if (ext4_bh_delay_or_unwritten(NULL, bh)) {
-					mpage_add_bh_to_extent(mpd, logical,
-							       bh->b_state);
-					if (mpd->io_done)
-						goto ret_extent_tail;
-				} else if (buffer_dirty(bh) &&
-					   buffer_mapped(bh)) {
-					/*
-					 * mapped dirty buffer. We need to
-					 * update the b_state because we look
-					 * at b_state in mpage_da_map_blocks.
-					 * We don't update b_size because if we
-					 * find an unmapped buffer_head later
-					 * we need to use the b_state flag of
-					 * that buffer_head.
-					 */
-					if (mpd->b_size == 0)
-						mpd->b_state =
-							bh->b_state & BH_FLAGS;
-				}
-				logical++;
-			} while ((bh = bh->b_this_page) != head);
-
-			if (nr_to_write > 0) {
-				nr_to_write--;
-				if (nr_to_write == 0 &&
-				    wbc->sync_mode == WB_SYNC_NONE)
-					/*
-					 * We stop writing back only if we are
-					 * not doing integrity sync. In case of
-					 * integrity sync we have to keep going
-					 * because someone may be concurrently
-					 * dirtying pages, and we might have
-					 * synced a lot of newly appeared dirty
-					 * pages, but have not synced all of the
-					 * old dirty pages.
-					 */
+			if (!add_page_bufs_to_extent(mpd, head, head, lblk))
+				goto out;
+			/* So far everything mapped? Submit the page for IO. */
+			if (mpd->map.m_len == 0) {
+				err = mpage_submit_page(mpd, page);
+				if (err < 0)
 					goto out;
 			}
+
+			/*
+			 * Accumulated enough dirty pages? This doesn't apply
+			 * to WB_SYNC_ALL mode. For integrity sync we have to
+			 * keep going because someone may be concurrently
+			 * dirtying pages, and we might have synced a lot of
+			 * newly appeared dirty pages, but have not synced all
+			 * of the old dirty pages.
+			 */
+			if (mpd->wbc->sync_mode == WB_SYNC_NONE &&
+			    mpd->next_page - mpd->first_page >=
+							mpd->wbc->nr_to_write)
+				goto out;
 		}
 		pagevec_release(&pvec);
 		cond_resched();
 	}
 	return 0;
-ret_extent_tail:
-	ret = MPAGE_DA_EXTENT_TAIL;
 out:
 	pagevec_release(&pvec);
-	cond_resched();
-	return ret;
+	return err;
 }
 
-
 static int ext4_da_writepages(struct address_space *mapping,
 			      struct writeback_control *wbc)
 {
-	pgoff_t	index;
+	pgoff_t	writeback_index = 0;
+	long nr_to_write = wbc->nr_to_write;
 	int range_whole = 0;
+	int cycled = 1;
 	handle_t *handle = NULL;
 	struct mpage_da_data mpd;
 	struct inode *inode = mapping->host;
-	int pages_written = 0;
-	int range_cyclic, cycled = 1, io_done = 0;
 	int needed_blocks, ret = 0;
-	loff_t range_start = wbc->range_start;
 	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
-	pgoff_t done_index = 0;
-	pgoff_t end;
+	bool done;
 	struct blk_plug plug;
 
 	trace_ext4_da_writepages(inode, wbc);
@@ -2298,40 +2240,65 @@ static int ext4_da_writepages(struct address_space *mapping,
 	if (unlikely(sbi->s_mount_flags & EXT4_MF_FS_ABORTED))
 		return -EROFS;
 
+	/*
+	 * If we have inline data and arrive here, it means that
+	 * we will soon create the block for the 1st page, so
+	 * we'd better clear the inline data here.
+	 */
+	if (ext4_has_inline_data(inode)) {
+		/* Just inode will be modified... */
+		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			goto out_writepages;
+		}
+		BUG_ON(ext4_test_inode_state(inode,
+				EXT4_STATE_MAY_INLINE_DATA));
+		ext4_destroy_inline_data(handle, inode);
+		ext4_journal_stop(handle);
+	}
+
 	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 		range_whole = 1;
 
-	range_cyclic = wbc->range_cyclic;
 	if (wbc->range_cyclic) {
-		index = mapping->writeback_index;
-		if (index)
+		writeback_index = mapping->writeback_index;
+		if (writeback_index)
 			cycled = 0;
-		wbc->range_start = index << PAGE_CACHE_SHIFT;
-		wbc->range_end  = LLONG_MAX;
-		wbc->range_cyclic = 0;
-		end = -1;
+		mpd.first_page = writeback_index;
+		mpd.last_page = -1;
 	} else {
-		index = wbc->range_start >> PAGE_CACHE_SHIFT;
-		end = wbc->range_end >> PAGE_CACHE_SHIFT;
+		mpd.first_page = wbc->range_start >> PAGE_CACHE_SHIFT;
+		mpd.last_page = wbc->range_end >> PAGE_CACHE_SHIFT;
 	}
 
+	mpd.inode = inode;
+	mpd.wbc = wbc;
+	ext4_io_submit_init(&mpd.io_submit, wbc);
 retry:
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
-		tag_pages_for_writeback(mapping, index, end);
-
+		tag_pages_for_writeback(mapping, mpd.first_page, mpd.last_page);
+	done = false;
 	blk_start_plug(&plug);
-	while (!ret && wbc->nr_to_write > 0) {
+	while (!done && mpd.first_page <= mpd.last_page) {
+		/* For each extent of pages we use new io_end */
+		mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
+		if (!mpd.io_submit.io_end) {
+			ret = -ENOMEM;
+			break;
+		}
 
 		/*
-		 * we  insert one extent at a time. So we need
-		 * credit needed for single extent allocation.
-		 * journalled mode is currently not supported
-		 * by delalloc
+		 * We have two constraints: We find one extent to map and we
+		 * must always write out the whole page (makes a difference when
+		 * blocksize < pagesize) so that we don't block on IO when we
+		 * try to write out the rest of the page. Journalled mode is
+		 * not supported by delalloc.
 		 */
 		BUG_ON(ext4_should_journal_data(inode));
 		needed_blocks = ext4_da_writepages_trans_blocks(inode);
 
-		/* start a new transaction*/
+		/* start a new transaction */
 		handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
 					    needed_blocks);
 		if (IS_ERR(handle)) {
@@ -2339,76 +2306,67 @@ retry:
 			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
 			       "%ld pages, ino %lu; err %d", __func__,
 				wbc->nr_to_write, inode->i_ino, ret);
-			blk_finish_plug(&plug);
-			goto out_writepages;
+			/* Release allocated io_end */
+			ext4_put_io_end(mpd.io_submit.io_end);
+			break;
 		}
 
-		/*
-		 * Now call write_cache_pages_da() to find the next
-		 * contiguous region of logical blocks that need
-		 * blocks to be allocated by ext4 and submit them.
-		 */
-		ret = write_cache_pages_da(handle, mapping,
-					   wbc, &mpd, &done_index);
-		/*
-		 * If we have a contiguous extent of pages and we
-		 * haven't done the I/O yet, map the blocks and submit
-		 * them for I/O.
-		 */
-		if (!mpd.io_done && mpd.next_page != mpd.first_page) {
-			mpage_da_map_and_submit(&mpd);
-			ret = MPAGE_DA_EXTENT_TAIL;
+		trace_ext4_da_write_pages(inode, mpd.first_page, mpd.wbc);
+		ret = mpage_prepare_extent_to_map(&mpd);
+		if (!ret) {
+			if (mpd.map.m_len)
+				ret = mpage_map_and_submit_extent(handle, &mpd);
+			else {
+				/*
+				 * We scanned the whole range (or exhausted
+				 * nr_to_write), submitted what was mapped and
+				 * didn't find anything needing mapping. We are
+				 * done.
+				 */
+				done = true;
+			}
 		}
-		trace_ext4_da_write_pages(inode, &mpd);
-		wbc->nr_to_write -= mpd.pages_written;
-
 		ext4_journal_stop(handle);
-
-		if ((mpd.retval == -ENOSPC) && sbi->s_journal) {
-			/* commit the transaction which would
+		/* Submit prepared bio */
+		ext4_io_submit(&mpd.io_submit);
+		/* Unlock pages we didn't use */
+		mpage_release_unused_pages(&mpd, false);
+		/* Drop our io_end reference we got from init */
+		ext4_put_io_end(mpd.io_submit.io_end);
+
+		if (ret == -ENOSPC && sbi->s_journal) {
+			/*
+			 * Commit the transaction which would
 			 * free blocks released in the transaction
 			 * and try again
 			 */
 			jbd2_journal_force_commit_nested(sbi->s_journal);
 			ret = 0;
-		} else if (ret == MPAGE_DA_EXTENT_TAIL) {
-			/*
-			 * Got one extent now try with rest of the pages.
-			 * If mpd.retval is set -EIO, journal is aborted.
-			 * So we don't need to write any more.
-			 */
-			pages_written += mpd.pages_written;
-			ret = mpd.retval;
-			io_done = 1;
-		} else if (wbc->nr_to_write)
-			/*
-			 * There is no more writeout needed
-			 * or we requested for a noblocking writeout
-			 * and we found the device congested
-			 */
+			continue;
+		}
+		/* Fatal error - ENOMEM, EIO... */
+		if (ret)
 			break;
 	}
 	blk_finish_plug(&plug);
-	if (!io_done && !cycled) {
+	if (!ret && !cycled) {
 		cycled = 1;
-		index = 0;
-		wbc->range_start = index << PAGE_CACHE_SHIFT;
-		wbc->range_end  = mapping->writeback_index - 1;
+		mpd.last_page = writeback_index - 1;
+		mpd.first_page = 0;
 		goto retry;
 	}
 
 	/* Update index */
-	wbc->range_cyclic = range_cyclic;
 	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
 		/*
-		 * set the writeback_index so that range_cyclic
+		 * Set the writeback_index so that range_cyclic
 		 * mode will write it back later
 		 */
-		mapping->writeback_index = done_index;
+		mapping->writeback_index = mpd.first_page;
 
 out_writepages:
-	wbc->range_start = range_start;
-	trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
+	trace_ext4_da_writepages_result(inode, wbc, ret,
+					nr_to_write - wbc->nr_to_write);
 	return ret;
 }
 
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index a601bb3..203dcd5 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -332,43 +332,59 @@ TRACE_EVENT(ext4_da_writepages,
 );
 
 TRACE_EVENT(ext4_da_write_pages,
-	TP_PROTO(struct inode *inode, struct mpage_da_data *mpd),
+	TP_PROTO(struct inode *inode, pgoff_t first_page,
+		 struct writeback_control *wbc),
 
-	TP_ARGS(inode, mpd),
+	TP_ARGS(inode, first_page, wbc),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
-		__field(	__u64,	b_blocknr		)
-		__field(	__u32,	b_size			)
-		__field(	__u32,	b_state			)
-		__field(	unsigned long,	first_page	)
-		__field(	int,	io_done			)
-		__field(	int,	pages_written		)
-		__field(	int,	sync_mode		)
+		__field(      pgoff_t,	first_page		)
+		__field(	 long,	nr_to_write		)
+		__field(	  int,	sync_mode		)
 	),
 
 	TP_fast_assign(
 		__entry->dev		= inode->i_sb->s_dev;
 		__entry->ino		= inode->i_ino;
-		__entry->b_blocknr	= mpd->b_blocknr;
-		__entry->b_size		= mpd->b_size;
-		__entry->b_state	= mpd->b_state;
-		__entry->first_page	= mpd->first_page;
-		__entry->io_done	= mpd->io_done;
-		__entry->pages_written	= mpd->pages_written;
-		__entry->sync_mode	= mpd->wbc->sync_mode;
+		__entry->first_page	= first_page;
+		__entry->nr_to_write	= wbc->nr_to_write;
+		__entry->sync_mode	= wbc->sync_mode;
 	),
 
-	TP_printk("dev %d,%d ino %lu b_blocknr %llu b_size %u b_state 0x%04x "
-		  "first_page %lu io_done %d pages_written %d sync_mode %d",
+	TP_printk("dev %d,%d ino %lu first_page %lu nr_to_write %ld "
+		  "sync_mode %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  (unsigned long) __entry->ino,
-		  __entry->b_blocknr, __entry->b_size,
-		  __entry->b_state, __entry->first_page,
-		  __entry->io_done, __entry->pages_written,
-		  __entry->sync_mode
-                  )
+		  (unsigned long) __entry->ino, __entry->first_page,
+		  __entry->nr_to_write, __entry->sync_mode)
+);
+
+TRACE_EVENT(ext4_da_write_pages_extent,
+	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map),
+
+	TP_ARGS(inode, map),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,	dev			)
+		__field(	ino_t,	ino			)
+		__field(	__u64,	lblk			)
+		__field(	__u32,	len			)
+		__field(	__u32,	flags			)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= inode->i_sb->s_dev;
+		__entry->ino		= inode->i_ino;
+		__entry->lblk		= map->m_lblk;
+		__entry->len		= map->m_len;
+		__entry->flags		= map->m_flags;
+	),
+
+	TP_printk("dev %d,%d ino %lu lblk %llu len %u flags 0x%04x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->ino, __entry->lblk, __entry->len,
+		  __entry->flags)
 );
 
 TRACE_EVENT(ext4_da_writepages_result,
-- 
1.7.1



* [PATCH 19/29] ext4: Remove buffer_uninit handling
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (17 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 18/29] ext4: Restructure writeback path Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  6:56   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 20/29] ext4: Use transaction reservation for extent conversion in ext4_end_io Jan Kara
                   ` (9 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

There isn't any need to set BH_Uninit on buffers anymore. It was only
used to signal that we need to mark an io_end as needing extent
conversion in add_bh_to_extent(), but now we can mark the io_end
directly when mapping the extent.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |   15 ++++++---------
 fs/ext4/inode.c   |    4 ++--
 fs/ext4/page-io.c |    2 --
 3 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index cb1ba1c..3c3827a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2606,20 +2606,17 @@ extern void ext4_mmp_csum_set(struct super_block *sb, struct mmp_struct *mmp);
 extern int ext4_mmp_csum_verify(struct super_block *sb,
 				struct mmp_struct *mmp);
 
-/* BH_Uninit flag: blocks are allocated but uninitialized on disk */
+/*
+ * Note that these flags will never ever appear in a buffer_head's state flag.
+ * See EXT4_MAP_... to see where this is used.
+ */
 enum ext4_state_bits {
 	BH_Uninit	/* blocks are allocated but uninitialized on disk */
-	  = BH_JBDPrivateStart,
+	 = BH_JBDPrivateStart,
 	BH_AllocFromCluster,	/* allocated blocks were part of already
-				 * allocated cluster. Note that this flag will
-				 * never, ever appear in a buffer_head's state
-				 * flag. See EXT4_MAP_FROM_CLUSTER to see where
-				 * this is used. */
+				 * allocated cluster. */
 };
 
-BUFFER_FNS(Uninit, uninit)
-TAS_BUFFER_FNS(Uninit, uninit)
-
 /*
  * Add new method to test whether block and inode bitmaps are properly
  * initialized. With uninit_bg reading the block from disk is not enough
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c191a3..0602a09 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1924,8 +1924,6 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
 					clear_buffer_delay(bh);
 					bh->b_blocknr = pblock++;
 				}
-				if (mpd->map.m_flags & EXT4_MAP_UNINIT)
-					set_buffer_uninit(bh);
 				clear_buffer_unwritten(bh);
 			} while (lblk++, (bh = bh->b_this_page) != head);
 
@@ -1975,6 +1973,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
 	if (err < 0)
 		return err;
+	if (map->m_flags & EXT4_MAP_UNINIT)
+		ext4_set_io_unwritten_flag(inode, mpd->io_submit.io_end);
 
 	BUG_ON(map->m_len == 0);
 	if (map->m_flags & EXT4_MAP_NEW) {
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index efdf0a5..cc59cd9 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -377,8 +377,6 @@ submit_and_retry:
 	if (ret != bh->b_size)
 		goto submit_and_retry;
 	io_end = io->io_end;
-	if (test_clear_buffer_uninit(bh))
-		ext4_set_io_unwritten_flag(inode, io_end);
 	io_end->size += bh->b_size;
 	io->io_next_block++;
 	return 0;
-- 
1.7.1



* [PATCH 20/29] ext4: Use transaction reservation for extent conversion in ext4_end_io
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (18 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 19/29] ext4: Remove buffer_uninit handling Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  6:57   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 21/29] ext4: Split extent conversion lists to reserved & unreserved parts Jan Kara
                   ` (8 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Later we would like to clear the PageWriteback bit only after extent
conversion from unwritten to written extents is performed. However it is
not possible to start a transaction after PageWriteback is set because
that violates lock ordering (and easily deadlocks). So we reserve a
transaction before locking pages and submitting them for IO, and later
use that transaction for extent conversion from ext4_end_io().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h      |   12 +++++++++---
 fs/ext4/ext4_jbd2.h |    3 ++-
 fs/ext4/extents.c   |   39 ++++++++++++++++++++++++++++-----------
 fs/ext4/inode.c     |   32 ++++++++++++++++++++++++++++++--
 fs/ext4/page-io.c   |   11 ++++++++---
 5 files changed, 77 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3c3827a..65adf0d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -182,10 +182,13 @@ struct ext4_map_blocks {
 #define EXT4_IO_END_DIRECT	0x0004
 
 /*
- * For converting uninitialized extents on a work queue.
+ * For converting uninitialized extents on a work queue. 'handle' is used for
+ * buffered writeback.
  */
 typedef struct ext4_io_end {
 	struct list_head	list;		/* per-file finished IO list */
+	handle_t		*handle;	/* handle reserved for extent
+						 * conversion */
 	struct inode		*inode;		/* file being written to */
 	unsigned int		flag;		/* unwritten or not */
 	loff_t			offset;		/* offset in the file */
@@ -1314,6 +1317,9 @@ static inline void ext4_set_io_unwritten_flag(struct inode *inode,
 					      struct ext4_io_end *io_end)
 {
 	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+		/* Writeback has to have conversion transaction reserved */
+		WARN_ON(!io_end->handle &&
+			!(io_end->flag & EXT4_IO_END_DIRECT));
 		io_end->flag |= EXT4_IO_END_UNWRITTEN;
 		atomic_inc(&EXT4_I(inode)->i_unwritten);
 	}
@@ -2550,8 +2556,8 @@ extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
 extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
 			  loff_t len);
-extern int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
-			  ssize_t len);
+extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
+					  loff_t offset, ssize_t len);
 extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			   struct ext4_map_blocks *map, int flags);
 extern int ext4_ext_calc_metadata_amount(struct inode *inode,
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index bb17931..88e95d7 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -132,7 +132,8 @@ static inline int ext4_jbd2_credits_xattr(struct inode *inode)
 #define EXT4_HT_MIGRATE          8
 #define EXT4_HT_MOVE_EXTENTS     9
 #define EXT4_HT_XATTR           10
-#define EXT4_HT_MAX             11
+#define EXT4_HT_EXT_CONVERT     11
+#define EXT4_HT_MAX             12
 
 /**
  *   struct ext4_journal_cb_entry - Base structure for callback information.
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 8064b71..ae22735 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4484,10 +4484,9 @@ retry:
  * function, to convert the fallocated extents after IO is completed.
  * Returns 0 on success.
  */
-int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
-				    ssize_t len)
+int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
+				   loff_t offset, ssize_t len)
 {
-	handle_t *handle;
 	unsigned int max_blocks;
 	int ret = 0;
 	int ret2 = 0;
@@ -4502,16 +4501,31 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
 	max_blocks = ((EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) -
 		      map.m_lblk);
 	/*
-	 * credits to insert 1 extent into extent tree
+	 * This is somewhat ugly but the idea is clear: When transaction is
+	 * reserved, everything goes into it. Otherwise we rather start several
+	 * smaller transactions for conversion of each extent separately.
 	 */
-	credits = ext4_chunk_trans_blocks(inode, max_blocks);
+	if (handle) {
+		handle = ext4_journal_start_reserved(handle);
+		if (IS_ERR(handle))
+			return PTR_ERR(handle);
+		credits = 0;
+	} else {
+		/*
+		 * credits to insert 1 extent into extent tree
+		 */
+		credits = ext4_chunk_trans_blocks(inode, max_blocks);
+	}
 	while (ret >= 0 && ret < max_blocks) {
 		map.m_lblk += ret;
 		map.m_len = (max_blocks -= ret);
-		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			break;
+		if (credits) {
+			handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
+						    credits);
+			if (IS_ERR(handle)) {
+				ret = PTR_ERR(handle);
+				break;
+			}
 		}
 		ret = ext4_map_blocks(handle, inode, &map,
 				      EXT4_GET_BLOCKS_IO_CONVERT_EXT);
@@ -4522,10 +4536,13 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
 				     inode->i_ino, map.m_lblk,
 				     map.m_len, ret);
 		ext4_mark_inode_dirty(handle, inode);
-		ret2 = ext4_journal_stop(handle);
-		if (ret <= 0 || ret2 )
+		if (credits)
+			ret2 = ext4_journal_stop(handle);
+		if (ret <= 0 || ret2)
 			break;
 	}
+	if (!credits)
+		ret2 = ext4_journal_stop(handle);
 	return ret > 0 ? ret2 : ret;
 }
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0602a09..f8e78ce 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1327,6 +1327,8 @@ static void ext4_da_page_release_reservation(struct page *page,
 struct mpage_da_data {
 	struct inode *inode;
 	struct writeback_control *wbc;
+	handle_t *reserved_handle;	/* Handle reserved for conversion */
+
 	pgoff_t first_page;	/* The first page to write */
 	pgoff_t next_page;	/* Current page to examine */
 	pgoff_t last_page;	/* Last page to examine */
@@ -1973,8 +1975,13 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
 	if (err < 0)
 		return err;
-	if (map->m_flags & EXT4_MAP_UNINIT)
+	if (map->m_flags & EXT4_MAP_UNINIT) {
+		if (!mpd->io_submit.io_end->handle) {
+			mpd->io_submit.io_end->handle = mpd->reserved_handle;
+			mpd->reserved_handle = NULL;
+		}
 		ext4_set_io_unwritten_flag(inode, mpd->io_submit.io_end);
+	}
 
 	BUG_ON(map->m_len == 0);
 	if (map->m_flags & EXT4_MAP_NEW) {
@@ -2274,6 +2281,7 @@ static int ext4_da_writepages(struct address_space *mapping,
 
 	mpd.inode = inode;
 	mpd.wbc = wbc;
+	mpd.reserved_handle = NULL;
 	ext4_io_submit_init(&mpd.io_submit, wbc);
 retry:
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
@@ -2288,6 +2296,23 @@ retry:
 			break;
 		}
 
+		/* Reserve handle if it may be needed for extent conversion */
+		if (ext4_should_dioread_nolock(inode) && !mpd.reserved_handle) {
+			/*
+			 * We may need to convert up to one extent per block in
+			 * the page and we may dirty the inode.
+			 */
+			mpd.reserved_handle = ext4_journal_reserve(inode,
+				EXT4_HT_EXT_CONVERT,
+				1 + (PAGE_CACHE_SIZE >> inode->i_blkbits));
+			if (IS_ERR(mpd.reserved_handle)) {
+				ret = PTR_ERR(mpd.reserved_handle);
+				mpd.reserved_handle = NULL;
+				ext4_put_io_end(mpd.io_submit.io_end);
+				break;
+			}
+		}
+
 		/*
 		 * We have two constraints: We find one extent to map and we
 		 * must always write out whole page (makes a difference when
@@ -2364,6 +2389,9 @@ retry:
 		 */
 		mapping->writeback_index = mpd.first_page;
 
+	if (mpd.reserved_handle)
+		ext4_journal_free_reserved(mpd.reserved_handle);
+
 out_writepages:
 	trace_ext4_da_writepages_result(inode, wbc, ret,
 					nr_to_write - wbc->nr_to_write);
@@ -2977,7 +3005,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 		 * for non AIO case, since the IO is already
 		 * completed, we could do the conversion right here
 		 */
-		err = ext4_convert_unwritten_extents(inode,
+		err = ext4_convert_unwritten_extents(NULL, inode,
 						     offset, ret);
 		if (err < 0)
 			ret = err;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index cc59cd9..e8ee4da 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -55,6 +55,7 @@ static void ext4_release_io_end(ext4_io_end_t *io_end)
 {
 	BUG_ON(!list_empty(&io_end->list));
 	BUG_ON(io_end->flag & EXT4_IO_END_UNWRITTEN);
+	WARN_ON(io_end->handle);
 
 	if (atomic_dec_and_test(&EXT4_I(io_end->inode)->i_ioend_count))
 		wake_up_all(ext4_ioend_wq(io_end->inode));
@@ -81,13 +82,15 @@ static int ext4_end_io(ext4_io_end_t *io)
 	struct inode *inode = io->inode;
 	loff_t offset = io->offset;
 	ssize_t size = io->size;
+	handle_t *handle = io->handle;
 	int ret = 0;
 
 	ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
 		   "list->prev 0x%p\n",
 		   io, inode->i_ino, io->list.next, io->list.prev);
 
-	ret = ext4_convert_unwritten_extents(inode, offset, size);
+	io->handle = NULL;	/* Following call will use up the handle */
+	ret = ext4_convert_unwritten_extents(handle, inode, offset, size);
 	if (ret < 0) {
 		ext4_msg(inode->i_sb, KERN_EMERG,
 			 "failed to convert unwritten extents to written "
@@ -217,8 +220,10 @@ int ext4_put_io_end(ext4_io_end_t *io_end)
 
 	if (atomic_dec_and_test(&io_end->count)) {
 		if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
-			err = ext4_convert_unwritten_extents(io_end->inode,
-						io_end->offset, io_end->size);
+			err = ext4_convert_unwritten_extents(io_end->handle,
+						io_end->inode, io_end->offset,
+						io_end->size);
+			io_end->handle = NULL;
 			ext4_clear_io_unwritten_flag(io_end);
 		}
 		ext4_release_io_end(io_end);
-- 
1.7.1



* [PATCH 21/29] ext4: Split extent conversion lists to reserved & unreserved parts
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (19 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 20/29] ext4: Use transaction reservation for extent conversion in ext4_end_io Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  7:03   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 22/29] ext4: Defer clearing of PageWriteback after extent conversion Jan Kara
                   ` (7 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Now that we have extent conversions with a reserved transaction, we have
to prevent extent conversions without a reserved transaction (from the
DIO code) from blocking those with one (as that would effectively void
any transaction reservation we did). So split the lists, work items, and
work queues into reserved and unreserved parts.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |   25 +++++++++++++++++-----
 fs/ext4/page-io.c |   59 ++++++++++++++++++++++++++++++++++------------------
 fs/ext4/super.c   |   38 ++++++++++++++++++++++++---------
 3 files changed, 84 insertions(+), 38 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 65adf0d..a594a94 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -889,12 +889,22 @@ struct ext4_inode_info {
 	qsize_t i_reserved_quota;
 #endif
 
-	/* completed IOs that might need unwritten extents handling */
-	struct list_head i_completed_io_list;
+	/* Lock protecting lists below */
 	spinlock_t i_completed_io_lock;
+	/*
+	 * Completed IOs that need unwritten extents handling and have
+	 * transaction reserved
+	 */
+	struct list_head i_rsv_conversion_list;
+	/*
+	 * Completed IOs that need unwritten extents handling and don't have
+	 * transaction reserved
+	 */
+	struct list_head i_unrsv_conversion_list;
 	atomic_t i_ioend_count;	/* Number of outstanding io_end structs */
 	atomic_t i_unwritten; /* Nr. of inflight conversions pending */
-	struct work_struct i_unwritten_work;	/* deferred extent conversion */
+	struct work_struct i_rsv_conversion_work;
+	struct work_struct i_unrsv_conversion_work;
 
 	spinlock_t i_block_reservation_lock;
 
@@ -1257,8 +1267,10 @@ struct ext4_sb_info {
 	struct flex_groups *s_flex_groups;
 	ext4_group_t s_flex_groups_allocated;
 
-	/* workqueue for dio unwritten */
-	struct workqueue_struct *dio_unwritten_wq;
+	/* workqueue for unreserved extent conversions (dio) */
+	struct workqueue_struct *unrsv_conversion_wq;
+	/* workqueue for reserved extent conversions (buffered io) */
+	struct workqueue_struct *rsv_conversion_wq;
 
 	/* timer for periodic error stats printing */
 	struct timer_list s_err_report;
@@ -2599,7 +2611,8 @@ extern int ext4_put_io_end(ext4_io_end_t *io_end);
 extern void ext4_put_io_end_defer(ext4_io_end_t *io_end);
 extern void ext4_io_submit_init(struct ext4_io_submit *io,
 				struct writeback_control *wbc);
-extern void ext4_end_io_work(struct work_struct *work);
+extern void ext4_end_io_rsv_work(struct work_struct *work);
+extern void ext4_end_io_unrsv_work(struct work_struct *work);
 extern void ext4_io_submit(struct ext4_io_submit *io);
 extern int ext4_bio_write_page(struct ext4_io_submit *io,
 			       struct page *page,
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index e8ee4da..8bff3b3 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -103,20 +103,17 @@ static int ext4_end_io(ext4_io_end_t *io)
 	return ret;
 }
 
-static void dump_completed_IO(struct inode *inode)
+static void dump_completed_IO(struct inode *inode, struct list_head *head)
 {
 #ifdef	EXT4FS_DEBUG
 	struct list_head *cur, *before, *after;
 	ext4_io_end_t *io, *io0, *io1;
 
-	if (list_empty(&EXT4_I(inode)->i_completed_io_list)) {
-		ext4_debug("inode %lu completed_io list is empty\n",
-			   inode->i_ino);
+	if (list_empty(head))
 		return;
-	}
 
-	ext4_debug("Dump inode %lu completed_io list\n", inode->i_ino);
-	list_for_each_entry(io, &EXT4_I(inode)->i_completed_io_list, list) {
+	ext4_debug("Dump inode %lu completed io list\n", inode->i_ino);
+	list_for_each_entry(io, head, list) {
 		cur = &io->list;
 		before = cur->prev;
 		io0 = container_of(before, ext4_io_end_t, list);
@@ -137,16 +134,23 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
 	unsigned long flags;
 
 	BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
-	wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
-
 	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
-	if (list_empty(&ei->i_completed_io_list))
-		queue_work(wq, &ei->i_unwritten_work);
-	list_add_tail(&io_end->list, &ei->i_completed_io_list);
+	if (io_end->handle) {
+		wq = EXT4_SB(io_end->inode->i_sb)->rsv_conversion_wq;
+		if (list_empty(&ei->i_rsv_conversion_list))
+			queue_work(wq, &ei->i_rsv_conversion_work);
+		list_add_tail(&io_end->list, &ei->i_rsv_conversion_list);
+	} else {
+		wq = EXT4_SB(io_end->inode->i_sb)->unrsv_conversion_wq;
+		if (list_empty(&ei->i_unrsv_conversion_list))
+			queue_work(wq, &ei->i_unrsv_conversion_work);
+		list_add_tail(&io_end->list, &ei->i_unrsv_conversion_list);
+	}
 	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
 }
 
-static int ext4_do_flush_completed_IO(struct inode *inode)
+static int ext4_do_flush_completed_IO(struct inode *inode,
+				      struct list_head *head)
 {
 	ext4_io_end_t *io;
 	struct list_head unwritten;
@@ -155,8 +159,8 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
 	int err, ret = 0;
 
 	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
-	dump_completed_IO(inode);
-	list_replace_init(&ei->i_completed_io_list, &unwritten);
+	dump_completed_IO(inode, head);
+	list_replace_init(head, &unwritten);
 	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
 
 	while (!list_empty(&unwritten)) {
@@ -172,21 +176,34 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
 }
 
 /*
- * work on completed aio dio IO, to convert unwritten extents to extents
+ * work on completed IO, to convert unwritten extents to extents
  */
-void ext4_end_io_work(struct work_struct *work)
+void ext4_end_io_rsv_work(struct work_struct *work)
 {
 	struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
-						  i_unwritten_work);
-	ext4_do_flush_completed_IO(&ei->vfs_inode);
+						  i_rsv_conversion_work);
+	ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_rsv_conversion_list);
+}
+
+void ext4_end_io_unrsv_work(struct work_struct *work)
+{
+	struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
+						  i_unrsv_conversion_work);
+	ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_unrsv_conversion_list);
 }
 
 int ext4_flush_unwritten_io(struct inode *inode)
 {
-	int ret;
+	int ret, err;
+
 	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex) &&
 		     !(inode->i_state & I_FREEING));
-	ret = ext4_do_flush_completed_IO(inode);
+	ret = ext4_do_flush_completed_IO(inode,
+					 &EXT4_I(inode)->i_rsv_conversion_list);
+	err = ext4_do_flush_completed_IO(inode,
+					 &EXT4_I(inode)->i_unrsv_conversion_list);
+	if (!ret)
+		ret = err;
 	ext4_unwritten_wait(inode);
 	return ret;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 09ff724..916c4fb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -747,8 +747,10 @@ static void ext4_put_super(struct super_block *sb)
 	ext4_unregister_li_request(sb);
 	dquot_disable(sb, -1, DQUOT_USAGE_ENABLED | DQUOT_LIMITS_ENABLED);
 
-	flush_workqueue(sbi->dio_unwritten_wq);
-	destroy_workqueue(sbi->dio_unwritten_wq);
+	flush_workqueue(sbi->unrsv_conversion_wq);
+	flush_workqueue(sbi->rsv_conversion_wq);
+	destroy_workqueue(sbi->unrsv_conversion_wq);
+	destroy_workqueue(sbi->rsv_conversion_wq);
 
 	if (sbi->s_journal) {
 		err = jbd2_journal_destroy(sbi->s_journal);
@@ -856,13 +858,15 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_reserved_quota = 0;
 #endif
 	ei->jinode = NULL;
-	INIT_LIST_HEAD(&ei->i_completed_io_list);
+	INIT_LIST_HEAD(&ei->i_rsv_conversion_list);
+	INIT_LIST_HEAD(&ei->i_unrsv_conversion_list);
 	spin_lock_init(&ei->i_completed_io_lock);
 	ei->i_sync_tid = 0;
 	ei->i_datasync_tid = 0;
 	atomic_set(&ei->i_ioend_count, 0);
 	atomic_set(&ei->i_unwritten, 0);
-	INIT_WORK(&ei->i_unwritten_work, ext4_end_io_work);
+	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
+	INIT_WORK(&ei->i_unrsv_conversion_work, ext4_end_io_unrsv_work);
 
 	return &ei->vfs_inode;
 }
@@ -3867,12 +3871,20 @@ no_journal:
 	 * The maximum number of concurrent works can be high and
 	 * concurrency isn't really necessary.  Limit it to 1.
 	 */
-	EXT4_SB(sb)->dio_unwritten_wq =
-		alloc_workqueue("ext4-dio-unwritten", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
-	if (!EXT4_SB(sb)->dio_unwritten_wq) {
-		printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n");
+	EXT4_SB(sb)->rsv_conversion_wq =
+		alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+	if (!EXT4_SB(sb)->rsv_conversion_wq) {
+		printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
 		ret = -ENOMEM;
-		goto failed_mount_wq;
+		goto failed_mount4;
+	}
+
+	EXT4_SB(sb)->unrsv_conversion_wq =
+		alloc_workqueue("ext4-unrsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+	if (!EXT4_SB(sb)->unrsv_conversion_wq) {
+		printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
+		ret = -ENOMEM;
+		goto failed_mount4;
 	}
 
 	/*
@@ -4019,7 +4031,10 @@ failed_mount4a:
 	sb->s_root = NULL;
 failed_mount4:
 	ext4_msg(sb, KERN_ERR, "mount failed");
-	destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq);
+	if (EXT4_SB(sb)->rsv_conversion_wq)
+		destroy_workqueue(EXT4_SB(sb)->rsv_conversion_wq);
+	if (EXT4_SB(sb)->unrsv_conversion_wq)
+		destroy_workqueue(EXT4_SB(sb)->unrsv_conversion_wq);
 failed_mount_wq:
 	if (sbi->s_journal) {
 		jbd2_journal_destroy(sbi->s_journal);
@@ -4464,7 +4479,8 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 
 	trace_ext4_sync_fs(sb, wait);
-	flush_workqueue(sbi->dio_unwritten_wq);
+	flush_workqueue(sbi->rsv_conversion_wq);
+	flush_workqueue(sbi->unrsv_conversion_wq);
 	/*
 	 * Writeback quota in non-journalled quota case - journalled quota has
 	 * no dirty dquots
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 22/29] ext4: Defer clearing of PageWriteback after extent conversion
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (20 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 21/29] ext4: Split extent conversion lists to reserved & unreserved parts Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  7:08   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 23/29] ext4: Protect extent conversion after DIO with i_dio_count Jan Kara
                   ` (6 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Currently PageWriteback bit gets cleared from put_io_page() called from
ext4_end_bio(). This is somewhat inconvenient as extent tree is not
fully updated at that time (unwritten extents are not marked as written)
so we cannot read the data back yet. This design was dictated by lock
ordering as we cannot start a transaction while PageWriteback bit is set
(we could easily deadlock with ext4_da_writepages()). But now that we
use transaction reservation for extent conversion, locking issues are
solved and we can move PageWriteback bit clearing after extent
conversion is done. As a result we can remove the wait for unwritten
extent conversion from ext4_sync_file() because it already implicitly
happens through wait_on_page_writeback().

We implement deferring of PageWriteback clearing by queueing completed
bios to the appropriate io_end and processing all the pages when the
io_end is going to be freed, instead of at the moment ext4_end_bio() is
called.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |    2 +
 fs/ext4/fsync.c   |    4 --
 fs/ext4/page-io.c |  132 +++++++++++++++++++++++++++++------------------------
 3 files changed, 74 insertions(+), 64 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a594a94..2b0dd9a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -190,6 +190,8 @@ typedef struct ext4_io_end {
 	handle_t		*handle;	/* handle reserved for extent
 						 * conversion */
 	struct inode		*inode;		/* file being written to */
+	struct bio		*bio;		/* Linked list of completed
+						 * bios covering the extent */
 	unsigned int		flag;		/* unwritten or not */
 	loff_t			offset;		/* offset in the file */
 	ssize_t			size;		/* size of the extent */
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 3278e64..e02ba1b 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -132,10 +132,6 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	if (inode->i_sb->s_flags & MS_RDONLY)
 		goto out;
 
-	ret = ext4_flush_unwritten_io(inode);
-	if (ret < 0)
-		goto out;
-
 	if (!journal) {
 		ret = __sync_inode(inode, datasync);
 		if (!ret && !hlist_empty(&inode->i_dentry))
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 8bff3b3..2967794 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -51,14 +51,83 @@ void ext4_ioend_wait(struct inode *inode)
 	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
 }
 
+/*
+ * Print an buffer I/O error compatible with the fs/buffer.c.  This
+ * provides compatibility with dmesg scrapers that look for a specific
+ * buffer I/O error message.  We really need a unified error reporting
+ * structure to userspace ala Digital Unix's uerf system, but it's
+ * probably not going to happen in my lifetime, due to LKML politics...
+ */
+static void buffer_io_error(struct buffer_head *bh)
+{
+	char b[BDEVNAME_SIZE];
+	printk(KERN_ERR "Buffer I/O error on device %s, logical block %llu\n",
+			bdevname(bh->b_bdev, b),
+			(unsigned long long)bh->b_blocknr);
+}
+
+static void ext4_finish_bio(struct bio *bio)
+{
+	int i;
+	int error = !test_bit(BIO_UPTODATE, &bio->bi_flags);
+
+	for (i = 0; i < bio->bi_vcnt; i++) {
+		struct bio_vec *bvec = &bio->bi_io_vec[i];
+		struct page *page = bvec->bv_page;
+		struct buffer_head *bh, *head;
+		unsigned bio_start = bvec->bv_offset;
+		unsigned bio_end = bio_start + bvec->bv_len;
+		unsigned under_io = 0;
+		unsigned long flags;
+
+		if (!page)
+			continue;
+
+		if (error) {
+			SetPageError(page);
+			set_bit(AS_EIO, &page->mapping->flags);
+		}
+		bh = head = page_buffers(page);
+		/*
+		 * We check all buffers in the page under BH_Uptodate_Lock
+		 * to avoid races with other end io clearing async_write flags
+		 */
+		local_irq_save(flags);
+		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
+		do {
+			if (bh_offset(bh) < bio_start ||
+			    bh_offset(bh) + bh->b_size > bio_end) {
+				if (buffer_async_write(bh))
+					under_io++;
+				continue;
+			}
+			clear_buffer_async_write(bh);
+			if (error)
+				buffer_io_error(bh);
+		} while ((bh = bh->b_this_page) != head);
+		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
+		local_irq_restore(flags);
+		if (!under_io)
+			end_page_writeback(page);
+	}
+}
+
 static void ext4_release_io_end(ext4_io_end_t *io_end)
 {
+	struct bio *bio, *next_bio;
+
 	BUG_ON(!list_empty(&io_end->list));
 	BUG_ON(io_end->flag & EXT4_IO_END_UNWRITTEN);
 	WARN_ON(io_end->handle);
 
 	if (atomic_dec_and_test(&EXT4_I(io_end->inode)->i_ioend_count))
 		wake_up_all(ext4_ioend_wq(io_end->inode));
+
+	for (bio = io_end->bio; bio; bio = next_bio) {
+		next_bio = bio->bi_private;
+		ext4_finish_bio(bio);
+		bio_put(bio);
+	}
 	if (io_end->flag & EXT4_IO_END_DIRECT)
 		inode_dio_done(io_end->inode);
 	if (io_end->iocb)
@@ -254,76 +323,20 @@ ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end)
 	return io_end;
 }
 
-/*
- * Print an buffer I/O error compatible with the fs/buffer.c.  This
- * provides compatibility with dmesg scrapers that look for a specific
- * buffer I/O error message.  We really need a unified error reporting
- * structure to userspace ala Digital Unix's uerf system, but it's
- * probably not going to happen in my lifetime, due to LKML politics...
- */
-static void buffer_io_error(struct buffer_head *bh)
-{
-	char b[BDEVNAME_SIZE];
-	printk(KERN_ERR "Buffer I/O error on device %s, logical block %llu\n",
-			bdevname(bh->b_bdev, b),
-			(unsigned long long)bh->b_blocknr);
-}
-
 static void ext4_end_bio(struct bio *bio, int error)
 {
 	ext4_io_end_t *io_end = bio->bi_private;
 	struct inode *inode;
-	int i;
-	int blocksize;
 	sector_t bi_sector = bio->bi_sector;
 
 	BUG_ON(!io_end);
 	inode = io_end->inode;
-	blocksize = 1 << inode->i_blkbits;
-	bio->bi_private = NULL;
 	bio->bi_end_io = NULL;
+	/* Link bio into list hanging from io_end */
+	bio->bi_private = io_end->bio;
+	io_end->bio = bio;
 	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = 0;
-	for (i = 0; i < bio->bi_vcnt; i++) {
-		struct bio_vec *bvec = &bio->bi_io_vec[i];
-		struct page *page = bvec->bv_page;
-		struct buffer_head *bh, *head;
-		unsigned bio_start = bvec->bv_offset;
-		unsigned bio_end = bio_start + bvec->bv_len;
-		unsigned under_io = 0;
-		unsigned long flags;
-
-		if (!page)
-			continue;
-
-		if (error) {
-			SetPageError(page);
-			set_bit(AS_EIO, &page->mapping->flags);
-		}
-		bh = head = page_buffers(page);
-		/*
-		 * We check all buffers in the page under BH_Uptodate_Lock
-		 * to avoid races with other end io clearing async_write flags
-		 */
-		local_irq_save(flags);
-		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
-		do {
-			if (bh_offset(bh) < bio_start ||
-			    bh_offset(bh) + blocksize > bio_end) {
-				if (buffer_async_write(bh))
-					under_io++;
-				continue;
-			}
-			clear_buffer_async_write(bh);
-			if (error)
-				buffer_io_error(bh);
-		} while ((bh = bh->b_this_page) != head);
-		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
-		local_irq_restore(flags);
-		if (!under_io)
-			end_page_writeback(page);
-	}
-	bio_put(bio);
 
 	if (error) {
 		io_end->flag |= EXT4_IO_END_ERROR;
@@ -335,7 +348,6 @@ static void ext4_end_bio(struct bio *bio, int error)
 			     (unsigned long long)
 			     bi_sector >> (inode->i_blkbits - 9));
 	}


* [PATCH 23/29] ext4: Protect extent conversion after DIO with i_dio_count
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (21 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 22/29] ext4: Defer clearing of PageWriteback after extent conversion Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  7:08   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 24/29] ext4: Remove wait for unwritten extent conversion from ext4_ext_truncate() Jan Kara
                   ` (5 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Make sure extent conversion after DIO happens while i_dio_count is still
elevated so that inode_dio_wait() waits until extent conversion is done.
This removes the need for explicit waiting for extent conversion in some
cases.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f8e78ce..f493ec2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2914,11 +2914,18 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 
 	BUG_ON(iocb->private == NULL);
 
+	/*
+	 * Make all waiters for direct IO properly wait also for extent
+	 * conversion. This also disallows race between truncate() and
+	 * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
+	 */
+	if (rw == WRITE)
+		atomic_inc(&inode->i_dio_count);
+
 	/* If we do a overwrite dio, i_mutex locking can be released */
 	overwrite = *((int *)iocb->private);
 
 	if (overwrite) {
-		atomic_inc(&inode->i_dio_count);
 		down_read(&EXT4_I(inode)->i_data_sem);
 		mutex_unlock(&inode->i_mutex);
 	}
@@ -3013,9 +3020,10 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 	}
 
 retake_lock:
+	if (rw == WRITE)
+		inode_dio_done(inode);
 	/* take i_mutex locking again if we do a ovewrite dio */
 	if (overwrite) {
-		inode_dio_done(inode);
 		up_read(&EXT4_I(inode)->i_data_sem);
 		mutex_lock(&inode->i_mutex);
 	}
-- 
1.7.1



* [PATCH 24/29] ext4: Remove wait for unwritten extent conversion from ext4_ext_truncate()
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (22 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 23/29] ext4: Protect extent conversion after DIO with i_dio_count Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  7:35   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 25/29] ext4: Use generic_file_fsync() in ext4_file_fsync() in nojournal mode Jan Kara
                   ` (4 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Since PageWriteback bit is now cleared after extents are converted from
unwritten to written ones, we have full exclusion of writeback path from
truncate (truncate_inode_pages() waits for PageWriteback bits to get cleared
on all invalidated pages). Exclusion from DIO path is achieved by
inode_dio_wait() call in ext4_setattr(). So there's no need to wait for
extent conversion in ext4_ext_truncate() anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/extents.c |    6 ------
 fs/ext4/page-io.c |    9 ++++++++-
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ae22735..ca4ff71 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4256,12 +4256,6 @@ void ext4_ext_truncate(struct inode *inode)
 	int err = 0;
 
 	/*
-	 * finish any pending end_io work so we won't run the risk of
-	 * converting any truncated blocks to initialized later
-	 */
-	ext4_flush_unwritten_io(inode);
-
-	/*
 	 * probably first extent we're gonna free will be last in block
 	 */
 	err = ext4_writepage_trans_blocks(inode);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 2967794..2f0b943 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -145,7 +145,14 @@ static void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
 		wake_up_all(ext4_ioend_wq(inode));
 }
 
-/* check a range of space and convert unwritten extents to written. */
+/*
+ * Check a range of space and convert unwritten extents to written. Note that
+ * we are protected from truncate touching same part of extent tree by the
+ * fact that truncate code waits for all DIO to finish (thus exclusion from
+ * direct IO is achieved) and also waits for PageWriteback bits. Thus we
+ * cannot get to ext4_ext_truncate() before all IOs overlapping that range are
+ * completed (happens from ext4_free_ioend()).
+ */
 static int ext4_end_io(ext4_io_end_t *io)
 {
 	struct inode *inode = io->inode;
-- 
1.7.1



* [PATCH 25/29] ext4: Use generic_file_fsync() in ext4_file_fsync() in nojournal mode
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (23 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 24/29] ext4: Remove wait for unwritten extent conversion from ext4_ext_truncate() Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  7:37   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 26/29] ext4: Remove i_mutex from ext4_file_sync() Jan Kara
                   ` (3 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Just use the generic function instead of duplicating it. We only need
to reshuffle the read-only check a bit (which is there to prevent
writing to a filesystem which has been remounted read-only after an
error, I assume).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/fsync.c |   45 ++++++++++-----------------------------------
 1 files changed, 10 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index e02ba1b..1c08780 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -73,32 +73,6 @@ static int ext4_sync_parent(struct inode *inode)
 	return ret;
 }
 
-/**
- * __sync_file - generic_file_fsync without the locking and filemap_write
- * @inode:	inode to sync
- * @datasync:	only sync essential metadata if true
- *
- * This is just generic_file_fsync without the locking.  This is needed for
- * nojournal mode to make sure this inodes data/metadata makes it to disk
- * properly.  The i_mutex should be held already.
- */
-static int __sync_inode(struct inode *inode, int datasync)
-{
-	int err;
-	int ret;
-
-	ret = sync_mapping_buffers(inode->i_mapping);
-	if (!(inode->i_state & I_DIRTY))
-		return ret;
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		return ret;
-
-	err = sync_inode_metadata(inode, 1);
-	if (ret == 0)
-		ret = err;
-	return ret;
-}
-
 /*
  * akpm: A new design for ext4_sync_file().
  *
@@ -116,7 +90,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	struct inode *inode = file->f_mapping->host;
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
-	int ret, err;
+	int ret = 0, err;
 	tid_t commit_tid;
 	bool needs_barrier = false;
 
@@ -124,21 +98,21 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 
 	trace_ext4_sync_file_enter(file, datasync);
 
-	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
-	if (ret)
-		return ret;
-	mutex_lock(&inode->i_mutex);
-
 	if (inode->i_sb->s_flags & MS_RDONLY)
-		goto out;
+		goto out_trace;
 
 	if (!journal) {
-		ret = __sync_inode(inode, datasync);
+		ret = generic_file_fsync(file, start, end, datasync);
 		if (!ret && !hlist_empty(&inode->i_dentry))
 			ret = ext4_sync_parent(inode);
 		goto out;
 	}
 
+	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
+	if (ret)
+		return ret;
+	mutex_lock(&inode->i_mutex);
+
 	/*
 	 * data=writeback,ordered:
 	 *  The caller's filemap_fdatawrite()/wait will sync the data.
@@ -169,8 +143,9 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		if (!ret)
 			ret = err;
 	}
- out:
+out:
 	mutex_unlock(&inode->i_mutex);
+out_trace:
 	trace_ext4_sync_file_exit(inode, ret);
 	return ret;
 }
-- 
1.7.1



* [PATCH 26/29] ext4: Remove i_mutex from ext4_file_sync()
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (24 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 25/29] ext4: Use generic_file_fsync() in ext4_file_fsync() in nojournal mode Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  7:41   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 27/29] ext4: Remove wait for unwritten extents in ext4_ind_direct_IO() Jan Kara
                   ` (2 subsequent siblings)
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

After removal of ext4_flush_unwritten_io() call, ext4_file_sync()
doesn't need i_mutex anymore. Forcing of transaction commits doesn't
need i_mutex as there's nothing inode specific in that code apart from
grabbing transaction ids from the inode. So remove the lock.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/fsync.c |    6 +-----
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 1c08780..9040faa 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -99,7 +99,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	trace_ext4_sync_file_enter(file, datasync);
 
 	if (inode->i_sb->s_flags & MS_RDONLY)
-		goto out_trace;
+		goto out;
 
 	if (!journal) {
 		ret = generic_file_fsync(file, start, end, datasync);
@@ -111,8 +111,6 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
 	if (ret)
 		return ret;
-	mutex_lock(&inode->i_mutex);


* [PATCH 27/29] ext4: Remove wait for unwritten extents in ext4_ind_direct_IO()
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (25 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 26/29] ext4: Remove i_mutex from ext4_file_sync() Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  7:55   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 28/29] ext4: Don't wait for extent conversion in ext4_ext_punch_hole() Jan Kara
  2013-04-08 21:32 ` [PATCH 29/29] ext4: Remove ext4_ioend_wait() Jan Kara
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

We don't have to wait for unwritten extent conversion in
ext4_ind_direct_IO() as all writes that happened before DIO are flushed
by the generic code and extent conversion has happened before we cleared
PageWriteback bit.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 197b202..d1c6be4 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -809,11 +809,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 
 retry:
 	if (rw == READ && ext4_should_dioread_nolock(inode)) {
-		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
-			mutex_lock(&inode->i_mutex);
-			ext4_flush_unwritten_io(inode);
-			mutex_unlock(&inode->i_mutex);
-		}
 		/*
 		 * Nolock dioread optimization may be dynamically disabled
 		 * via ext4_inode_block_unlocked_dio(). Check inode's state
-- 
1.7.1



* [PATCH 28/29] ext4: Don't wait for extent conversion in ext4_ext_punch_hole()
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (26 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 27/29] ext4: Remove wait for unwritten extents in ext4_ind_direct_IO() Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  7:56   ` Zheng Liu
  2013-04-08 21:32 ` [PATCH 29/29] ext4: Remove ext4_ioend_wait() Jan Kara
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

We don't have to wait for extent conversion in ext4_ext_punch_hole() as
buffered IO for the punched range has been flushed and waited upon (thus
all extent conversions for that range have completed). Also we wait for
all DIO to finish using inode_dio_wait() so there cannot be any extent
conversions pending due to direct IO.

Also remove ext4_flush_unwritten_io() since it's unused now.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |    1 -
 fs/ext4/extents.c |    3 ---
 fs/ext4/page-io.c |   16 ----------------
 3 files changed, 0 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2b0dd9a..859f235 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1986,7 +1986,6 @@ static inline  unsigned char get_dtype(struct super_block *sb, int filetype)
 
 /* fsync.c */
 extern int ext4_sync_file(struct file *, loff_t, loff_t, int);
-extern int ext4_flush_unwritten_io(struct inode *);
 
 /* hash.c */
 extern int ext4fs_dirhash(const char *name, int len, struct
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ca4ff71..96c4855 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4697,9 +4697,6 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
 
 	/* Wait all existing dio workers, newcomers will block on i_mutex */
 	ext4_inode_block_unlocked_dio(inode);
-	err = ext4_flush_unwritten_io(inode);
-	if (err)
-		goto out_dio;
 	inode_dio_wait(inode);
 
 	credits = ext4_writepage_trans_blocks(inode);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 2f0b943..1156b9f 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -268,22 +268,6 @@ void ext4_end_io_unrsv_work(struct work_struct *work)
 	ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_unrsv_conversion_list);
 }
 
-int ext4_flush_unwritten_io(struct inode *inode)
-{
-	int ret, err;
-
-	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex) &&
-		     !(inode->i_state & I_FREEING));
-	ret = ext4_do_flush_completed_IO(inode,
-					 &EXT4_I(inode)->i_rsv_conversion_list);
-	err = ext4_do_flush_completed_IO(inode,
-					 &EXT4_I(inode)->i_unrsv_conversion_list);
-	if (!ret)
-		ret = err;
-	ext4_unwritten_wait(inode);
-	return ret;
-}


* [PATCH 29/29] ext4: Remove ext4_ioend_wait()
  2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
                   ` (27 preceding siblings ...)
  2013-04-08 21:32 ` [PATCH 28/29] ext4: Don't wait for extent conversion in ext4_ext_punch_hole() Jan Kara
@ 2013-04-08 21:32 ` Jan Kara
  2013-05-08  7:57   ` Zheng Liu
  28 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-04-08 21:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Now that we clear PageWriteback after extent conversion, there's no need
to wait for io_end processing in ext4_evict_inode(). Running AIO/DIO
keeps a file reference until aio_complete() is called, so
ext4_evict_inode() cannot be called. For io_end structures resulting
from buffered IO, the waiting happens because we wait for PageWriteback
in truncate_inode_pages().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |    1 -
 fs/ext4/inode.c   |    5 +++--
 fs/ext4/page-io.c |    7 -------
 3 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 859f235..b359aef 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2605,7 +2605,6 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
 /* page-io.c */
 extern int __init ext4_init_pageio(void);
 extern void ext4_exit_pageio(void);
-extern void ext4_ioend_wait(struct inode *);
 extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
 extern ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end);
 extern int ext4_put_io_end(ext4_io_end_t *io_end);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f493ec2..1f88941 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -188,8 +188,6 @@ void ext4_evict_inode(struct inode *inode)
 
 	trace_ext4_evict_inode(inode);
 
-	ext4_ioend_wait(inode);
-
 	if (inode->i_nlink) {
 		/*
 		 * When journalling data dirty buffers are tracked only in the
@@ -219,6 +217,8 @@ void ext4_evict_inode(struct inode *inode)
 			filemap_write_and_wait(&inode->i_data);
 		}
 		truncate_inode_pages(&inode->i_data, 0);
+
+		WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
 		goto no_delete;
 	}
 
@@ -229,6 +229,7 @@ void ext4_evict_inode(struct inode *inode)
 		ext4_begin_ordered_truncate(inode, 0);
 	truncate_inode_pages(&inode->i_data, 0);
 
+	WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
 	if (is_bad_inode(inode))
 		goto no_delete;
 
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1156b9f..e720d4e 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -44,13 +44,6 @@ void ext4_exit_pageio(void)
 	kmem_cache_destroy(io_end_cachep);
 }
 
-void ext4_ioend_wait(struct inode *inode)
-{
-	wait_queue_head_t *wq = ext4_ioend_wq(inode);
-
-	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
-}


* Re: [PATCH 01/29] ext4: Make ext4_bio_write_page() use BH_Async_Write flags instead page pointers from ext4_io_end
  2013-04-08 21:32 ` [PATCH 01/29] ext4: Make ext4_bio_write_page() use BH_Async_Write flags instead page pointers from ext4_io_end Jan Kara
@ 2013-04-10 18:05   ` Dmitry Monakhov
  2013-04-11 13:38   ` Zheng Liu
  2013-04-12  3:50   ` Theodore Ts'o
  2 siblings, 0 replies; 76+ messages in thread
From: Dmitry Monakhov @ 2013-04-10 18:05 UTC (permalink / raw)
  To: Jan Kara, Ted Tso; +Cc: linux-ext4, Jan Kara

On Mon,  8 Apr 2013 23:32:06 +0200, Jan Kara <jack@suse.cz> wrote:
> So far ext4_bio_write_page() attached all the pages to the ext4_io_end
> structure. This makes that structure pretty heavy (1 KB for the pointer
> array + 16 bytes per page attached to the bio). Also, later we would like
> to share one ext4_io_end structure among several bios in case IO to a
> single extent needs to be split among several bios, and pointing to pages
> from ext4_io_end makes this complex.
> 
> We remove the page pointers from ext4_io_end and use the pointers from the
> bio itself instead. This isn't as easy when blocksize < pagesize, because
> then we can have several bios in flight for a single page and we have to be
> careful about when to call end_page_writeback(). However, this is a known
> problem already solved by block_write_full_page() /
> end_buffer_async_write(), so we mimic their behavior here. We mark buffers
> going to disk with the BH_Async_Write flag, and in ext4_end_bio() we check
> whether any buffers with the BH_Async_Write flag are left. If there are
> none, we can call end_page_writeback().
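To make the two-phase scheme above concrete, here is a minimal userspace sketch (plain C, not kernel code; `page_sim`, `mark_all` and `end_buf_io` are hypothetical stand-ins for the page/buffer_head machinery): all buffers are flagged before any IO is submitted, and page writeback ends only when the completion handler finds no flagged buffer left.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature: a "page" backed by 4 buffers, each with a
 * flag standing in for BH_Async_Write. */
#define NBUFS 4

struct buf { bool async_write; };

struct page_sim {
	struct buf bufs[NBUFS];
	bool writeback;           /* stands in for PageWriteback */
};

/* Phase 1: mark every buffer going to disk BEFORE submitting any IO,
 * so a fast completion cannot end page writeback while later buffers
 * are still being submitted. */
static void mark_all(struct page_sim *p)
{
	p->writeback = true;
	for (int i = 0; i < NBUFS; i++)
		p->bufs[i].async_write = true;
}

/* Completion for one buffer: clear its flag, and end page writeback
 * only if no buffer is still marked (mimics the check in the end_io
 * handler of the patch). */
static void end_buf_io(struct page_sim *p, int i)
{
	p->bufs[i].async_write = false;
	for (int j = 0; j < NBUFS; j++)
		if (p->bufs[j].async_write)
			return;   /* some IO still in flight */
	p->writeback = false;     /* last completion ends writeback */
}
```

In this toy model, completing three of four buffers in any order leaves `writeback` set; only the fourth completion clears it, which is exactly why the marking pass must finish before the first submission.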
Definitely looks better than my semi-fix.
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/ext4.h    |   14 -----
>  fs/ext4/page-io.c |  163 +++++++++++++++++++++++++----------------------------
>  2 files changed, 77 insertions(+), 100 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 4a01ba3..3c70547 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -196,19 +196,8 @@ struct mpage_da_data {
>  #define EXT4_IO_END_ERROR	0x0002
>  #define EXT4_IO_END_DIRECT	0x0004
>  
> -struct ext4_io_page {
> -	struct page	*p_page;
> -	atomic_t	p_count;
> -};
> -
> -#define MAX_IO_PAGES 128
> -
>  /*
>   * For converting uninitialized extents on a work queue.
> - *
> - * 'page' is only used from the writepage() path; 'pages' is only used for
> - * buffered writes; they are used to keep page references until conversion
> - * takes place.  For AIO/DIO, neither field is filled in.
>   */
>  typedef struct ext4_io_end {
>  	struct list_head	list;		/* per-file finished IO list */
> @@ -218,15 +207,12 @@ typedef struct ext4_io_end {
>  	ssize_t			size;		/* size of the extent */
>  	struct kiocb		*iocb;		/* iocb struct for AIO */
>  	int			result;		/* error value for AIO */
> -	int			num_io_pages;   /* for writepages() */
> -	struct ext4_io_page	*pages[MAX_IO_PAGES]; /* for writepages() */
>  } ext4_io_end_t;
>  
>  struct ext4_io_submit {
>  	int			io_op;
>  	struct bio		*io_bio;
>  	ext4_io_end_t		*io_end;
> -	struct ext4_io_page	*io_page;
>  	sector_t		io_next_block;
>  };
>  
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 809b310..73bc011 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -29,25 +29,19 @@
>  #include "xattr.h"
>  #include "acl.h"
>  
> -static struct kmem_cache *io_page_cachep, *io_end_cachep;
> +static struct kmem_cache *io_end_cachep;
>  
>  int __init ext4_init_pageio(void)
>  {
> -	io_page_cachep = KMEM_CACHE(ext4_io_page, SLAB_RECLAIM_ACCOUNT);
> -	if (io_page_cachep == NULL)
> -		return -ENOMEM;
>  	io_end_cachep = KMEM_CACHE(ext4_io_end, SLAB_RECLAIM_ACCOUNT);
> -	if (io_end_cachep == NULL) {
> -		kmem_cache_destroy(io_page_cachep);
> +	if (io_end_cachep == NULL)
>  		return -ENOMEM;
> -	}
>  	return 0;
>  }
>  
>  void ext4_exit_pageio(void)
>  {
>  	kmem_cache_destroy(io_end_cachep);
> -	kmem_cache_destroy(io_page_cachep);
>  }
>  
>  void ext4_ioend_wait(struct inode *inode)
> @@ -57,15 +51,6 @@ void ext4_ioend_wait(struct inode *inode)
>  	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
>  }
>  
> -static void put_io_page(struct ext4_io_page *io_page)
> -{
> -	if (atomic_dec_and_test(&io_page->p_count)) {
> -		end_page_writeback(io_page->p_page);
> -		put_page(io_page->p_page);
> -		kmem_cache_free(io_page_cachep, io_page);
> -	}
> -}
> -
>  void ext4_free_io_end(ext4_io_end_t *io)
>  {
>  	int i;
> @@ -74,9 +59,6 @@ void ext4_free_io_end(ext4_io_end_t *io)
>  	BUG_ON(!list_empty(&io->list));
>  	BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
>  
> -	for (i = 0; i < io->num_io_pages; i++)
> -		put_io_page(io->pages[i]);
> -	io->num_io_pages = 0;
>  	if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count))
>  		wake_up_all(ext4_ioend_wq(io->inode));
>  	kmem_cache_free(io_end_cachep, io);
> @@ -233,45 +215,56 @@ static void ext4_end_bio(struct bio *bio, int error)
>  	ext4_io_end_t *io_end = bio->bi_private;
>  	struct inode *inode;
>  	int i;
> +	int blocksize;
>  	sector_t bi_sector = bio->bi_sector;
>  
>  	BUG_ON(!io_end);
> +	inode = io_end->inode;
> +	blocksize = 1 << inode->i_blkbits;
>  	bio->bi_private = NULL;
>  	bio->bi_end_io = NULL;
>  	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
>  		error = 0;
> -	bio_put(bio);
> -
> -	for (i = 0; i < io_end->num_io_pages; i++) {
> -		struct page *page = io_end->pages[i]->p_page;
> +	for (i = 0; i < bio->bi_vcnt; i++) {
> +		struct bio_vec *bvec = &bio->bi_io_vec[i];
> +		struct page *page = bvec->bv_page;
>  		struct buffer_head *bh, *head;
> -		loff_t offset;
> -		loff_t io_end_offset;
> +		unsigned bio_start = bvec->bv_offset;
> +		unsigned bio_end = bio_start + bvec->bv_len;
> +		unsigned under_io = 0;
> +		unsigned long flags;
> +
> +		if (!page)
> +			continue;
>  
>  		if (error) {
>  			SetPageError(page);
>  			set_bit(AS_EIO, &page->mapping->flags);
> -			head = page_buffers(page);
> -			BUG_ON(!head);
> -
> -			io_end_offset = io_end->offset + io_end->size;
> -
> -			offset = (sector_t) page->index << PAGE_CACHE_SHIFT;
> -			bh = head;
> -			do {
> -				if ((offset >= io_end->offset) &&
> -				    (offset+bh->b_size <= io_end_offset))
> -					buffer_io_error(bh);
> -
> -				offset += bh->b_size;
> -				bh = bh->b_this_page;
> -			} while (bh != head);
>  		}
> -
> -		put_io_page(io_end->pages[i]);
> +		bh = head = page_buffers(page);
> +		/*
> +		 * We check all buffers in the page under BH_Uptodate_Lock
> +		 * to avoid races with other end io clearing async_write flags
> +		 */
> +		local_irq_save(flags);
> +		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
> +		do {
> +			if (bh_offset(bh) < bio_start ||
> +			    bh_offset(bh) + blocksize > bio_end) {
> +				if (buffer_async_write(bh))
> +					under_io++;
> +				continue;
> +			}
> +			clear_buffer_async_write(bh);
> +			if (error)
> +				buffer_io_error(bh);
> +		} while ((bh = bh->b_this_page) != head);
> +		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> +		local_irq_restore(flags);
> +		if (!under_io)
> +			end_page_writeback(page);
>  	}
> -	io_end->num_io_pages = 0;
> -	inode = io_end->inode;
> +	bio_put(bio);
>  
>  	if (error) {
>  		io_end->flag |= EXT4_IO_END_ERROR;
> @@ -335,7 +328,6 @@ static int io_submit_init(struct ext4_io_submit *io,
>  }
>  
>  static int io_submit_add_bh(struct ext4_io_submit *io,
> -			    struct ext4_io_page *io_page,
>  			    struct inode *inode,
>  			    struct writeback_control *wbc,
>  			    struct buffer_head *bh)
> @@ -343,11 +335,6 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
>  	ext4_io_end_t *io_end;
>  	int ret;
>  
> -	if (buffer_new(bh)) {
> -		clear_buffer_new(bh);
> -		unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> -	}
> -
>  	if (io->io_bio && bh->b_blocknr != io->io_next_block) {
>  submit_and_retry:
>  		ext4_io_submit(io);
> @@ -358,9 +345,6 @@ submit_and_retry:
>  			return ret;
>  	}
>  	io_end = io->io_end;
> -	if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
> -	    (io_end->pages[io_end->num_io_pages-1] != io_page))
> -		goto submit_and_retry;
>  	if (buffer_uninit(bh))
>  		ext4_set_io_unwritten_flag(inode, io_end);
>  	io->io_end->size += bh->b_size;
> @@ -368,11 +352,6 @@ submit_and_retry:
>  	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
>  	if (ret != bh->b_size)
>  		goto submit_and_retry;
> -	if ((io_end->num_io_pages == 0) ||
> -	    (io_end->pages[io_end->num_io_pages-1] != io_page)) {
> -		io_end->pages[io_end->num_io_pages++] = io_page;
> -		atomic_inc(&io_page->p_count);
> -	}
>  	return 0;
>  }
>  
> @@ -382,33 +361,29 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  			struct writeback_control *wbc)
>  {
>  	struct inode *inode = page->mapping->host;
> -	unsigned block_start, block_end, blocksize;
> -	struct ext4_io_page *io_page;
> +	unsigned block_start, blocksize;
>  	struct buffer_head *bh, *head;
>  	int ret = 0;
> +	int nr_submitted = 0;
>  
>  	blocksize = 1 << inode->i_blkbits;
>  
>  	BUG_ON(!PageLocked(page));
>  	BUG_ON(PageWriteback(page));
>  
> -	io_page = kmem_cache_alloc(io_page_cachep, GFP_NOFS);
> -	if (!io_page) {
> -		redirty_page_for_writepage(wbc, page);
> -		unlock_page(page);
> -		return -ENOMEM;
> -	}
> -	io_page->p_page = page;
> -	atomic_set(&io_page->p_count, 1);
> -	get_page(page);
>  	set_page_writeback(page);
>  	ClearPageError(page);
>  
> -	for (bh = head = page_buffers(page), block_start = 0;
> -	     bh != head || !block_start;
> -	     block_start = block_end, bh = bh->b_this_page) {
> -
> -		block_end = block_start + blocksize;
> +	/*
> +	 * In the first loop we prepare and mark buffers to submit. We have to
> +	 * mark all buffers in the page before submitting so that
> +	 * end_page_writeback() cannot be called from ext4_bio_end_io() when IO
> +	 * on the first buffer finishes and we are still working on submitting
> +	 * the second buffer.
> +	 */
> +	bh = head = page_buffers(page);
> +	do {
> +		block_start = bh_offset(bh);
>  		if (block_start >= len) {
>  			/*
>  			 * Comments copied from block_write_full_page_endio:
> @@ -421,7 +396,8 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  			 * mapped, and writes to that region are not written
>  			 * out to the file."
>  			 */
> -			zero_user_segment(page, block_start, block_end);
> +			zero_user_segment(page, block_start,
> +					  block_start + blocksize);
>  			clear_buffer_dirty(bh);
>  			set_buffer_uptodate(bh);
>  			continue;
> @@ -435,7 +411,19 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  				ext4_io_submit(io);
>  			continue;
>  		}
> -		ret = io_submit_add_bh(io, io_page, inode, wbc, bh);
> +		if (buffer_new(bh)) {
> +			clear_buffer_new(bh);
> +			unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> +		}
> +		set_buffer_async_write(bh);
> +	} while ((bh = bh->b_this_page) != head);
> +
> +	/* Now submit buffers to write */
> +	bh = head = page_buffers(page);
> +	do {
> +		if (!buffer_async_write(bh))
> +			continue;
> +		ret = io_submit_add_bh(io, inode, wbc, bh);
>  		if (ret) {
>  			/*
>  			 * We only get here on ENOMEM.  Not much else
> @@ -445,17 +433,20 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  			redirty_page_for_writepage(wbc, page);
>  			break;
>  		}
> +		nr_submitted++;
>  		clear_buffer_dirty(bh);
> +	} while ((bh = bh->b_this_page) != head);
> +
> +	/* Error stopped previous loop? Clean up buffers... */
> +	if (ret) {
> +		do {
> +			clear_buffer_async_write(bh);
> +			bh = bh->b_this_page;
> +		} while (bh != head);
>  	}
>  	unlock_page(page);
> -	/*
> -	 * If the page was truncated before we could do the writeback,
> -	 * or we had a memory allocation error while trying to write
> -	 * the first buffer head, we won't have submitted any pages for
> -	 * I/O.  In that case we need to make sure we've cleared the
> -	 * PageWriteback bit from the page to prevent the system from
> -	 * wedging later on.
> -	 */
> -	put_io_page(io_page);
> +	/* Nothing submitted - we have to end page writeback */
> +	if (!nr_submitted)
> +		end_page_writeback(page);
>  	return ret;
>  }
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 02/29] ext4: Use io_end for multiple bios
  2013-04-08 21:32 ` [PATCH 02/29] ext4: Use io_end for multiple bios Jan Kara
@ 2013-04-11  5:10   ` Dmitry Monakhov
  2013-04-11 14:04   ` Zheng Liu
  2013-04-12  3:55   ` Theodore Ts'o
  2 siblings, 0 replies; 76+ messages in thread
From: Dmitry Monakhov @ 2013-04-11  5:10 UTC (permalink / raw)
  To: Jan Kara, Ted Tso; +Cc: linux-ext4

On Mon,  8 Apr 2013 23:32:07 +0200, Jan Kara <jack@suse.cz> wrote:
> Change the writeback path to create just one io_end structure for the
> extent to which we submit IO, and share it among the bios writing that
> extent. This prevents needless splitting and joining of unwritten extents
> when the extent cannot be submitted as a single bio.
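The refcounting this patch introduces can be sketched in miniature (plain C, not kernel code; `io_end_sim` and its helpers are hypothetical stand-ins for ext4_get_io_end()/ext4_put_io_end()): the submitter holds one reference, each bio takes another, and the unwritten-extent conversion runs exactly once, when the last reference is dropped.

```c
#include <assert.h>

/* Hypothetical miniature of a shared io_end: one structure covers
 * several bios; "conversion" must run once, after the last drop. */
struct io_end_sim {
	int count;        /* stands in for atomic_t count */
	int conversions;  /* how many times conversion ran */
};

static void io_end_init(struct io_end_sim *io)
{
	io->count = 1;    /* submitter's reference, as on allocation */
	io->conversions = 0;
}

/* Each bio grabs its own reference before submission. */
static struct io_end_sim *io_end_get(struct io_end_sim *io)
{
	io->count++;
	return io;
}

/* Drop a reference; only the final put triggers the (simulated)
 * unwritten-extent conversion and release. */
static void io_end_put(struct io_end_sim *io)
{
	if (--io->count == 0)
		io->conversions++;
}
```

With two bios sharing one io_end, neither bio completion triggers conversion on its own; it fires only when the submitter also drops its initial reference, mirroring why the patch can share one structure across an extent split into several bios.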
Looks good.
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/ext4.h    |    8 +++-
>  fs/ext4/inode.c   |   85 ++++++++++++++++++++-----------------
>  fs/ext4/page-io.c |  121 +++++++++++++++++++++++++++++++++--------------------
>  3 files changed, 128 insertions(+), 86 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3c70547..edf9b9e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -207,6 +207,7 @@ typedef struct ext4_io_end {
>  	ssize_t			size;		/* size of the extent */
>  	struct kiocb		*iocb;		/* iocb struct for AIO */
>  	int			result;		/* error value for AIO */
> +	atomic_t		count;		/* reference counter */
>  } ext4_io_end_t;
>  
>  struct ext4_io_submit {
> @@ -2601,11 +2602,14 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>  
>  /* page-io.c */
>  extern int __init ext4_init_pageio(void);
> -extern void ext4_add_complete_io(ext4_io_end_t *io_end);
>  extern void ext4_exit_pageio(void);
>  extern void ext4_ioend_wait(struct inode *);
> -extern void ext4_free_io_end(ext4_io_end_t *io);
>  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> +extern ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end);
> +extern int ext4_put_io_end(ext4_io_end_t *io_end);
> +extern void ext4_put_io_end_defer(ext4_io_end_t *io_end);
> +extern void ext4_io_submit_init(struct ext4_io_submit *io,
> +				struct writeback_control *wbc);
>  extern void ext4_end_io_work(struct work_struct *work);
>  extern void ext4_io_submit(struct ext4_io_submit *io);
>  extern int ext4_bio_write_page(struct ext4_io_submit *io,
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6b8ec2a..ba07412 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1409,7 +1409,10 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
>  	struct ext4_io_submit io_submit;
>  
>  	BUG_ON(mpd->next_page <= mpd->first_page);
> -	memset(&io_submit, 0, sizeof(io_submit));
> +	ext4_io_submit_init(&io_submit, mpd->wbc);
> +	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
> +	if (!io_submit.io_end)
> +		return -ENOMEM;
>  	/*
>  	 * We need to start from the first_page to the next_page - 1
>  	 * to make sure we also write the mapped dirty buffer_heads.
> @@ -1497,6 +1500,8 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
>  		pagevec_release(&pvec);
>  	}
>  	ext4_io_submit(&io_submit);
> +	/* Drop io_end reference we got from init */
> +	ext4_put_io_end_defer(io_submit.io_end);
>  	return ret;
>  }
>  
> @@ -2116,9 +2121,16 @@ static int ext4_writepage(struct page *page,
>  		 */
>  		return __ext4_journalled_writepage(page, len);
>  
> -	memset(&io_submit, 0, sizeof(io_submit));
> +	ext4_io_submit_init(&io_submit, wbc);
> +	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
> +	if (!io_submit.io_end) {
> +		redirty_page_for_writepage(wbc, page);
> +		return -ENOMEM;
> +	}
>  	ret = ext4_bio_write_page(&io_submit, page, len, wbc);
>  	ext4_io_submit(&io_submit);
> +	/* Drop io_end reference we got from init */
> +	ext4_put_io_end_defer(io_submit.io_end);
>  	return ret;
>  }
>  
> @@ -2957,9 +2969,13 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
>  	struct inode *inode = file_inode(iocb->ki_filp);
>          ext4_io_end_t *io_end = iocb->private;
>  
> -	/* if not async direct IO or dio with 0 bytes write, just return */
> -	if (!io_end || !size)
> -		goto out;
> +	/* if not async direct IO just return */
> +	if (!io_end) {
> +		inode_dio_done(inode);
> +		if (is_async)
> +			aio_complete(iocb, ret, 0);
> +		return;
> +	}
>  
>  	ext_debug("ext4_end_io_dio(): io_end 0x%p "
>  		  "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
> @@ -2967,25 +2983,13 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
>  		  size);
>  
>  	iocb->private = NULL;
> -
> -	/* if not aio dio with unwritten extents, just free io and return */
> -	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> -		ext4_free_io_end(io_end);
> -out:
> -		inode_dio_done(inode);
> -		if (is_async)
> -			aio_complete(iocb, ret, 0);
> -		return;
> -	}
> -
>  	io_end->offset = offset;
>  	io_end->size = size;
>  	if (is_async) {
>  		io_end->iocb = iocb;
>  		io_end->result = ret;
>  	}
> -
> -	ext4_add_complete_io(io_end);
> +	ext4_put_io_end_defer(io_end);
>  }
>  
>  /*
> @@ -3019,6 +3023,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  	get_block_t *get_block_func = NULL;
>  	int dio_flags = 0;
>  	loff_t final_size = offset + count;
> +	ext4_io_end_t *io_end = NULL;
>  
>  	/* Use the old path for reads and writes beyond i_size. */
>  	if (rw != WRITE || final_size > inode->i_size)
> @@ -3057,13 +3062,16 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  	iocb->private = NULL;
>  	ext4_inode_aio_set(inode, NULL);
>  	if (!is_sync_kiocb(iocb)) {
> -		ext4_io_end_t *io_end = ext4_init_io_end(inode, GFP_NOFS);
> +		io_end = ext4_init_io_end(inode, GFP_NOFS);
>  		if (!io_end) {
>  			ret = -ENOMEM;
>  			goto retake_lock;
>  		}
>  		io_end->flag |= EXT4_IO_END_DIRECT;
> -		iocb->private = io_end;
> +		/*
> +		 * Grab reference for DIO. Will be dropped in ext4_end_io_dio()
> +		 */
> +		iocb->private = ext4_get_io_end(io_end);
>  		/*
>  		 * we save the io structure for current async direct
>  		 * IO, so that later ext4_map_blocks() could flag the
> @@ -3087,26 +3095,27 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  				   NULL,
>  				   dio_flags);
>  
> -	if (iocb->private)
> -		ext4_inode_aio_set(inode, NULL);
>  	/*
> -	 * The io_end structure takes a reference to the inode, that
> -	 * structure needs to be destroyed and the reference to the
> -	 * inode need to be dropped, when IO is complete, even with 0
> -	 * byte write, or failed.
> -	 *
> -	 * In the successful AIO DIO case, the io_end structure will
> -	 * be destroyed and the reference to the inode will be dropped
> -	 * after the end_io call back function is called.
> -	 *
> -	 * In the case there is 0 byte write, or error case, since VFS
> -	 * direct IO won't invoke the end_io call back function, we
> -	 * need to free the end_io structure here.
> +	 * Put our reference to io_end. This can free the io_end structure e.g.
> +	 * in sync IO case or in case of error. It can even perform extent
> +	 * conversion if all bios we submitted finished before we got here.
> +	 * Note that in that case iocb->private can be already set to NULL
> +	 * here.
>  	 */
> -	if (ret != -EIOCBQUEUED && ret <= 0 && iocb->private) {
> -		ext4_free_io_end(iocb->private);
> -		iocb->private = NULL;
> -	} else if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
> +	if (io_end) {
> +		ext4_inode_aio_set(inode, NULL);
> +		ext4_put_io_end(io_end);
> +		/*
> +		 * In case of error or no write ext4_end_io_dio() was not
> +		 * called so we have to put iocb's reference.
> +		 */
> +		if (ret <= 0 && ret != -EIOCBQUEUED) {
> +			WARN_ON(iocb->private != io_end);
> +			ext4_put_io_end(io_end);
> +			iocb->private = NULL;
> +		}
> +	}
> +	if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
>  						EXT4_STATE_DIO_UNWRITTEN)) {
>  		int err;
>  		/*
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 73bc011..da8bddf 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -51,17 +51,28 @@ void ext4_ioend_wait(struct inode *inode)
>  	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
>  }
>  
> -void ext4_free_io_end(ext4_io_end_t *io)
> +static void ext4_release_io_end(ext4_io_end_t *io_end)
>  {
> -	int i;
> +	BUG_ON(!list_empty(&io_end->list));
> +	BUG_ON(io_end->flag & EXT4_IO_END_UNWRITTEN);
> +
> +	if (atomic_dec_and_test(&EXT4_I(io_end->inode)->i_ioend_count))
> +		wake_up_all(ext4_ioend_wq(io_end->inode));
> +	if (io_end->flag & EXT4_IO_END_DIRECT)
> +		inode_dio_done(io_end->inode);
> +	if (io_end->iocb)
> +		aio_complete(io_end->iocb, io_end->result, 0);
> +	kmem_cache_free(io_end_cachep, io_end);
> +}
>  
> -	BUG_ON(!io);
> -	BUG_ON(!list_empty(&io->list));
> -	BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
> +static void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> +{
> +	struct inode *inode = io_end->inode;
>  
> -	if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count))
> -		wake_up_all(ext4_ioend_wq(io->inode));
> -	kmem_cache_free(io_end_cachep, io);
> +	io_end->flag &= ~EXT4_IO_END_UNWRITTEN;
> +	/* Wake up anyone waiting on unwritten extent conversion */
> +	if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
> +		wake_up_all(ext4_ioend_wq(inode));
>  }
>  
>  /* check a range of space and convert unwritten extents to written. */
> @@ -84,13 +95,8 @@ static int ext4_end_io(ext4_io_end_t *io)
>  			 "(inode %lu, offset %llu, size %zd, error %d)",
>  			 inode->i_ino, offset, size, ret);
>  	}
> -	/* Wake up anyone waiting on unwritten extent conversion */
> -	if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
> -		wake_up_all(ext4_ioend_wq(inode));
> -	if (io->flag & EXT4_IO_END_DIRECT)
> -		inode_dio_done(inode);
> -	if (io->iocb)
> -		aio_complete(io->iocb, io->result, 0);
> +	ext4_clear_io_unwritten_flag(io);
> +	ext4_release_io_end(io);
>  	return ret;
>  }
>  
> @@ -121,7 +127,7 @@ static void dump_completed_IO(struct inode *inode)
>  }
>  
>  /* Add the io_end to per-inode completed end_io list. */
> -void ext4_add_complete_io(ext4_io_end_t *io_end)
> +static void ext4_add_complete_io(ext4_io_end_t *io_end)
>  {
>  	struct ext4_inode_info *ei = EXT4_I(io_end->inode);
>  	struct workqueue_struct *wq;
> @@ -158,8 +164,6 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
>  		err = ext4_end_io(io);
>  		if (unlikely(!ret && err))
>  			ret = err;
> -		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> -		ext4_free_io_end(io);
>  	}
>  	return ret;
>  }
> @@ -191,10 +195,43 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
>  		atomic_inc(&EXT4_I(inode)->i_ioend_count);
>  		io->inode = inode;
>  		INIT_LIST_HEAD(&io->list);
> +		atomic_set(&io->count, 1);
>  	}
>  	return io;
>  }
>  
> +void ext4_put_io_end_defer(ext4_io_end_t *io_end)
> +{
> +	if (atomic_dec_and_test(&io_end->count)) {
> +		if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) || !io_end->size) {
> +			ext4_release_io_end(io_end);
> +			return;
> +		}
> +		ext4_add_complete_io(io_end);
> +	}
> +}
> +
> +int ext4_put_io_end(ext4_io_end_t *io_end)
> +{
> +	int err = 0;
> +
> +	if (atomic_dec_and_test(&io_end->count)) {
> +		if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
> +			err = ext4_convert_unwritten_extents(io_end->inode,
> +						io_end->offset, io_end->size);
> +			ext4_clear_io_unwritten_flag(io_end);
> +		}
> +		ext4_release_io_end(io_end);
> +	}
> +	return err;
> +}
> +
> +ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end)
> +{
> +	atomic_inc(&io_end->count);
> +	return io_end;
> +}
> +
>  /*
>   * Print an buffer I/O error compatible with the fs/buffer.c.  This
>   * provides compatibility with dmesg scrapers that look for a specific
> @@ -277,12 +314,7 @@ static void ext4_end_bio(struct bio *bio, int error)
>  			     bi_sector >> (inode->i_blkbits - 9));
>  	}
>  
> -	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> -		ext4_free_io_end(io_end);
> -		return;
> -	}
> -
> -	ext4_add_complete_io(io_end);
> +	ext4_put_io_end_defer(io_end);
>  }
>  
>  void ext4_io_submit(struct ext4_io_submit *io)
> @@ -296,40 +328,37 @@ void ext4_io_submit(struct ext4_io_submit *io)
>  		bio_put(io->io_bio);
>  	}
>  	io->io_bio = NULL;
> -	io->io_op = 0;
> +}
> +
> +void ext4_io_submit_init(struct ext4_io_submit *io,
> +			 struct writeback_control *wbc)
> +{
> +	io->io_op = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
> +	io->io_bio = NULL;
>  	io->io_end = NULL;
>  }
>  
> -static int io_submit_init(struct ext4_io_submit *io,
> -			  struct inode *inode,
> -			  struct writeback_control *wbc,
> -			  struct buffer_head *bh)
> +static int io_submit_init_bio(struct ext4_io_submit *io,
> +			      struct buffer_head *bh)
>  {
> -	ext4_io_end_t *io_end;
> -	struct page *page = bh->b_page;
>  	int nvecs = bio_get_nr_vecs(bh->b_bdev);
>  	struct bio *bio;
>  
> -	io_end = ext4_init_io_end(inode, GFP_NOFS);
> -	if (!io_end)
> -		return -ENOMEM;
>  	bio = bio_alloc(GFP_NOIO, min(nvecs, BIO_MAX_PAGES));
>  	bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
>  	bio->bi_bdev = bh->b_bdev;
> -	bio->bi_private = io->io_end = io_end;
>  	bio->bi_end_io = ext4_end_bio;
> -
> -	io_end->offset = (page->index << PAGE_CACHE_SHIFT) + bh_offset(bh);
> -
> +	bio->bi_private = ext4_get_io_end(io->io_end);
> +	if (!io->io_end->size)
> +		io->io_end->offset = (bh->b_page->index << PAGE_CACHE_SHIFT)
> +				     + bh_offset(bh);
>  	io->io_bio = bio;
> -	io->io_op = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
>  	io->io_next_block = bh->b_blocknr;
>  	return 0;
>  }
>  
>  static int io_submit_add_bh(struct ext4_io_submit *io,
>  			    struct inode *inode,
> -			    struct writeback_control *wbc,
>  			    struct buffer_head *bh)
>  {
>  	ext4_io_end_t *io_end;
> @@ -340,18 +369,18 @@ submit_and_retry:
>  		ext4_io_submit(io);
>  	}
>  	if (io->io_bio == NULL) {
> -		ret = io_submit_init(io, inode, wbc, bh);
> +		ret = io_submit_init_bio(io, bh);
>  		if (ret)
>  			return ret;
>  	}
> +	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
> +	if (ret != bh->b_size)
> +		goto submit_and_retry;
>  	io_end = io->io_end;
>  	if (buffer_uninit(bh))
>  		ext4_set_io_unwritten_flag(inode, io_end);
> -	io->io_end->size += bh->b_size;
> +	io_end->size += bh->b_size;
>  	io->io_next_block++;
> -	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
> -	if (ret != bh->b_size)
> -		goto submit_and_retry;
>  	return 0;
>  }
>  
> @@ -423,7 +452,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  	do {
>  		if (!buffer_async_write(bh))
>  			continue;
> -		ret = io_submit_add_bh(io, inode, wbc, bh);
> +		ret = io_submit_add_bh(io, inode, bh);
>  		if (ret) {
>  			/*
>  			 * We only get here on ENOMEM.  Not much else
> -- 
> 1.7.1
> 


* Re: [PATCH 01/29] ext4: Make ext4_bio_write_page() use BH_Async_Write flags instead of page pointers from ext4_io_end
  2013-04-08 21:32 ` [PATCH 01/29] ext4: Make ext4_bio_write_page() use BH_Async_Write flags instead of page pointers from ext4_io_end Jan Kara
  2013-04-10 18:05   ` Dmitry Monakhov
@ 2013-04-11 13:38   ` Zheng Liu
  2013-04-12  3:50   ` Theodore Ts'o
  2 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-04-11 13:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:06PM +0200, Jan Kara wrote:
> So far ext4_bio_write_page() attached all the pages to the ext4_io_end
> structure. This makes that structure pretty heavy (1 KB for the pointer
> array + 16 bytes per page attached to the bio). Also, later we would like
> to share one ext4_io_end structure among several bios in case IO to a
> single extent needs to be split among several bios, and pointing to pages
> from ext4_io_end makes this complex.
> 
> We remove the page pointers from ext4_io_end and use the pointers from the
> bio itself instead. This isn't as easy when blocksize < pagesize, because
> then we can have several bios in flight for a single page and we have to be
> careful about when to call end_page_writeback(). However, this is a known
> problem already solved by block_write_full_page() /
> end_buffer_async_write(), so we mimic their behavior here. We mark buffers
> going to disk with the BH_Async_Write flag, and in ext4_end_bio() we check
> whether any buffers with the BH_Async_Write flag are left. If there are
> none, we can call end_page_writeback().
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/ext4/ext4.h    |   14 -----
>  fs/ext4/page-io.c |  163 +++++++++++++++++++++++++----------------------------
>  2 files changed, 77 insertions(+), 100 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 4a01ba3..3c70547 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -196,19 +196,8 @@ struct mpage_da_data {
>  #define EXT4_IO_END_ERROR	0x0002
>  #define EXT4_IO_END_DIRECT	0x0004
>  
> -struct ext4_io_page {
> -	struct page	*p_page;
> -	atomic_t	p_count;
> -};
> -
> -#define MAX_IO_PAGES 128
> -
>  /*
>   * For converting uninitialized extents on a work queue.
> - *
> - * 'page' is only used from the writepage() path; 'pages' is only used for
> - * buffered writes; they are used to keep page references until conversion
> - * takes place.  For AIO/DIO, neither field is filled in.
>   */
>  typedef struct ext4_io_end {
>  	struct list_head	list;		/* per-file finished IO list */
> @@ -218,15 +207,12 @@ typedef struct ext4_io_end {
>  	ssize_t			size;		/* size of the extent */
>  	struct kiocb		*iocb;		/* iocb struct for AIO */
>  	int			result;		/* error value for AIO */
> -	int			num_io_pages;   /* for writepages() */
> -	struct ext4_io_page	*pages[MAX_IO_PAGES]; /* for writepages() */
>  } ext4_io_end_t;
>  
>  struct ext4_io_submit {
>  	int			io_op;
>  	struct bio		*io_bio;
>  	ext4_io_end_t		*io_end;
> -	struct ext4_io_page	*io_page;
>  	sector_t		io_next_block;
>  };
>  
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 809b310..73bc011 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -29,25 +29,19 @@
>  #include "xattr.h"
>  #include "acl.h"
>  
> -static struct kmem_cache *io_page_cachep, *io_end_cachep;
> +static struct kmem_cache *io_end_cachep;
>  
>  int __init ext4_init_pageio(void)
>  {
> -	io_page_cachep = KMEM_CACHE(ext4_io_page, SLAB_RECLAIM_ACCOUNT);
> -	if (io_page_cachep == NULL)
> -		return -ENOMEM;
>  	io_end_cachep = KMEM_CACHE(ext4_io_end, SLAB_RECLAIM_ACCOUNT);
> -	if (io_end_cachep == NULL) {
> -		kmem_cache_destroy(io_page_cachep);
> +	if (io_end_cachep == NULL)
>  		return -ENOMEM;
> -	}
>  	return 0;
>  }
>  
>  void ext4_exit_pageio(void)
>  {
>  	kmem_cache_destroy(io_end_cachep);
> -	kmem_cache_destroy(io_page_cachep);
>  }
>  
>  void ext4_ioend_wait(struct inode *inode)
> @@ -57,15 +51,6 @@ void ext4_ioend_wait(struct inode *inode)
>  	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
>  }
>  
> -static void put_io_page(struct ext4_io_page *io_page)
> -{
> -	if (atomic_dec_and_test(&io_page->p_count)) {
> -		end_page_writeback(io_page->p_page);
> -		put_page(io_page->p_page);
> -		kmem_cache_free(io_page_cachep, io_page);
> -	}
> -}
> -
>  void ext4_free_io_end(ext4_io_end_t *io)
>  {
>  	int i;
> @@ -74,9 +59,6 @@ void ext4_free_io_end(ext4_io_end_t *io)
>  	BUG_ON(!list_empty(&io->list));
>  	BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
>  
> -	for (i = 0; i < io->num_io_pages; i++)
> -		put_io_page(io->pages[i]);
> -	io->num_io_pages = 0;
>  	if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count))
>  		wake_up_all(ext4_ioend_wq(io->inode));
>  	kmem_cache_free(io_end_cachep, io);
> @@ -233,45 +215,56 @@ static void ext4_end_bio(struct bio *bio, int error)
>  	ext4_io_end_t *io_end = bio->bi_private;
>  	struct inode *inode;
>  	int i;
> +	int blocksize;
>  	sector_t bi_sector = bio->bi_sector;
>  
>  	BUG_ON(!io_end);
> +	inode = io_end->inode;
> +	blocksize = 1 << inode->i_blkbits;
>  	bio->bi_private = NULL;
>  	bio->bi_end_io = NULL;
>  	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
>  		error = 0;
> -	bio_put(bio);
> -
> -	for (i = 0; i < io_end->num_io_pages; i++) {
> -		struct page *page = io_end->pages[i]->p_page;
> +	for (i = 0; i < bio->bi_vcnt; i++) {
> +		struct bio_vec *bvec = &bio->bi_io_vec[i];
> +		struct page *page = bvec->bv_page;
>  		struct buffer_head *bh, *head;
> -		loff_t offset;
> -		loff_t io_end_offset;
> +		unsigned bio_start = bvec->bv_offset;
> +		unsigned bio_end = bio_start + bvec->bv_len;
> +		unsigned under_io = 0;
> +		unsigned long flags;
> +
> +		if (!page)
> +			continue;
>  
>  		if (error) {
>  			SetPageError(page);
>  			set_bit(AS_EIO, &page->mapping->flags);
> -			head = page_buffers(page);
> -			BUG_ON(!head);
> -
> -			io_end_offset = io_end->offset + io_end->size;
> -
> -			offset = (sector_t) page->index << PAGE_CACHE_SHIFT;
> -			bh = head;
> -			do {
> -				if ((offset >= io_end->offset) &&
> -				    (offset+bh->b_size <= io_end_offset))
> -					buffer_io_error(bh);
> -
> -				offset += bh->b_size;
> -				bh = bh->b_this_page;
> -			} while (bh != head);
>  		}
> -
> -		put_io_page(io_end->pages[i]);
> +		bh = head = page_buffers(page);
> +		/*
> +		 * We check all buffers in the page under BH_Uptodate_Lock
> +		 * to avoid races with other end io clearing async_write flags
> +		 */
> +		local_irq_save(flags);
> +		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
> +		do {
> +			if (bh_offset(bh) < bio_start ||
> +			    bh_offset(bh) + blocksize > bio_end) {
> +				if (buffer_async_write(bh))
> +					under_io++;
> +				continue;
> +			}
> +			clear_buffer_async_write(bh);
> +			if (error)
> +				buffer_io_error(bh);
> +		} while ((bh = bh->b_this_page) != head);
> +		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> +		local_irq_restore(flags);
> +		if (!under_io)
> +			end_page_writeback(page);
>  	}
> -	io_end->num_io_pages = 0;
> -	inode = io_end->inode;
> +	bio_put(bio);
>  
>  	if (error) {
>  		io_end->flag |= EXT4_IO_END_ERROR;
> @@ -335,7 +328,6 @@ static int io_submit_init(struct ext4_io_submit *io,
>  }
>  
>  static int io_submit_add_bh(struct ext4_io_submit *io,
> -			    struct ext4_io_page *io_page,
>  			    struct inode *inode,
>  			    struct writeback_control *wbc,
>  			    struct buffer_head *bh)
> @@ -343,11 +335,6 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
>  	ext4_io_end_t *io_end;
>  	int ret;
>  
> -	if (buffer_new(bh)) {
> -		clear_buffer_new(bh);
> -		unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> -	}
> -
>  	if (io->io_bio && bh->b_blocknr != io->io_next_block) {
>  submit_and_retry:
>  		ext4_io_submit(io);
> @@ -358,9 +345,6 @@ submit_and_retry:
>  			return ret;
>  	}
>  	io_end = io->io_end;
> -	if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
> -	    (io_end->pages[io_end->num_io_pages-1] != io_page))
> -		goto submit_and_retry;
>  	if (buffer_uninit(bh))
>  		ext4_set_io_unwritten_flag(inode, io_end);
>  	io->io_end->size += bh->b_size;
> @@ -368,11 +352,6 @@ submit_and_retry:
>  	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
>  	if (ret != bh->b_size)
>  		goto submit_and_retry;
> -	if ((io_end->num_io_pages == 0) ||
> -	    (io_end->pages[io_end->num_io_pages-1] != io_page)) {
> -		io_end->pages[io_end->num_io_pages++] = io_page;
> -		atomic_inc(&io_page->p_count);
> -	}
>  	return 0;
>  }
>  
> @@ -382,33 +361,29 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  			struct writeback_control *wbc)
>  {
>  	struct inode *inode = page->mapping->host;
> -	unsigned block_start, block_end, blocksize;
> -	struct ext4_io_page *io_page;
> +	unsigned block_start, blocksize;
>  	struct buffer_head *bh, *head;
>  	int ret = 0;
> +	int nr_submitted = 0;
>  
>  	blocksize = 1 << inode->i_blkbits;
>  
>  	BUG_ON(!PageLocked(page));
>  	BUG_ON(PageWriteback(page));
>  
> -	io_page = kmem_cache_alloc(io_page_cachep, GFP_NOFS);
> -	if (!io_page) {
> -		redirty_page_for_writepage(wbc, page);
> -		unlock_page(page);
> -		return -ENOMEM;
> -	}
> -	io_page->p_page = page;
> -	atomic_set(&io_page->p_count, 1);
> -	get_page(page);
>  	set_page_writeback(page);
>  	ClearPageError(page);
>  
> -	for (bh = head = page_buffers(page), block_start = 0;
> -	     bh != head || !block_start;
> -	     block_start = block_end, bh = bh->b_this_page) {
> -
> -		block_end = block_start + blocksize;
> +	/*
> +	 * In the first loop we prepare and mark buffers to submit. We have to
> +	 * mark all buffers in the page before submitting so that
> +	 * end_page_writeback() cannot be called from ext4_bio_end_io() when IO
> +	 * on the first buffer finishes and we are still working on submitting
> +	 * the second buffer.
> +	 */
> +	bh = head = page_buffers(page);
> +	do {
> +		block_start = bh_offset(bh);
>  		if (block_start >= len) {
>  			/*
>  			 * Comments copied from block_write_full_page_endio:
> @@ -421,7 +396,8 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  			 * mapped, and writes to that region are not written
>  			 * out to the file."
>  			 */
> -			zero_user_segment(page, block_start, block_end);
> +			zero_user_segment(page, block_start,
> +					  block_start + blocksize);
>  			clear_buffer_dirty(bh);
>  			set_buffer_uptodate(bh);
>  			continue;
> @@ -435,7 +411,19 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  				ext4_io_submit(io);
>  			continue;
>  		}
> -		ret = io_submit_add_bh(io, io_page, inode, wbc, bh);
> +		if (buffer_new(bh)) {
> +			clear_buffer_new(bh);
> +			unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> +		}
> +		set_buffer_async_write(bh);
> +	} while ((bh = bh->b_this_page) != head);
> +
> +	/* Now submit buffers to write */
> +	bh = head = page_buffers(page);
> +	do {
> +		if (!buffer_async_write(bh))
> +			continue;
> +		ret = io_submit_add_bh(io, inode, wbc, bh);
>  		if (ret) {
>  			/*
>  			 * We only get here on ENOMEM.  Not much else
> @@ -445,17 +433,20 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  			redirty_page_for_writepage(wbc, page);
>  			break;
>  		}
> +		nr_submitted++;
>  		clear_buffer_dirty(bh);
> +	} while ((bh = bh->b_this_page) != head);
> +
> +	/* Error stopped previous loop? Clean up buffers... */
> +	if (ret) {
> +		do {
> +			clear_buffer_async_write(bh);
> +			bh = bh->b_this_page;
> +		} while (bh != head);
>  	}
>  	unlock_page(page);
> -	/*
> -	 * If the page was truncated before we could do the writeback,
> -	 * or we had a memory allocation error while trying to write
> -	 * the first buffer head, we won't have submitted any pages for
> -	 * I/O.  In that case we need to make sure we've cleared the
> -	 * PageWriteback bit from the page to prevent the system from
> -	 * wedging later on.
> -	 */
> -	put_io_page(io_page);
> +	/* Nothing submitted - we have to end page writeback */
> +	if (!nr_submitted)
> +		end_page_writeback(page);
>  	return ret;
>  }
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 02/29] ext4: Use io_end for multiple bios
  2013-04-08 21:32 ` [PATCH 02/29] ext4: Use io_end for multiple bios Jan Kara
  2013-04-11  5:10   ` Dmitry Monakhov
@ 2013-04-11 14:04   ` Zheng Liu
  2013-04-12  3:55   ` Theodore Ts'o
  2 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-04-11 14:04 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:07PM +0200, Jan Kara wrote:
> Change writeback path to create just one io_end structure for the extent
> to which we submit IO and share it among bios writing that extent. This
> prevents needless splitting and joining of unwritten extents when they
> cannot be submitted as a single bio.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/ext4/ext4.h    |    8 +++-
>  fs/ext4/inode.c   |   85 ++++++++++++++++++++-----------------
>  fs/ext4/page-io.c |  121 +++++++++++++++++++++++++++++++++--------------------
>  3 files changed, 128 insertions(+), 86 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3c70547..edf9b9e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -207,6 +207,7 @@ typedef struct ext4_io_end {
>  	ssize_t			size;		/* size of the extent */
>  	struct kiocb		*iocb;		/* iocb struct for AIO */
>  	int			result;		/* error value for AIO */
> +	atomic_t		count;		/* reference counter */
>  } ext4_io_end_t;
>  
>  struct ext4_io_submit {
> @@ -2601,11 +2602,14 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>  
>  /* page-io.c */
>  extern int __init ext4_init_pageio(void);
> -extern void ext4_add_complete_io(ext4_io_end_t *io_end);
>  extern void ext4_exit_pageio(void);
>  extern void ext4_ioend_wait(struct inode *);
> -extern void ext4_free_io_end(ext4_io_end_t *io);
>  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> +extern ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end);
> +extern int ext4_put_io_end(ext4_io_end_t *io_end);
> +extern void ext4_put_io_end_defer(ext4_io_end_t *io_end);
> +extern void ext4_io_submit_init(struct ext4_io_submit *io,
> +				struct writeback_control *wbc);
>  extern void ext4_end_io_work(struct work_struct *work);
>  extern void ext4_io_submit(struct ext4_io_submit *io);
>  extern int ext4_bio_write_page(struct ext4_io_submit *io,
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6b8ec2a..ba07412 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1409,7 +1409,10 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
>  	struct ext4_io_submit io_submit;
>  
>  	BUG_ON(mpd->next_page <= mpd->first_page);
> -	memset(&io_submit, 0, sizeof(io_submit));
> +	ext4_io_submit_init(&io_submit, mpd->wbc);
> +	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
> +	if (!io_submit.io_end)
> +		return -ENOMEM;
>  	/*
>  	 * We need to start from the first_page to the next_page - 1
>  	 * to make sure we also write the mapped dirty buffer_heads.
> @@ -1497,6 +1500,8 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
>  		pagevec_release(&pvec);
>  	}
>  	ext4_io_submit(&io_submit);
> +	/* Drop io_end reference we got from init */
> +	ext4_put_io_end_defer(io_submit.io_end);
>  	return ret;
>  }
>  
> @@ -2116,9 +2121,16 @@ static int ext4_writepage(struct page *page,
>  		 */
>  		return __ext4_journalled_writepage(page, len);
>  
> -	memset(&io_submit, 0, sizeof(io_submit));
> +	ext4_io_submit_init(&io_submit, wbc);
> +	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
> +	if (!io_submit.io_end) {
> +		redirty_page_for_writepage(wbc, page);
> +		return -ENOMEM;
> +	}
>  	ret = ext4_bio_write_page(&io_submit, page, len, wbc);
>  	ext4_io_submit(&io_submit);
> +	/* Drop io_end reference we got from init */
> +	ext4_put_io_end_defer(io_submit.io_end);
>  	return ret;
>  }
>  
> @@ -2957,9 +2969,13 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
>  	struct inode *inode = file_inode(iocb->ki_filp);
>          ext4_io_end_t *io_end = iocb->private;
>  
> -	/* if not async direct IO or dio with 0 bytes write, just return */
> -	if (!io_end || !size)
> -		goto out;
> +	/* if not async direct IO just return */
> +	if (!io_end) {
> +		inode_dio_done(inode);
> +		if (is_async)
> +			aio_complete(iocb, ret, 0);
> +		return;
> +	}
>  
>  	ext_debug("ext4_end_io_dio(): io_end 0x%p "
>  		  "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
> @@ -2967,25 +2983,13 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
>  		  size);
>  
>  	iocb->private = NULL;
> -
> -	/* if not aio dio with unwritten extents, just free io and return */
> -	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> -		ext4_free_io_end(io_end);
> -out:
> -		inode_dio_done(inode);
> -		if (is_async)
> -			aio_complete(iocb, ret, 0);
> -		return;
> -	}
> -
>  	io_end->offset = offset;
>  	io_end->size = size;
>  	if (is_async) {
>  		io_end->iocb = iocb;
>  		io_end->result = ret;
>  	}
> -
> -	ext4_add_complete_io(io_end);
> +	ext4_put_io_end_defer(io_end);
>  }
>  
>  /*
> @@ -3019,6 +3023,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  	get_block_t *get_block_func = NULL;
>  	int dio_flags = 0;
>  	loff_t final_size = offset + count;
> +	ext4_io_end_t *io_end = NULL;
>  
>  	/* Use the old path for reads and writes beyond i_size. */
>  	if (rw != WRITE || final_size > inode->i_size)
> @@ -3057,13 +3062,16 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  	iocb->private = NULL;
>  	ext4_inode_aio_set(inode, NULL);
>  	if (!is_sync_kiocb(iocb)) {
> -		ext4_io_end_t *io_end = ext4_init_io_end(inode, GFP_NOFS);
> +		io_end = ext4_init_io_end(inode, GFP_NOFS);
>  		if (!io_end) {
>  			ret = -ENOMEM;
>  			goto retake_lock;
>  		}
>  		io_end->flag |= EXT4_IO_END_DIRECT;
> -		iocb->private = io_end;
> +		/*
> +		 * Grab reference for DIO. Will be dropped in ext4_end_io_dio()
> +		 */
> +		iocb->private = ext4_get_io_end(io_end);
>  		/*
>  		 * we save the io structure for current async direct
>  		 * IO, so that later ext4_map_blocks() could flag the
> @@ -3087,26 +3095,27 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  				   NULL,
>  				   dio_flags);
>  
> -	if (iocb->private)
> -		ext4_inode_aio_set(inode, NULL);
>  	/*
> -	 * The io_end structure takes a reference to the inode, that
> -	 * structure needs to be destroyed and the reference to the
> -	 * inode need to be dropped, when IO is complete, even with 0
> -	 * byte write, or failed.
> -	 *
> -	 * In the successful AIO DIO case, the io_end structure will
> -	 * be destroyed and the reference to the inode will be dropped
> -	 * after the end_io call back function is called.
> -	 *
> -	 * In the case there is 0 byte write, or error case, since VFS
> -	 * direct IO won't invoke the end_io call back function, we
> -	 * need to free the end_io structure here.
> +	 * Put our reference to io_end. This can free the io_end structure e.g.
> +	 * in sync IO case or in case of error. It can even perform extent
> +	 * conversion if all bios we submitted finished before we got here.
> +	 * Note that in that case iocb->private can be already set to NULL
> +	 * here.
>  	 */
> -	if (ret != -EIOCBQUEUED && ret <= 0 && iocb->private) {
> -		ext4_free_io_end(iocb->private);
> -		iocb->private = NULL;
> -	} else if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
> +	if (io_end) {
> +		ext4_inode_aio_set(inode, NULL);
> +		ext4_put_io_end(io_end);
> +		/*
> +		 * In case of error or no write ext4_end_io_dio() was not
> +		 * called so we have to put iocb's reference.
> +		 */
> +		if (ret <= 0 && ret != -EIOCBQUEUED) {
> +			WARN_ON(iocb->private != io_end);
> +			ext4_put_io_end(io_end);
> +			iocb->private = NULL;
> +		}
> +	}
> +	if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
>  						EXT4_STATE_DIO_UNWRITTEN)) {
>  		int err;
>  		/*
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 73bc011..da8bddf 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -51,17 +51,28 @@ void ext4_ioend_wait(struct inode *inode)
>  	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
>  }
>  
> -void ext4_free_io_end(ext4_io_end_t *io)
> +static void ext4_release_io_end(ext4_io_end_t *io_end)
>  {
> -	int i;
> +	BUG_ON(!list_empty(&io_end->list));
> +	BUG_ON(io_end->flag & EXT4_IO_END_UNWRITTEN);
> +
> +	if (atomic_dec_and_test(&EXT4_I(io_end->inode)->i_ioend_count))
> +		wake_up_all(ext4_ioend_wq(io_end->inode));
> +	if (io_end->flag & EXT4_IO_END_DIRECT)
> +		inode_dio_done(io_end->inode);
> +	if (io_end->iocb)
> +		aio_complete(io_end->iocb, io_end->result, 0);
> +	kmem_cache_free(io_end_cachep, io_end);
> +}
>  
> -	BUG_ON(!io);
> -	BUG_ON(!list_empty(&io->list));
> -	BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
> +static void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> +{
> +	struct inode *inode = io_end->inode;
>  
> -	if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count))
> -		wake_up_all(ext4_ioend_wq(io->inode));
> -	kmem_cache_free(io_end_cachep, io);
> +	io_end->flag &= ~EXT4_IO_END_UNWRITTEN;
> +	/* Wake up anyone waiting on unwritten extent conversion */
> +	if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
> +		wake_up_all(ext4_ioend_wq(inode));
>  }
>  
>  /* check a range of space and convert unwritten extents to written. */
> @@ -84,13 +95,8 @@ static int ext4_end_io(ext4_io_end_t *io)
>  			 "(inode %lu, offset %llu, size %zd, error %d)",
>  			 inode->i_ino, offset, size, ret);
>  	}
> -	/* Wake up anyone waiting on unwritten extent conversion */
> -	if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
> -		wake_up_all(ext4_ioend_wq(inode));
> -	if (io->flag & EXT4_IO_END_DIRECT)
> -		inode_dio_done(inode);
> -	if (io->iocb)
> -		aio_complete(io->iocb, io->result, 0);
> +	ext4_clear_io_unwritten_flag(io);
> +	ext4_release_io_end(io);
>  	return ret;
>  }
>  
> @@ -121,7 +127,7 @@ static void dump_completed_IO(struct inode *inode)
>  }
>  
>  /* Add the io_end to per-inode completed end_io list. */
> -void ext4_add_complete_io(ext4_io_end_t *io_end)
> +static void ext4_add_complete_io(ext4_io_end_t *io_end)
>  {
>  	struct ext4_inode_info *ei = EXT4_I(io_end->inode);
>  	struct workqueue_struct *wq;
> @@ -158,8 +164,6 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
>  		err = ext4_end_io(io);
>  		if (unlikely(!ret && err))
>  			ret = err;
> -		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> -		ext4_free_io_end(io);
>  	}
>  	return ret;
>  }
> @@ -191,10 +195,43 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
>  		atomic_inc(&EXT4_I(inode)->i_ioend_count);
>  		io->inode = inode;
>  		INIT_LIST_HEAD(&io->list);
> +		atomic_set(&io->count, 1);
>  	}
>  	return io;
>  }
>  
> +void ext4_put_io_end_defer(ext4_io_end_t *io_end)
> +{
> +	if (atomic_dec_and_test(&io_end->count)) {
> +		if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) || !io_end->size) {
> +			ext4_release_io_end(io_end);
> +			return;
> +		}
> +		ext4_add_complete_io(io_end);
> +	}
> +}
> +
> +int ext4_put_io_end(ext4_io_end_t *io_end)
> +{
> +	int err = 0;
> +
> +	if (atomic_dec_and_test(&io_end->count)) {
> +		if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
> +			err = ext4_convert_unwritten_extents(io_end->inode,
> +						io_end->offset, io_end->size);
> +			ext4_clear_io_unwritten_flag(io_end);
> +		}
> +		ext4_release_io_end(io_end);
> +	}
> +	return err;
> +}
> +
> +ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end)
> +{
> +	atomic_inc(&io_end->count);
> +	return io_end;
> +}
> +
>  /*
>   * Print an buffer I/O error compatible with the fs/buffer.c.  This
>   * provides compatibility with dmesg scrapers that look for a specific
> @@ -277,12 +314,7 @@ static void ext4_end_bio(struct bio *bio, int error)
>  			     bi_sector >> (inode->i_blkbits - 9));
>  	}
>  
> -	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> -		ext4_free_io_end(io_end);
> -		return;
> -	}
> -
> -	ext4_add_complete_io(io_end);
> +	ext4_put_io_end_defer(io_end);
>  }
>  
>  void ext4_io_submit(struct ext4_io_submit *io)
> @@ -296,40 +328,37 @@ void ext4_io_submit(struct ext4_io_submit *io)
>  		bio_put(io->io_bio);
>  	}
>  	io->io_bio = NULL;
> -	io->io_op = 0;
> +}
> +
> +void ext4_io_submit_init(struct ext4_io_submit *io,
> +			 struct writeback_control *wbc)
> +{
> +	io->io_op = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
> +	io->io_bio = NULL;
>  	io->io_end = NULL;
>  }
>  
> -static int io_submit_init(struct ext4_io_submit *io,
> -			  struct inode *inode,
> -			  struct writeback_control *wbc,
> -			  struct buffer_head *bh)
> +static int io_submit_init_bio(struct ext4_io_submit *io,
> +			      struct buffer_head *bh)
>  {
> -	ext4_io_end_t *io_end;
> -	struct page *page = bh->b_page;
>  	int nvecs = bio_get_nr_vecs(bh->b_bdev);
>  	struct bio *bio;
>  
> -	io_end = ext4_init_io_end(inode, GFP_NOFS);
> -	if (!io_end)
> -		return -ENOMEM;
>  	bio = bio_alloc(GFP_NOIO, min(nvecs, BIO_MAX_PAGES));
>  	bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
>  	bio->bi_bdev = bh->b_bdev;
> -	bio->bi_private = io->io_end = io_end;
>  	bio->bi_end_io = ext4_end_bio;
> -
> -	io_end->offset = (page->index << PAGE_CACHE_SHIFT) + bh_offset(bh);
> -
> +	bio->bi_private = ext4_get_io_end(io->io_end);
> +	if (!io->io_end->size)
> +		io->io_end->offset = (bh->b_page->index << PAGE_CACHE_SHIFT)
> +				     + bh_offset(bh);
>  	io->io_bio = bio;
> -	io->io_op = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
>  	io->io_next_block = bh->b_blocknr;
>  	return 0;
>  }
>  
>  static int io_submit_add_bh(struct ext4_io_submit *io,
>  			    struct inode *inode,
> -			    struct writeback_control *wbc,
>  			    struct buffer_head *bh)
>  {
>  	ext4_io_end_t *io_end;
> @@ -340,18 +369,18 @@ submit_and_retry:
>  		ext4_io_submit(io);
>  	}
>  	if (io->io_bio == NULL) {
> -		ret = io_submit_init(io, inode, wbc, bh);
> +		ret = io_submit_init_bio(io, bh);
>  		if (ret)
>  			return ret;
>  	}
> +	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
> +	if (ret != bh->b_size)
> +		goto submit_and_retry;
>  	io_end = io->io_end;
>  	if (buffer_uninit(bh))
>  		ext4_set_io_unwritten_flag(inode, io_end);
> -	io->io_end->size += bh->b_size;
> +	io_end->size += bh->b_size;
>  	io->io_next_block++;
> -	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
> -	if (ret != bh->b_size)
> -		goto submit_and_retry;
>  	return 0;
>  }
>  
> @@ -423,7 +452,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  	do {
>  		if (!buffer_async_write(bh))
>  			continue;
> -		ret = io_submit_add_bh(io, inode, wbc, bh);
> +		ret = io_submit_add_bh(io, inode, bh);
>  		if (ret) {
>  			/*
>  			 * We only get here on ENOMEM.  Not much else
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 03/29] ext4: Clear buffer_uninit flag when submitting IO
  2013-04-08 21:32 ` [PATCH 03/29] ext4: Clear buffer_uninit flag when submitting IO Jan Kara
@ 2013-04-11 14:08   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-04-11 14:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:08PM +0200, Jan Kara wrote:
> Currently no one clears the buffer_uninit flag. This results in
> writeback needlessly marking the io_end as needing extent conversion
> and then scanning the extent tree for extents to convert. So clear the
> buffer_uninit flag once the buffer is submitted for IO, at which point
> the flag is transformed into the EXT4_IO_END_UNWRITTEN flag.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/ext4/page-io.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index da8bddf..efdf0a5 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -377,7 +377,7 @@ submit_and_retry:
>  	if (ret != bh->b_size)
>  		goto submit_and_retry;
>  	io_end = io->io_end;
> -	if (buffer_uninit(bh))
> +	if (test_clear_buffer_uninit(bh))
>  		ext4_set_io_unwritten_flag(inode, io_end);
>  	io_end->size += bh->b_size;
>  	io->io_next_block++;
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 04/29] jbd2: Reduce journal_head size
  2013-04-08 21:32 ` [PATCH 04/29] jbd2: Reduce journal_head size Jan Kara
@ 2013-04-11 14:10   ` Zheng Liu
  2013-04-12  4:04   ` Theodore Ts'o
  1 sibling, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-04-11 14:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:09PM +0200, Jan Kara wrote:
> Remove the unused b_cow_tid field (ext4 copy-on-write support doesn't
> seem to be happening) and change b_modified and b_jlist to bitfields,
> thus saving 8 bytes in the structure.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng
> ---
>  include/linux/journal-head.h |   11 ++---------
>  1 files changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h
> index c18b46f..13a3da2 100644
> --- a/include/linux/journal-head.h
> +++ b/include/linux/journal-head.h
> @@ -31,21 +31,14 @@ struct journal_head {
>  	/*
>  	 * Journalling list for this buffer [jbd_lock_bh_state()]
>  	 */
> -	unsigned b_jlist;
> +	unsigned b_jlist:4;
>  
>  	/*
>  	 * This flag signals the buffer has been modified by
>  	 * the currently running transaction
>  	 * [jbd_lock_bh_state()]
>  	 */
> -	unsigned b_modified;
> -
> -	/*
> -	 * This feild tracks the last transaction id in which this buffer
> -	 * has been cowed
> -	 * [jbd_lock_bh_state()]
> -	 */
> -	tid_t b_cow_tid;
> +	unsigned b_modified:1;
>  
>  	/*
>  	 * Copy of the buffer data frozen for writing to the log.
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 01/29] ext4: Make ext4_bio_write_page() use BH_Async_Write flags instead page pointers from ext4_io_end
  2013-04-08 21:32 ` [PATCH 01/29] ext4: Make ext4_bio_write_page() use BH_Async_Write flags instead page pointers from ext4_io_end Jan Kara
  2013-04-10 18:05   ` Dmitry Monakhov
  2013-04-11 13:38   ` Zheng Liu
@ 2013-04-12  3:50   ` Theodore Ts'o
  2 siblings, 0 replies; 76+ messages in thread
From: Theodore Ts'o @ 2013-04-12  3:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

Thanks, I've applied this to the ext4 dev branch for testing.

	     	     	     	      - Ted

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 02/29] ext4: Use io_end for multiple bios
  2013-04-08 21:32 ` [PATCH 02/29] ext4: Use io_end for multiple bios Jan Kara
  2013-04-11  5:10   ` Dmitry Monakhov
  2013-04-11 14:04   ` Zheng Liu
@ 2013-04-12  3:55   ` Theodore Ts'o
  2 siblings, 0 replies; 76+ messages in thread
From: Theodore Ts'o @ 2013-04-12  3:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Mon, Apr 08, 2013 at 11:32:07PM +0200, Jan Kara wrote:
> Change writeback path to create just one io_end structure for the extent
> to which we submit IO and share it among bios writing that extent. This
> prevents needless splitting and joining of unwritten extents when they
> cannot be submitted as a single bio.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied to the ext4 patch queue for testing.

		       	    	  	- Ted

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 04/29] jbd2: Reduce journal_head size
  2013-04-08 21:32 ` [PATCH 04/29] jbd2: Reduce journal_head size Jan Kara
  2013-04-11 14:10   ` Zheng Liu
@ 2013-04-12  4:04   ` Theodore Ts'o
  1 sibling, 0 replies; 76+ messages in thread
From: Theodore Ts'o @ 2013-04-12  4:04 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Mon, Apr 08, 2013 at 11:32:09PM +0200, Jan Kara wrote:
> Remove the unused b_cow_tid field (ext4 copy-on-write support doesn't
> seem to be happening) and change b_modified and b_jlist to bitfields,
> thus saving 8 bytes in the structure.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Applied for testing in the ext4 dev branch.

						- Ted

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 05/29] jbd2: Don't create journal_head for temporary journal buffers
  2013-04-08 21:32 ` [PATCH 05/29] jbd2: Don't create journal_head for temporary journal buffers Jan Kara
@ 2013-04-12  8:01   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-04-12  8:01 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:10PM +0200, Jan Kara wrote:
> When writing metadata to the journal, we create temporary buffer heads
> for that task. We also attach journal heads to these buffer heads, but
> the only purpose of those journal heads is to keep the buffers linked
> in the transaction's BJ_IO list. We remove the need for journal heads
> by reusing buffer_head's b_assoc_buffers list for that purpose. Also,
> since the BJ_IO list is just a temporary list for transaction commit,
> we use a private list in jbd2_journal_commit_transaction() instead,
> thus removing the BJ_IO list from the transaction completely.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng
> ---
>  fs/jbd2/checkpoint.c  |    1 -
>  fs/jbd2/commit.c      |   65 ++++++++++++++++--------------------------------
>  fs/jbd2/journal.c     |   36 ++++++++++----------------
>  fs/jbd2/transaction.c |   14 +++-------
>  include/linux/jbd2.h  |   32 ++++++++++++------------
>  5 files changed, 56 insertions(+), 92 deletions(-)
> 
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index c78841e..2735fef 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -690,7 +690,6 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact
>  	J_ASSERT(transaction->t_state == T_FINISHED);
>  	J_ASSERT(transaction->t_buffers == NULL);
>  	J_ASSERT(transaction->t_forget == NULL);
> -	J_ASSERT(transaction->t_iobuf_list == NULL);
>  	J_ASSERT(transaction->t_shadow_list == NULL);
>  	J_ASSERT(transaction->t_log_list == NULL);
>  	J_ASSERT(transaction->t_checkpoint_list == NULL);
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 750c701..c503df6 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -368,7 +368,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  {
>  	struct transaction_stats_s stats;
>  	transaction_t *commit_transaction;
> -	struct journal_head *jh, *new_jh, *descriptor;
> +	struct journal_head *jh, *descriptor;
>  	struct buffer_head **wbuf = journal->j_wbuf;
>  	int bufs;
>  	int flags;
> @@ -392,6 +392,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  	tid_t first_tid;
>  	int update_tail;
>  	int csum_size = 0;
> +	LIST_HEAD(io_bufs);
>  
>  	if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
>  		csum_size = sizeof(struct jbd2_journal_block_tail);
> @@ -658,29 +659,22 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  
>  		/* Bump b_count to prevent truncate from stumbling over
>                     the shadowed buffer!  @@@ This can go if we ever get
> -                   rid of the BJ_IO/BJ_Shadow pairing of buffers. */
> +                   rid of the shadow pairing of buffers. */
>  		atomic_inc(&jh2bh(jh)->b_count);
>  
> -		/* Make a temporary IO buffer with which to write it out
> -                   (this will requeue both the metadata buffer and the
> -                   temporary IO buffer). new_bh goes on BJ_IO*/
> -
> -		set_bit(BH_JWrite, &jh2bh(jh)->b_state);
>  		/*
> -		 * akpm: jbd2_journal_write_metadata_buffer() sets
> -		 * new_bh->b_transaction to commit_transaction.
> -		 * We need to clean this up before we release new_bh
> -		 * (which is of type BJ_IO)
> +		 * Make a temporary IO buffer with which to write it out
> +		 * (this will requeue the metadata buffer to BJ_Shadow).
>  		 */
> +		set_bit(BH_JWrite, &jh2bh(jh)->b_state);
>  		JBUFFER_TRACE(jh, "ph3: write metadata");
>  		flags = jbd2_journal_write_metadata_buffer(commit_transaction,
> -						      jh, &new_jh, blocknr);
> +						jh, &wbuf[bufs], blocknr);
>  		if (flags < 0) {
>  			jbd2_journal_abort(journal, flags);
>  			continue;
>  		}
> -		set_bit(BH_JWrite, &jh2bh(new_jh)->b_state);
> -		wbuf[bufs++] = jh2bh(new_jh);
> +		jbd2_file_log_bh(&io_bufs, wbuf[bufs]);
>  
>  		/* Record the new block's tag in the current descriptor
>                     buffer */
> @@ -694,10 +688,11 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  		tag = (journal_block_tag_t *) tagp;
>  		write_tag_block(tag_bytes, tag, jh2bh(jh)->b_blocknr);
>  		tag->t_flags = cpu_to_be16(tag_flag);
> -		jbd2_block_tag_csum_set(journal, tag, jh2bh(new_jh),
> +		jbd2_block_tag_csum_set(journal, tag, wbuf[bufs],
>  					commit_transaction->t_tid);
>  		tagp += tag_bytes;
>  		space_left -= tag_bytes;
> +		bufs++;
>  
>  		if (first_tag) {
>  			memcpy (tagp, journal->j_uuid, 16);
> @@ -809,7 +804,7 @@ start_journal_io:
>             the log.  Before we can commit it, wait for the IO so far to
>             complete.  Control buffers being written are on the
>             transaction's t_log_list queue, and metadata buffers are on
> -           the t_iobuf_list queue.
> +           the io_bufs list.
>  
>  	   Wait for the buffers in reverse order.  That way we are
>  	   less likely to be woken up until all IOs have completed, and
> @@ -818,46 +813,31 @@ start_journal_io:
>  
>  	jbd_debug(3, "JBD2: commit phase 3\n");
>  
> -	/*
> -	 * akpm: these are BJ_IO, and j_list_lock is not needed.
> -	 * See __journal_try_to_free_buffer.
> -	 */
> -wait_for_iobuf:
> -	while (commit_transaction->t_iobuf_list != NULL) {
> -		struct buffer_head *bh;
> +	while (!list_empty(&io_bufs)) {
> +		struct buffer_head *bh = list_entry(io_bufs.prev,
> +						    struct buffer_head,
> +						    b_assoc_buffers);
>  
> -		jh = commit_transaction->t_iobuf_list->b_tprev;
> -		bh = jh2bh(jh);
> -		if (buffer_locked(bh)) {
> -			wait_on_buffer(bh);
> -			goto wait_for_iobuf;
> -		}
> -		if (cond_resched())
> -			goto wait_for_iobuf;
> +		wait_on_buffer(bh);
> +		cond_resched();
>  
>  		if (unlikely(!buffer_uptodate(bh)))
>  			err = -EIO;
> -
> -		clear_buffer_jwrite(bh);
> -
> -		JBUFFER_TRACE(jh, "ph4: unfile after journal write");
> -		jbd2_journal_unfile_buffer(journal, jh);
> +		jbd2_unfile_log_bh(bh);
>  
>  		/*
> -		 * ->t_iobuf_list should contain only dummy buffer_heads
> -		 * which were created by jbd2_journal_write_metadata_buffer().
> +		 * The list contains temporary buffer heads created by
> +		 * jbd2_journal_write_metadata_buffer().
>  		 */
>  		BUFFER_TRACE(bh, "dumping temporary bh");
> -		jbd2_journal_put_journal_head(jh);
>  		__brelse(bh);
>  		J_ASSERT_BH(bh, atomic_read(&bh->b_count) == 0);
>  		free_buffer_head(bh);
>  
> -		/* We also have to unlock and free the corresponding
> -                   shadowed buffer */
> +		/* We also have to refile the corresponding shadowed buffer */
>  		jh = commit_transaction->t_shadow_list->b_tprev;
>  		bh = jh2bh(jh);
> -		clear_bit(BH_JWrite, &bh->b_state);
> +		clear_buffer_jwrite(bh);
>  		J_ASSERT_BH(bh, buffer_jbddirty(bh));
>  
>  		/* The metadata is now released for reuse, but we need
> @@ -952,7 +932,6 @@ wait_for_iobuf:
>  	J_ASSERT(list_empty(&commit_transaction->t_inode_list));
>  	J_ASSERT(commit_transaction->t_buffers == NULL);
>  	J_ASSERT(commit_transaction->t_checkpoint_list == NULL);
> -	J_ASSERT(commit_transaction->t_iobuf_list == NULL);
>  	J_ASSERT(commit_transaction->t_shadow_list == NULL);
>  	J_ASSERT(commit_transaction->t_log_list == NULL);
>  
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index ed10991..eb6272b 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -310,14 +310,12 @@ static void journal_kill_thread(journal_t *journal)
>   *
>   * If the source buffer has already been modified by a new transaction
>   * since we took the last commit snapshot, we use the frozen copy of
> - * that data for IO.  If we end up using the existing buffer_head's data
> - * for the write, then we *have* to lock the buffer to prevent anyone
> - * else from using and possibly modifying it while the IO is in
> - * progress.
> + * that data for IO. If we end up using the existing buffer_head's data
> + * for the write, then we have to make sure nobody modifies it while the
> + * IO is in progress. do_get_write_access() handles this.
>   *
> - * The function returns a pointer to the buffer_heads to be used for IO.
> - *
> - * We assume that the journal has already been locked in this function.
> + * The function returns a pointer to the buffer_head to be used for IO.
> + * 
>   *
>   * Return value:
>   *  <0: Error
> @@ -330,15 +328,14 @@ static void journal_kill_thread(journal_t *journal)
>  
>  int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
>  				  struct journal_head  *jh_in,
> -				  struct journal_head **jh_out,
> -				  unsigned long long blocknr)
> +				  struct buffer_head **bh_out,
> +				  sector_t blocknr)
>  {
>  	int need_copy_out = 0;
>  	int done_copy_out = 0;
>  	int do_escape = 0;
>  	char *mapped_data;
>  	struct buffer_head *new_bh;
> -	struct journal_head *new_jh;
>  	struct page *new_page;
>  	unsigned int new_offset;
>  	struct buffer_head *bh_in = jh2bh(jh_in);
> @@ -370,14 +367,13 @@ retry_alloc:
>  	new_bh->b_state = 0;
>  	init_buffer(new_bh, NULL, NULL);
>  	atomic_set(&new_bh->b_count, 1);
> -	new_jh = jbd2_journal_add_journal_head(new_bh);	/* This sleeps */
>  
> +	jbd_lock_bh_state(bh_in);
> +repeat:
>  	/*
>  	 * If a new transaction has already done a buffer copy-out, then
>  	 * we use that version of the data for the commit.
>  	 */
> -	jbd_lock_bh_state(bh_in);
> -repeat:
>  	if (jh_in->b_frozen_data) {
>  		done_copy_out = 1;
>  		new_page = virt_to_page(jh_in->b_frozen_data);
> @@ -417,7 +413,7 @@ repeat:
>  		jbd_unlock_bh_state(bh_in);
>  		tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
>  		if (!tmp) {
> -			jbd2_journal_put_journal_head(new_jh);
> +			brelse(new_bh);
>  			return -ENOMEM;
>  		}
>  		jbd_lock_bh_state(bh_in);
> @@ -428,7 +424,7 @@ repeat:
>  
>  		jh_in->b_frozen_data = tmp;
>  		mapped_data = kmap_atomic(new_page);
> -		memcpy(tmp, mapped_data + new_offset, jh2bh(jh_in)->b_size);
> +		memcpy(tmp, mapped_data + new_offset, bh_in->b_size);
>  		kunmap_atomic(mapped_data);
>  
>  		new_page = virt_to_page(tmp);
> @@ -454,14 +450,13 @@ repeat:
>  	}
>  
>  	set_bh_page(new_bh, new_page, new_offset);
> -	new_jh->b_transaction = NULL;
> -	new_bh->b_size = jh2bh(jh_in)->b_size;
> -	new_bh->b_bdev = transaction->t_journal->j_dev;
> +	new_bh->b_size = bh_in->b_size;
> +	new_bh->b_bdev = journal->j_dev;
>  	new_bh->b_blocknr = blocknr;
>  	set_buffer_mapped(new_bh);
>  	set_buffer_dirty(new_bh);
>  
> -	*jh_out = new_jh;
> +	*bh_out = new_bh;
>  
>  	/*
>  	 * The to-be-written buffer needs to get moved to the io queue,
> @@ -474,9 +469,6 @@ repeat:
>  	spin_unlock(&journal->j_list_lock);
>  	jbd_unlock_bh_state(bh_in);
>  
> -	JBUFFER_TRACE(new_jh, "file as BJ_IO");
> -	jbd2_journal_file_buffer(new_jh, transaction, BJ_IO);
> -
>  	return do_escape | (done_copy_out << 1);
>  }
>  
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index d6ee5ae..3be34c7 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1589,10 +1589,10 @@ __blist_del_buffer(struct journal_head **list, struct journal_head *jh)
>   * Remove a buffer from the appropriate transaction list.
>   *
>   * Note that this function can *change* the value of
> - * bh->b_transaction->t_buffers, t_forget, t_iobuf_list, t_shadow_list,
> - * t_log_list or t_reserved_list.  If the caller is holding onto a copy of one
> - * of these pointers, it could go bad.  Generally the caller needs to re-read
> - * the pointer from the transaction_t.
> + * bh->b_transaction->t_buffers, t_forget, t_shadow_list, t_log_list or
> + * t_reserved_list.  If the caller is holding onto a copy of one of these
> + * pointers, it could go bad.  Generally the caller needs to re-read the
> + * pointer from the transaction_t.
>   *
>   * Called under j_list_lock.
>   */
> @@ -1622,9 +1622,6 @@ static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
>  	case BJ_Forget:
>  		list = &transaction->t_forget;
>  		break;
> -	case BJ_IO:
> -		list = &transaction->t_iobuf_list;
> -		break;
>  	case BJ_Shadow:
>  		list = &transaction->t_shadow_list;
>  		break;
> @@ -2126,9 +2123,6 @@ void __jbd2_journal_file_buffer(struct journal_head *jh,
>  	case BJ_Forget:
>  		list = &transaction->t_forget;
>  		break;
> -	case BJ_IO:
> -		list = &transaction->t_iobuf_list;
> -		break;
>  	case BJ_Shadow:
>  		list = &transaction->t_shadow_list;
>  		break;
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 50e5a5e..a670595 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -523,12 +523,6 @@ struct transaction_s
>  	struct journal_head	*t_checkpoint_io_list;
>  
>  	/*
> -	 * Doubly-linked circular list of temporary buffers currently undergoing
> -	 * IO in the log [j_list_lock]
> -	 */
> -	struct journal_head	*t_iobuf_list;
> -
> -	/*
>  	 * Doubly-linked circular list of metadata buffers being shadowed by log
>  	 * IO.  The IO buffers on the iobuf list and the shadow buffers on this
>  	 * list match each other one for one at all times. [j_list_lock]
> @@ -990,6 +984,14 @@ extern void __jbd2_journal_file_buffer(struct journal_head *, transaction_t *, i
>  extern void __journal_free_buffer(struct journal_head *bh);
>  extern void jbd2_journal_file_buffer(struct journal_head *, transaction_t *, int);
>  extern void __journal_clean_data_list(transaction_t *transaction);
> +static inline void jbd2_file_log_bh(struct list_head *head, struct buffer_head *bh)
> +{
> +	list_add_tail(&bh->b_assoc_buffers, head);
> +}
> +static inline void jbd2_unfile_log_bh(struct buffer_head *bh)
> +{
> +	list_del_init(&bh->b_assoc_buffers);
> +}
>  
>  /* Log buffer allocation */
>  extern struct journal_head * jbd2_journal_get_descriptor_buffer(journal_t *);
> @@ -1038,11 +1040,10 @@ extern void jbd2_buffer_abort_trigger(struct journal_head *jh,
>  				      struct jbd2_buffer_trigger_type *triggers);
>  
>  /* Buffer IO */
> -extern int
> -jbd2_journal_write_metadata_buffer(transaction_t	  *transaction,
> -			      struct journal_head  *jh_in,
> -			      struct journal_head **jh_out,
> -			      unsigned long long   blocknr);
> +extern int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
> +					      struct journal_head *jh_in,
> +					      struct buffer_head **bh_out,
> +					      sector_t blocknr);
>  
>  /* Transaction locking */
>  extern void		__wait_on_journal (journal_t *);
> @@ -1284,11 +1285,10 @@ static inline int jbd_space_needed(journal_t *journal)
>  #define BJ_None		0	/* Not journaled */
>  #define BJ_Metadata	1	/* Normal journaled metadata */
>  #define BJ_Forget	2	/* Buffer superseded by this transaction */
> -#define BJ_IO		3	/* Buffer is for temporary IO use */
> -#define BJ_Shadow	4	/* Buffer contents being shadowed to the log */
> -#define BJ_LogCtl	5	/* Buffer contains log descriptors */
> -#define BJ_Reserved	6	/* Buffer is reserved for access by journal */
> -#define BJ_Types	7
> +#define BJ_Shadow	3	/* Buffer contents being shadowed to the log */
> +#define BJ_LogCtl	4	/* Buffer contains log descriptors */
> +#define BJ_Reserved	5	/* Buffer is reserved for access by journal */
> +#define BJ_Types	6
>  
>  extern int jbd_blocks_per_page(struct inode *inode);
>  
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 06/29] jbd2: Remove journal_head from descriptor buffers
  2013-04-08 21:32 ` [PATCH 06/29] jbd2: Remove journal_head from descriptor buffers Jan Kara
@ 2013-04-12  8:10   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-04-12  8:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:11PM +0200, Jan Kara wrote:
> As with metadata buffers, log descriptor buffers don't really need the
> journal head either. So strip it and remove the BJ_LogCtl list.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng
> ---
>  fs/jbd2/checkpoint.c  |    1 -
>  fs/jbd2/commit.c      |   78 +++++++++++++++++++-----------------------------
>  fs/jbd2/journal.c     |    4 +-
>  fs/jbd2/revoke.c      |   49 +++++++++++++++---------------
>  fs/jbd2/transaction.c |    6 ----
>  include/linux/jbd2.h  |   19 ++++-------
>  6 files changed, 64 insertions(+), 93 deletions(-)
> 
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 2735fef..65ec076 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -691,7 +691,6 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact
>  	J_ASSERT(transaction->t_buffers == NULL);
>  	J_ASSERT(transaction->t_forget == NULL);
>  	J_ASSERT(transaction->t_shadow_list == NULL);
> -	J_ASSERT(transaction->t_log_list == NULL);
>  	J_ASSERT(transaction->t_checkpoint_list == NULL);
>  	J_ASSERT(transaction->t_checkpoint_io_list == NULL);
>  	J_ASSERT(atomic_read(&transaction->t_updates) == 0);
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index c503df6..1a03762 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -85,8 +85,7 @@ nope:
>  	__brelse(bh);
>  }
>  
> -static void jbd2_commit_block_csum_set(journal_t *j,
> -				       struct journal_head *descriptor)
> +static void jbd2_commit_block_csum_set(journal_t *j, struct buffer_head *bh)
>  {
>  	struct commit_header *h;
>  	__u32 csum;
> @@ -94,12 +93,11 @@ static void jbd2_commit_block_csum_set(journal_t *j,
>  	if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
>  		return;
>  
> -	h = (struct commit_header *)(jh2bh(descriptor)->b_data);
> +	h = (struct commit_header *)(bh->b_data);
>  	h->h_chksum_type = 0;
>  	h->h_chksum_size = 0;
>  	h->h_chksum[0] = 0;
> -	csum = jbd2_chksum(j, j->j_csum_seed, jh2bh(descriptor)->b_data,
> -			   j->j_blocksize);
> +	csum = jbd2_chksum(j, j->j_csum_seed, bh->b_data, j->j_blocksize);
>  	h->h_chksum[0] = cpu_to_be32(csum);
>  }
>  
> @@ -116,7 +114,6 @@ static int journal_submit_commit_record(journal_t *journal,
>  					struct buffer_head **cbh,
>  					__u32 crc32_sum)
>  {
> -	struct journal_head *descriptor;
>  	struct commit_header *tmp;
>  	struct buffer_head *bh;
>  	int ret;
> @@ -127,12 +124,10 @@ static int journal_submit_commit_record(journal_t *journal,
>  	if (is_journal_aborted(journal))
>  		return 0;
>  
> -	descriptor = jbd2_journal_get_descriptor_buffer(journal);
> -	if (!descriptor)
> +	bh = jbd2_journal_get_descriptor_buffer(journal);
> +	if (!bh)
>  		return 1;
>  
> -	bh = jh2bh(descriptor);
> -
>  	tmp = (struct commit_header *)bh->b_data;
>  	tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
>  	tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
> @@ -146,9 +141,9 @@ static int journal_submit_commit_record(journal_t *journal,
>  		tmp->h_chksum_size 	= JBD2_CRC32_CHKSUM_SIZE;
>  		tmp->h_chksum[0] 	= cpu_to_be32(crc32_sum);
>  	}
> -	jbd2_commit_block_csum_set(journal, descriptor);
> +	jbd2_commit_block_csum_set(journal, bh);
>  
> -	JBUFFER_TRACE(descriptor, "submit commit block");
> +	BUFFER_TRACE(bh, "submit commit block");
>  	lock_buffer(bh);
>  	clear_buffer_dirty(bh);
>  	set_buffer_uptodate(bh);
> @@ -180,7 +175,6 @@ static int journal_wait_on_commit_record(journal_t *journal,
>  	if (unlikely(!buffer_uptodate(bh)))
>  		ret = -EIO;
>  	put_bh(bh);            /* One for getblk() */
> -	jbd2_journal_put_journal_head(bh2jh(bh));
>  
>  	return ret;
>  }
> @@ -321,7 +315,7 @@ static void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
>  }
>  
>  static void jbd2_descr_block_csum_set(journal_t *j,
> -				      struct journal_head *descriptor)
> +				      struct buffer_head *bh)
>  {
>  	struct jbd2_journal_block_tail *tail;
>  	__u32 csum;
> @@ -329,12 +323,10 @@ static void jbd2_descr_block_csum_set(journal_t *j,
>  	if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
>  		return;
>  
> -	tail = (struct jbd2_journal_block_tail *)
> -			(jh2bh(descriptor)->b_data + j->j_blocksize -
> +	tail = (struct jbd2_journal_block_tail *)(bh->b_data + j->j_blocksize -
>  			sizeof(struct jbd2_journal_block_tail));
>  	tail->t_checksum = 0;
> -	csum = jbd2_chksum(j, j->j_csum_seed, jh2bh(descriptor)->b_data,
> -			   j->j_blocksize);
> +	csum = jbd2_chksum(j, j->j_csum_seed, bh->b_data, j->j_blocksize);
>  	tail->t_checksum = cpu_to_be32(csum);
>  }
>  
> @@ -368,7 +360,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  {
>  	struct transaction_stats_s stats;
>  	transaction_t *commit_transaction;
> -	struct journal_head *jh, *descriptor;
> +	struct journal_head *jh;
> +	struct buffer_head *descriptor;
>  	struct buffer_head **wbuf = journal->j_wbuf;
>  	int bufs;
>  	int flags;
> @@ -393,6 +386,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  	int update_tail;
>  	int csum_size = 0;
>  	LIST_HEAD(io_bufs);
> +	LIST_HEAD(log_bufs);
>  
>  	if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
>  		csum_size = sizeof(struct jbd2_journal_block_tail);
> @@ -546,7 +540,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  
>  	blk_start_plug(&plug);
>  	jbd2_journal_write_revoke_records(journal, commit_transaction,
> -					  WRITE_SYNC);
> +					  &log_bufs, WRITE_SYNC);
>  	blk_finish_plug(&plug);
>  
>  	jbd_debug(3, "JBD2: commit phase 2\n");
> @@ -572,8 +566,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  		 atomic_read(&commit_transaction->t_outstanding_credits));
>  
>  	err = 0;
> -	descriptor = NULL;
>  	bufs = 0;
> +	descriptor = NULL;
>  	blk_start_plug(&plug);
>  	while (commit_transaction->t_buffers) {
>  
> @@ -605,8 +599,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  		   record the metadata buffer. */
>  
>  		if (!descriptor) {
> -			struct buffer_head *bh;
> -
>  			J_ASSERT (bufs == 0);
>  
>  			jbd_debug(4, "JBD2: get descriptor\n");
> @@ -617,26 +609,26 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  				continue;
>  			}
>  
> -			bh = jh2bh(descriptor);
>  			jbd_debug(4, "JBD2: got buffer %llu (%p)\n",
> -				(unsigned long long)bh->b_blocknr, bh->b_data);
> -			header = (journal_header_t *)&bh->b_data[0];
> +				(unsigned long long)descriptor->b_blocknr,
> +				descriptor->b_data);
> +			header = (journal_header_t *)descriptor->b_data;
>  			header->h_magic     = cpu_to_be32(JBD2_MAGIC_NUMBER);
>  			header->h_blocktype = cpu_to_be32(JBD2_DESCRIPTOR_BLOCK);
>  			header->h_sequence  = cpu_to_be32(commit_transaction->t_tid);
>  
> -			tagp = &bh->b_data[sizeof(journal_header_t)];
> -			space_left = bh->b_size - sizeof(journal_header_t);
> +			tagp = &descriptor->b_data[sizeof(journal_header_t)];
> +			space_left = descriptor->b_size -
> +						sizeof(journal_header_t);
>  			first_tag = 1;
> -			set_buffer_jwrite(bh);
> -			set_buffer_dirty(bh);
> -			wbuf[bufs++] = bh;
> +			set_buffer_jwrite(descriptor);
> +			set_buffer_dirty(descriptor);
> +			wbuf[bufs++] = descriptor;
>  
>  			/* Record it so that we can wait for IO
>                             completion later */
> -			BUFFER_TRACE(bh, "ph3: file as descriptor");
> -			jbd2_journal_file_buffer(descriptor, commit_transaction,
> -					BJ_LogCtl);
> +			BUFFER_TRACE(descriptor, "ph3: file as descriptor");
> +			jbd2_file_log_bh(&log_bufs, descriptor);
>  		}
>  
>  		/* Where is the buffer to be written? */
> @@ -863,26 +855,19 @@ start_journal_io:
>  	jbd_debug(3, "JBD2: commit phase 4\n");
>  
>  	/* Here we wait for the revoke record and descriptor record buffers */
> - wait_for_ctlbuf:
> -	while (commit_transaction->t_log_list != NULL) {
> +	while (!list_empty(&log_bufs)) {
>  		struct buffer_head *bh;
>  
> -		jh = commit_transaction->t_log_list->b_tprev;
> -		bh = jh2bh(jh);
> -		if (buffer_locked(bh)) {
> -			wait_on_buffer(bh);
> -			goto wait_for_ctlbuf;
> -		}
> -		if (cond_resched())
> -			goto wait_for_ctlbuf;
> +		bh = list_entry(log_bufs.prev, struct buffer_head, b_assoc_buffers);
> +		wait_on_buffer(bh);
> +		cond_resched();
>  
>  		if (unlikely(!buffer_uptodate(bh)))
>  			err = -EIO;
>  
>  		BUFFER_TRACE(bh, "ph5: control buffer writeout done: unfile");
>  		clear_buffer_jwrite(bh);
> -		jbd2_journal_unfile_buffer(journal, jh);
> -		jbd2_journal_put_journal_head(jh);
> +		jbd2_unfile_log_bh(bh);
>  		__brelse(bh);		/* One for getblk */
>  		/* AKPM: bforget here */
>  	}
> @@ -933,7 +918,6 @@ start_journal_io:
>  	J_ASSERT(commit_transaction->t_buffers == NULL);
>  	J_ASSERT(commit_transaction->t_checkpoint_list == NULL);
>  	J_ASSERT(commit_transaction->t_shadow_list == NULL);
> -	J_ASSERT(commit_transaction->t_log_list == NULL);
>  
>  restart_loop:
>  	/*
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index eb6272b..e03aae0 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -761,7 +761,7 @@ int jbd2_journal_bmap(journal_t *journal, unsigned long blocknr,
>   * But we don't bother doing that, so there will be coherency problems with
>   * mmaps of blockdevs which hold live JBD-controlled filesystems.
>   */
> -struct journal_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
> +struct buffer_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
>  {
>  	struct buffer_head *bh;
>  	unsigned long long blocknr;
> @@ -780,7 +780,7 @@ struct journal_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
>  	set_buffer_uptodate(bh);
>  	unlock_buffer(bh);
>  	BUFFER_TRACE(bh, "return this buffer");
> -	return jbd2_journal_add_journal_head(bh);
> +	return bh;
>  }
>  
>  /*
> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
> index f30b80b..198c9c1 100644
> --- a/fs/jbd2/revoke.c
> +++ b/fs/jbd2/revoke.c
> @@ -122,9 +122,10 @@ struct jbd2_revoke_table_s
>  
>  #ifdef __KERNEL__
>  static void write_one_revoke_record(journal_t *, transaction_t *,
> -				    struct journal_head **, int *,
> +				    struct list_head *,
> +				    struct buffer_head **, int *,
>  				    struct jbd2_revoke_record_s *, int);
> -static void flush_descriptor(journal_t *, struct journal_head *, int, int);
> +static void flush_descriptor(journal_t *, struct buffer_head *, int, int);
>  #endif
>  
>  /* Utility functions to maintain the revoke table */
> @@ -531,9 +532,10 @@ void jbd2_journal_switch_revoke_table(journal_t *journal)
>   */
>  void jbd2_journal_write_revoke_records(journal_t *journal,
>  				       transaction_t *transaction,
> +				       struct list_head *log_bufs,
>  				       int write_op)
>  {
> -	struct journal_head *descriptor;
> +	struct buffer_head *descriptor;
>  	struct jbd2_revoke_record_s *record;
>  	struct jbd2_revoke_table_s *revoke;
>  	struct list_head *hash_list;
> @@ -553,7 +555,7 @@ void jbd2_journal_write_revoke_records(journal_t *journal,
>  		while (!list_empty(hash_list)) {
>  			record = (struct jbd2_revoke_record_s *)
>  				hash_list->next;
> -			write_one_revoke_record(journal, transaction,
> +			write_one_revoke_record(journal, transaction, log_bufs,
>  						&descriptor, &offset,
>  						record, write_op);
>  			count++;
> @@ -573,13 +575,14 @@ void jbd2_journal_write_revoke_records(journal_t *journal,
>  
>  static void write_one_revoke_record(journal_t *journal,
>  				    transaction_t *transaction,
> -				    struct journal_head **descriptorp,
> +				    struct list_head *log_bufs,
> +				    struct buffer_head **descriptorp,
>  				    int *offsetp,
>  				    struct jbd2_revoke_record_s *record,
>  				    int write_op)
>  {
>  	int csum_size = 0;
> -	struct journal_head *descriptor;
> +	struct buffer_head *descriptor;
>  	int offset;
>  	journal_header_t *header;
>  
> @@ -609,26 +612,26 @@ static void write_one_revoke_record(journal_t *journal,
>  		descriptor = jbd2_journal_get_descriptor_buffer(journal);
>  		if (!descriptor)
>  			return;
> -		header = (journal_header_t *) &jh2bh(descriptor)->b_data[0];
> +		header = (journal_header_t *)descriptor->b_data;
>  		header->h_magic     = cpu_to_be32(JBD2_MAGIC_NUMBER);
>  		header->h_blocktype = cpu_to_be32(JBD2_REVOKE_BLOCK);
>  		header->h_sequence  = cpu_to_be32(transaction->t_tid);
>  
>  		/* Record it so that we can wait for IO completion later */
> -		JBUFFER_TRACE(descriptor, "file as BJ_LogCtl");
> -		jbd2_journal_file_buffer(descriptor, transaction, BJ_LogCtl);
> +		BUFFER_TRACE(descriptor, "file in log_bufs");
> +		jbd2_file_log_bh(log_bufs, descriptor);
>  
>  		offset = sizeof(jbd2_journal_revoke_header_t);
>  		*descriptorp = descriptor;
>  	}
>  
>  	if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT)) {
> -		* ((__be64 *)(&jh2bh(descriptor)->b_data[offset])) =
> +		* ((__be64 *)(&descriptor->b_data[offset])) =
>  			cpu_to_be64(record->blocknr);
>  		offset += 8;
>  
>  	} else {
> -		* ((__be32 *)(&jh2bh(descriptor)->b_data[offset])) =
> +		* ((__be32 *)(&descriptor->b_data[offset])) =
>  			cpu_to_be32(record->blocknr);
>  		offset += 4;
>  	}
> @@ -636,8 +639,7 @@ static void write_one_revoke_record(journal_t *journal,
>  	*offsetp = offset;
>  }
>  
> -static void jbd2_revoke_csum_set(journal_t *j,
> -				 struct journal_head *descriptor)
> +static void jbd2_revoke_csum_set(journal_t *j, struct buffer_head *bh)
>  {
>  	struct jbd2_journal_revoke_tail *tail;
>  	__u32 csum;
> @@ -645,12 +647,10 @@ static void jbd2_revoke_csum_set(journal_t *j,
>  	if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
>  		return;
>  
> -	tail = (struct jbd2_journal_revoke_tail *)
> -			(jh2bh(descriptor)->b_data + j->j_blocksize -
> +	tail = (struct jbd2_journal_revoke_tail *)(bh->b_data + j->j_blocksize -
>  			sizeof(struct jbd2_journal_revoke_tail));
>  	tail->r_checksum = 0;
> -	csum = jbd2_chksum(j, j->j_csum_seed, jh2bh(descriptor)->b_data,
> -			   j->j_blocksize);
> +	csum = jbd2_chksum(j, j->j_csum_seed, bh->b_data, j->j_blocksize);
>  	tail->r_checksum = cpu_to_be32(csum);
>  }
>  
> @@ -662,25 +662,24 @@ static void jbd2_revoke_csum_set(journal_t *j,
>   */
>  
>  static void flush_descriptor(journal_t *journal,
> -			     struct journal_head *descriptor,
> +			     struct buffer_head *descriptor,
>  			     int offset, int write_op)
>  {
>  	jbd2_journal_revoke_header_t *header;
> -	struct buffer_head *bh = jh2bh(descriptor);
>  
>  	if (is_journal_aborted(journal)) {
> -		put_bh(bh);
> +		put_bh(descriptor);
>  		return;
>  	}
>  
> -	header = (jbd2_journal_revoke_header_t *) jh2bh(descriptor)->b_data;
> +	header = (jbd2_journal_revoke_header_t *)descriptor->b_data;
>  	header->r_count = cpu_to_be32(offset);
>  	jbd2_revoke_csum_set(journal, descriptor);
>  
> -	set_buffer_jwrite(bh);
> -	BUFFER_TRACE(bh, "write");
> -	set_buffer_dirty(bh);
> -	write_dirty_buffer(bh, write_op);
> +	set_buffer_jwrite(descriptor);
> +	BUFFER_TRACE(descriptor, "write");
> +	set_buffer_dirty(descriptor);
> +	write_dirty_buffer(descriptor, write_op);
>  }
>  #endif
>  
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 3be34c7..bc35899 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1625,9 +1625,6 @@ static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
>  	case BJ_Shadow:
>  		list = &transaction->t_shadow_list;
>  		break;
> -	case BJ_LogCtl:
> -		list = &transaction->t_log_list;
> -		break;
>  	case BJ_Reserved:
>  		list = &transaction->t_reserved_list;
>  		break;
> @@ -2126,9 +2123,6 @@ void __jbd2_journal_file_buffer(struct journal_head *jh,
>  	case BJ_Shadow:
>  		list = &transaction->t_shadow_list;
>  		break;
> -	case BJ_LogCtl:
> -		list = &transaction->t_log_list;
> -		break;
>  	case BJ_Reserved:
>  		list = &transaction->t_reserved_list;
>  		break;
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index a670595..4584518 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -530,12 +530,6 @@ struct transaction_s
>  	struct journal_head	*t_shadow_list;
>  
>  	/*
> -	 * Doubly-linked circular list of control buffers being written to the
> -	 * log. [j_list_lock]
> -	 */
> -	struct journal_head	*t_log_list;
> -
> -	/*
>  	 * List of inodes whose data we've modified in data=ordered mode.
>  	 * [j_list_lock]
>  	 */
> @@ -994,7 +988,7 @@ static inline void jbd2_unfile_log_bh(struct buffer_head *bh)
>  }
>  
>  /* Log buffer allocation */
> -extern struct journal_head * jbd2_journal_get_descriptor_buffer(journal_t *);
> +struct buffer_head *jbd2_journal_get_descriptor_buffer(journal_t *journal);
>  int jbd2_journal_next_log_block(journal_t *, unsigned long long *);
>  int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid,
>  			      unsigned long *block);
> @@ -1178,8 +1172,10 @@ extern int	   jbd2_journal_init_revoke_caches(void);
>  extern void	   jbd2_journal_destroy_revoke(journal_t *);
>  extern int	   jbd2_journal_revoke (handle_t *, unsigned long long, struct buffer_head *);
>  extern int	   jbd2_journal_cancel_revoke(handle_t *, struct journal_head *);
> -extern void	   jbd2_journal_write_revoke_records(journal_t *,
> -						     transaction_t *, int);
> +extern void	   jbd2_journal_write_revoke_records(journal_t *journal,
> +						     transaction_t *transaction,
> +						     struct list_head *log_bufs,
> +						     int write_op);
>  
>  /* Recovery revoke support */
>  extern int	jbd2_journal_set_revoke(journal_t *, unsigned long long, tid_t);
> @@ -1286,9 +1282,8 @@ static inline int jbd_space_needed(journal_t *journal)
>  #define BJ_Metadata	1	/* Normal journaled metadata */
>  #define BJ_Forget	2	/* Buffer superseded by this transaction */
>  #define BJ_Shadow	3	/* Buffer contents being shadowed to the log */
> -#define BJ_LogCtl	4	/* Buffer contains log descriptors */
> -#define BJ_Reserved	5	/* Buffer is reserved for access by journal */
> -#define BJ_Types	6
> +#define BJ_Reserved	4	/* Buffer is reserved for access by journal */
> +#define BJ_Types	5
>  
>  extern int jbd_blocks_per_page(struct inode *inode);
>  
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 07/29] jbd2: Refine waiting for shadow buffers
  2013-04-08 21:32 ` [PATCH 07/29] jbd2: Refine waiting for shadow buffers Jan Kara
@ 2013-05-03 14:16   ` Zheng Liu
  2013-05-03 20:44     ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Zheng Liu @ 2013-05-03 14:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:12PM +0200, Jan Kara wrote:
> Currently, when we add a buffer to a transaction, we wait until the
> buffer is removed from the BJ_Shadow list (so that we prevent any
> changes to the buffer that is just being written to the journal). This
> can take unnecessarily long, as a lot happens between the time the
> buffer is submitted to the journal and the time when we remove the
> buffer from the BJ_Shadow list (e.g. we wait for all data buffers in
> the transaction, we issue a cache flush, etc.). This also creates a
> dependency of do_get_write_access() on transaction commit (namely
> waiting for data IO to complete) which we want to avoid when
> implementing transaction reservation.
> 
> So we modify the commit code to set a new BH_Shadow flag when the
> temporary shadow buffer is created, and we clear that flag once IO on
> that buffer is complete. This allows do_get_write_access() to wait only
> for the BH_Shadow bit and thus removes the dependency on data IO
> completion.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

A minor nit below.  Otherwise the patch looks good to me.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

> ---
>  fs/jbd2/commit.c           |   20 ++++++++++----------
>  fs/jbd2/journal.c          |    2 ++
>  fs/jbd2/transaction.c      |   44 +++++++++++++++++++-------------------------
>  include/linux/jbd.h        |   25 +++++++++++++++++++++++++
>  include/linux/jbd2.h       |   28 ++++++++++++++++++++++++++++
>  include/linux/jbd_common.h |   26 --------------------------
>  6 files changed, 84 insertions(+), 61 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 1a03762..4863f5b 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -30,15 +30,22 @@
>  #include <trace/events/jbd2.h>
>  
>  /*
> - * Default IO end handler for temporary BJ_IO buffer_heads.
> + * IO end handler for temporary buffer_heads handling writes to the journal.
>   */
>  static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
>  {
> +	struct buffer_head *orig_bh = bh->b_private;
> +
>  	BUFFER_TRACE(bh, "");
>  	if (uptodate)
>  		set_buffer_uptodate(bh);
>  	else
>  		clear_buffer_uptodate(bh);
> +	if (orig_bh) {
> +		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
> +		smp_mb__after_clear_bit();
> +		wake_up_bit(&orig_bh->b_state, BH_Shadow);
> +	}
>  	unlock_buffer(bh);
>  }
>  
> @@ -818,7 +825,7 @@ start_journal_io:
>  		jbd2_unfile_log_bh(bh);
>  
>  		/*
> -		 * The list contains temporary buffer heads created by
> +		 * The list contains temporary buffer heas created by
                                                      ^^^^
                                                typo: heads

Regards,
                                                - Zheng

>  		 * jbd2_journal_write_metadata_buffer().
>  		 */
>  		BUFFER_TRACE(bh, "dumping temporary bh");
> @@ -831,6 +838,7 @@ start_journal_io:
>  		bh = jh2bh(jh);
>  		clear_buffer_jwrite(bh);
>  		J_ASSERT_BH(bh, buffer_jbddirty(bh));
> +		J_ASSERT_BH(bh, !buffer_shadow(bh));
>  
>  		/* The metadata is now released for reuse, but we need
>                     to remember it against this transaction so that when
> @@ -838,14 +846,6 @@ start_journal_io:
>                     required. */
>  		JBUFFER_TRACE(jh, "file as BJ_Forget");
>  		jbd2_journal_file_buffer(jh, commit_transaction, BJ_Forget);
> -		/*
> -		 * Wake up any transactions which were waiting for this IO to
> -		 * complete. The barrier must be here so that changes by
> -		 * jbd2_journal_file_buffer() take effect before wake_up_bit()
> -		 * does the waitqueue check.
> -		 */
> -		smp_mb();
> -		wake_up_bit(&bh->b_state, BH_Unshadow);
>  		JBUFFER_TRACE(jh, "brelse shadowed buffer");
>  		__brelse(bh);
>  	}
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e03aae0..e9a9cdb 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -453,6 +453,7 @@ repeat:
>  	new_bh->b_size = bh_in->b_size;
>  	new_bh->b_bdev = journal->j_dev;
>  	new_bh->b_blocknr = blocknr;
> +	new_bh->b_private = bh_in;
>  	set_buffer_mapped(new_bh);
>  	set_buffer_dirty(new_bh);
>  
> @@ -467,6 +468,7 @@ repeat:
>  	spin_lock(&journal->j_list_lock);
>  	__jbd2_journal_file_buffer(jh_in, transaction, BJ_Shadow);
>  	spin_unlock(&journal->j_list_lock);
> +	set_buffer_shadow(bh_in);
>  	jbd_unlock_bh_state(bh_in);
>  
>  	return do_escape | (done_copy_out << 1);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index bc35899..81df09c 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -620,6 +620,12 @@ static void warn_dirty_buffer(struct buffer_head *bh)
>  	       bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr);
>  }
>  
> +static int sleep_on_shadow_bh(void *word)
> +{
> +	io_schedule();
> +	return 0;
> +}
> +
>  /*
>   * If the buffer is already part of the current transaction, then there
>   * is nothing we need to do.  If it is already part of a prior
> @@ -747,41 +753,29 @@ repeat:
>  		 * journaled.  If the primary copy is already going to
>  		 * disk then we cannot do copy-out here. */
>  
> -		if (jh->b_jlist == BJ_Shadow) {
> -			DEFINE_WAIT_BIT(wait, &bh->b_state, BH_Unshadow);
> -			wait_queue_head_t *wqh;
> -
> -			wqh = bit_waitqueue(&bh->b_state, BH_Unshadow);
> -
> +		if (buffer_shadow(bh)) {
>  			JBUFFER_TRACE(jh, "on shadow: sleep");
>  			jbd_unlock_bh_state(bh);
> -			/* commit wakes up all shadow buffers after IO */
> -			for ( ; ; ) {
> -				prepare_to_wait(wqh, &wait.wait,
> -						TASK_UNINTERRUPTIBLE);
> -				if (jh->b_jlist != BJ_Shadow)
> -					break;
> -				schedule();
> -			}
> -			finish_wait(wqh, &wait.wait);
> +			wait_on_bit(&bh->b_state, BH_Shadow,
> +				    sleep_on_shadow_bh, TASK_UNINTERRUPTIBLE);
>  			goto repeat;
>  		}
>  
> -		/* Only do the copy if the currently-owning transaction
> -		 * still needs it.  If it is on the Forget list, the
> -		 * committing transaction is past that stage.  The
> -		 * buffer had better remain locked during the kmalloc,
> -		 * but that should be true --- we hold the journal lock
> -		 * still and the buffer is already on the BUF_JOURNAL
> -		 * list so won't be flushed.
> +		/*
> +		 * Only do the copy if the currently-owning transaction still
> +		 * needs it. If buffer isn't on BJ_Metadata list, the
> +		 * committing transaction is past that stage (here we use the
> +		 * fact that BH_Shadow is set under bh_state lock together with
> +		 * refiling to BJ_Shadow list and at this point we know the
> +		 * buffer doesn't have BH_Shadow set).
>  		 *
>  		 * Subtle point, though: if this is a get_undo_access,
>  		 * then we will be relying on the frozen_data to contain
>  		 * the new value of the committed_data record after the
>  		 * transaction, so we HAVE to force the frozen_data copy
> -		 * in that case. */
> -
> -		if (jh->b_jlist != BJ_Forget || force_copy) {
> +		 * in that case.
> +		 */
> +		if (jh->b_jlist == BJ_Metadata || force_copy) {
>  			JBUFFER_TRACE(jh, "generate frozen data");
>  			if (!frozen_buffer) {
>  				JBUFFER_TRACE(jh, "allocate memory for buffer");
> diff --git a/include/linux/jbd.h b/include/linux/jbd.h
> index c8f3297..1c36b0c 100644
> --- a/include/linux/jbd.h
> +++ b/include/linux/jbd.h
> @@ -244,6 +244,31 @@ typedef struct journal_superblock_s
>  
>  #include <linux/fs.h>
>  #include <linux/sched.h>
> +
> +enum jbd_state_bits {
> +	BH_JBD			/* Has an attached ext3 journal_head */
> +	  = BH_PrivateStart,
> +	BH_JWrite,		/* Being written to log (@@@ DEBUGGING) */
> +	BH_Freed,		/* Has been freed (truncated) */
> +	BH_Revoked,		/* Has been revoked from the log */
> +	BH_RevokeValid,		/* Revoked flag is valid */
> +	BH_JBDDirty,		/* Is dirty but journaled */
> +	BH_State,		/* Pins most journal_head state */
> +	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
> +	BH_Unshadow,		/* Dummy bit, for BJ_Shadow wakeup filtering */
> +	BH_JBDPrivateStart,	/* First bit available for private use by FS */
> +};
> +
> +BUFFER_FNS(JBD, jbd)
> +BUFFER_FNS(JWrite, jwrite)
> +BUFFER_FNS(JBDDirty, jbddirty)
> +TAS_BUFFER_FNS(JBDDirty, jbddirty)
> +BUFFER_FNS(Revoked, revoked)
> +TAS_BUFFER_FNS(Revoked, revoked)
> +BUFFER_FNS(RevokeValid, revokevalid)
> +TAS_BUFFER_FNS(RevokeValid, revokevalid)
> +BUFFER_FNS(Freed, freed)
> +
>  #include <linux/jbd_common.h>
>  
>  #define J_ASSERT(assert)	BUG_ON(!(assert))
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 4584518..be5115f 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -302,6 +302,34 @@ typedef struct journal_superblock_s
>  
>  #include <linux/fs.h>
>  #include <linux/sched.h>
> +
> +enum jbd_state_bits {
> +	BH_JBD			/* Has an attached ext3 journal_head */
> +	  = BH_PrivateStart,
> +	BH_JWrite,		/* Being written to log (@@@ DEBUGGING) */
> +	BH_Freed,		/* Has been freed (truncated) */
> +	BH_Revoked,		/* Has been revoked from the log */
> +	BH_RevokeValid,		/* Revoked flag is valid */
> +	BH_JBDDirty,		/* Is dirty but journaled */
> +	BH_State,		/* Pins most journal_head state */
> +	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
> +	BH_Shadow,		/* IO on shadow buffer is running */
> +	BH_Verified,		/* Metadata block has been verified ok */
> +	BH_JBDPrivateStart,	/* First bit available for private use by FS */
> +};
> +
> +BUFFER_FNS(JBD, jbd)
> +BUFFER_FNS(JWrite, jwrite)
> +BUFFER_FNS(JBDDirty, jbddirty)
> +TAS_BUFFER_FNS(JBDDirty, jbddirty)
> +BUFFER_FNS(Revoked, revoked)
> +TAS_BUFFER_FNS(Revoked, revoked)
> +BUFFER_FNS(RevokeValid, revokevalid)
> +TAS_BUFFER_FNS(RevokeValid, revokevalid)
> +BUFFER_FNS(Freed, freed)
> +BUFFER_FNS(Shadow, shadow)
> +BUFFER_FNS(Verified, verified)
> +
>  #include <linux/jbd_common.h>
>  
>  #define J_ASSERT(assert)	BUG_ON(!(assert))
> diff --git a/include/linux/jbd_common.h b/include/linux/jbd_common.h
> index 6133679..b1f7089 100644
> --- a/include/linux/jbd_common.h
> +++ b/include/linux/jbd_common.h
> @@ -1,32 +1,6 @@
>  #ifndef _LINUX_JBD_STATE_H
>  #define _LINUX_JBD_STATE_H
>  
> -enum jbd_state_bits {
> -	BH_JBD			/* Has an attached ext3 journal_head */
> -	  = BH_PrivateStart,
> -	BH_JWrite,		/* Being written to log (@@@ DEBUGGING) */
> -	BH_Freed,		/* Has been freed (truncated) */
> -	BH_Revoked,		/* Has been revoked from the log */
> -	BH_RevokeValid,		/* Revoked flag is valid */
> -	BH_JBDDirty,		/* Is dirty but journaled */
> -	BH_State,		/* Pins most journal_head state */
> -	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
> -	BH_Unshadow,		/* Dummy bit, for BJ_Shadow wakeup filtering */
> -	BH_Verified,		/* Metadata block has been verified ok */
> -	BH_JBDPrivateStart,	/* First bit available for private use by FS */
> -};
> -
> -BUFFER_FNS(JBD, jbd)
> -BUFFER_FNS(JWrite, jwrite)
> -BUFFER_FNS(JBDDirty, jbddirty)
> -TAS_BUFFER_FNS(JBDDirty, jbddirty)
> -BUFFER_FNS(Revoked, revoked)
> -TAS_BUFFER_FNS(Revoked, revoked)
> -BUFFER_FNS(RevokeValid, revokevalid)
> -TAS_BUFFER_FNS(RevokeValid, revokevalid)
> -BUFFER_FNS(Freed, freed)
> -BUFFER_FNS(Verified, verified)
> -
>  static inline struct buffer_head *jh2bh(struct journal_head *jh)
>  {
>  	return jh->b_bh;
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 08/29] jbd2: Remove outdated comment
  2013-04-08 21:32 ` [PATCH 08/29] jbd2: Remove outdated comment Jan Kara
@ 2013-05-03 14:20   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-03 14:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:13PM +0200, Jan Kara wrote:
> The comment about credit estimates isn't true anymore. We do what the
> comment describes now.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/jbd2/transaction.c |   10 ----------
>  1 files changed, 0 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 81df09c..74cfbd3 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -283,16 +283,6 @@ repeat:
>  	 * reduce the free space arbitrarily.  Be careful to account for
>  	 * those buffers when checkpointing.
>  	 */
> -
> -	/*
> -	 * @@@ AKPM: This seems rather over-defensive.  We're giving commit
> -	 * a _lot_ of headroom: 1/4 of the journal plus the size of
> -	 * the committing transaction.  Really, we only need to give it
> -	 * committing_transaction->t_outstanding_credits plus "enough" for
> -	 * the log control blocks.
> -	 * Also, this test is inconsistent with the matching one in
> -	 * jbd2_journal_extend().
> -	 */
>  	if (__jbd2_log_space_left(journal) < jbd_space_needed(journal)) {
>  		jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle);
>  		atomic_sub(nblocks, &transaction->t_outstanding_credits);
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 07/29] jbd2: Refine waiting for shadow buffers
  2013-05-03 14:16   ` Zheng Liu
@ 2013-05-03 20:44     ` Jan Kara
  0 siblings, 0 replies; 76+ messages in thread
From: Jan Kara @ 2013-05-03 20:44 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Ted Tso, linux-ext4

On Fri 03-05-13 22:16:13, Zheng Liu wrote:
> On Mon, Apr 08, 2013 at 11:32:12PM +0200, Jan Kara wrote:
> > Currently when we add a buffer to a transaction, we wait until the
> > buffer is removed from BJ_Shadow list (so that we prevent any changes to
> > the buffer that is just written to the journal). This can take
> > unnecessarily long as a lot happens between the time the buffer is
> > submitted to the journal and the time when we remove the buffer from
> > BJ_Shadow list (e.g.  we wait for all data buffers in the transaction,
> > we issue a cache flush etc.). Also this creates a dependency of
> > do_get_write_access() on transaction commit (namely waiting for data IO
> > to complete) which we want to avoid when implementing transaction
> > reservation.
> > 
> > So we modify commit code to set new BH_Shadow flag when temporary
> > shadowing buffer is created and we clear that flag once IO on that
> > buffer is complete. This allows do_get_write_access() to wait only for
> > BH_Shadow bit and thus removes the dependency on data IO completion.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> A minor nit below.  Otherwise the patch looks good to me.
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
  Thanks for review! I'll fix that typo.

							Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 09/29] jbd2: Cleanup needed free block estimates when starting a transaction
  2013-04-08 21:32 ` [PATCH 09/29] jbd2: Cleanup needed free block estimates when starting a transaction Jan Kara
@ 2013-05-05  8:17   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-05  8:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:14PM +0200, Jan Kara wrote:
> __jbd2_log_space_left() and jbd_space_needed() were kind of odd.
> jbd_space_needed() accounted also credits needed for currently committing
> transaction while it didn't account for credits needed for control blocks.
> __jbd2_log_space_left() then accounted for control blocks as a fraction of free
> space. Since results of these two functions are always only compared against
> each other, this works correctly but is somewhat strange. Move the estimates so
> that jbd_space_needed() returns number of blocks needed for a transaction
> including control blocks and __jbd2_log_space_left() returns free space in the
> journal (with the committing transaction already subtracted). Rename functions
> to jbd2_log_space_left() and jbd2_space_needed() while we are changing them.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/jbd2/checkpoint.c  |    8 ++++----
>  fs/jbd2/journal.c     |   29 -----------------------------
>  fs/jbd2/transaction.c |    9 +++++----
>  include/linux/jbd2.h  |   32 ++++++++++++++++++++++++++------
>  4 files changed, 35 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 65ec076..a572383 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -120,8 +120,8 @@ void __jbd2_log_wait_for_space(journal_t *journal)
>  	int nblocks, space_left;
>  	/* assert_spin_locked(&journal->j_state_lock); */
>  
> -	nblocks = jbd_space_needed(journal);
> -	while (__jbd2_log_space_left(journal) < nblocks) {
> +	nblocks = jbd2_space_needed(journal);
> +	while (jbd2_log_space_left(journal) < nblocks) {
>  		if (journal->j_flags & JBD2_ABORT)
>  			return;
>  		write_unlock(&journal->j_state_lock);
> @@ -140,8 +140,8 @@ void __jbd2_log_wait_for_space(journal_t *journal)
>  		 */
>  		write_lock(&journal->j_state_lock);
>  		spin_lock(&journal->j_list_lock);
> -		nblocks = jbd_space_needed(journal);
> -		space_left = __jbd2_log_space_left(journal);
> +		nblocks = jbd2_space_needed(journal);
> +		space_left = jbd2_log_space_left(journal);
>  		if (space_left < nblocks) {
>  			int chkpt = journal->j_checkpoint_transactions != NULL;
>  			tid_t tid = 0;
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e9a9cdb..e6f14e0 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -480,35 +480,6 @@ repeat:
>   */
>  
>  /*
> - * __jbd2_log_space_left: Return the number of free blocks left in the journal.
> - *
> - * Called with the journal already locked.
> - *
> - * Called under j_state_lock
> - */
> -
> -int __jbd2_log_space_left(journal_t *journal)
> -{
> -	int left = journal->j_free;
> -
> -	/* assert_spin_locked(&journal->j_state_lock); */
> -
> -	/*
> -	 * Be pessimistic here about the number of those free blocks which
> -	 * might be required for log descriptor control blocks.
> -	 */
> -
> -#define MIN_LOG_RESERVED_BLOCKS 32 /* Allow for rounding errors */
> -
> -	left -= MIN_LOG_RESERVED_BLOCKS;
> -
> -	if (left <= 0)
> -		return 0;
> -	left -= (left >> 3);
> -	return left;
> -}
> -
> -/*
>   * Called with j_state_lock locked for writing.
>   * Returns true if a transaction commit was started.
>   */
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 74cfbd3..aee40c9 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -283,12 +283,12 @@ repeat:
>  	 * reduce the free space arbitrarily.  Be careful to account for
>  	 * those buffers when checkpointing.
>  	 */
> -	if (__jbd2_log_space_left(journal) < jbd_space_needed(journal)) {
> +	if (jbd2_log_space_left(journal) < jbd2_space_needed(journal)) {
>  		jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle);
>  		atomic_sub(nblocks, &transaction->t_outstanding_credits);
>  		read_unlock(&journal->j_state_lock);
>  		write_lock(&journal->j_state_lock);
> -		if (__jbd2_log_space_left(journal) < jbd_space_needed(journal))
> +		if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
>  			__jbd2_log_wait_for_space(journal);
>  		write_unlock(&journal->j_state_lock);
>  		goto repeat;
> @@ -306,7 +306,7 @@ repeat:
>  	jbd_debug(4, "Handle %p given %d credits (total %d, free %d)\n",
>  		  handle, nblocks,
>  		  atomic_read(&transaction->t_outstanding_credits),
> -		  __jbd2_log_space_left(journal));
> +		  jbd2_log_space_left(journal));
>  	read_unlock(&journal->j_state_lock);
>  
>  	lock_map_acquire(&handle->h_lockdep_map);
> @@ -442,7 +442,8 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
>  		goto unlock;
>  	}
>  
> -	if (wanted > __jbd2_log_space_left(journal)) {
> +	if (wanted + (wanted >> JBD2_CONTROL_BLOCKS_SHIFT) >
> +	    jbd2_log_space_left(journal)) {
>  		jbd_debug(3, "denied handle %p %d blocks: "
>  			  "insufficient log space\n", handle, nblocks);
>  		goto unlock;
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index be5115f..9197d1b 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1219,7 +1219,6 @@ extern void	jbd2_clear_buffer_revoked_flags(journal_t *journal);
>   * transitions on demand.
>   */
>  
> -int __jbd2_log_space_left(journal_t *); /* Called with journal locked */
>  int jbd2_log_start_commit(journal_t *journal, tid_t tid);
>  int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
>  int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
> @@ -1289,16 +1288,37 @@ extern int jbd2_journal_blocks_per_page(struct inode *inode);
>  extern size_t journal_tag_bytes(journal_t *journal);
>  
>  /*
> + * We reserve t_outstanding_credits >> JBD2_CONTROL_BLOCKS_SHIFT for
> + * transaction control blocks.
> + */
> +#define JBD2_CONTROL_BLOCKS_SHIFT 5
> +
> +/*
>   * Return the minimum number of blocks which must be free in the journal
>   * before a new transaction may be started.  Must be called under j_state_lock.
>   */
> -static inline int jbd_space_needed(journal_t *journal)
> +static inline int jbd2_space_needed(journal_t *journal)
>  {
>  	int nblocks = journal->j_max_transaction_buffers;
> -	if (journal->j_committing_transaction)
> -		nblocks += atomic_read(&journal->j_committing_transaction->
> -				       t_outstanding_credits);
> -	return nblocks;
> +	return nblocks + (nblocks >> JBD2_CONTROL_BLOCKS_SHIFT);
> +}
> +
> +/*
> + * Return number of free blocks in the log. Must be called under j_state_lock.
> + */
> +static inline unsigned long jbd2_log_space_left(journal_t *journal)
> +{
> +	/* Allow for rounding errors */
> +	unsigned long free = journal->j_free - 32;
> +
> +	if (journal->j_committing_transaction) {
> +		unsigned long committing = atomic_read(&journal->
> +			j_committing_transaction->t_outstanding_credits);
> +
> +		/* Transaction + control blocks */
> +		free -= committing + (committing >> JBD2_CONTROL_BLOCKS_SHIFT);
> +	}
> +	return free;
>  }
>  
>  /*
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 10/29] jbd2: Fix race in t_outstanding_credits update in jbd2_journal_extend()
  2013-04-08 21:32 ` [PATCH 10/29] jbd2: Fix race in t_outstanding_credits update in jbd2_journal_extend() Jan Kara
@ 2013-05-05  8:37   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-05  8:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:15PM +0200, Jan Kara wrote:
> jbd2_journal_extend() first checked whether the transaction could accept
> extending the handle with more credits and only then added the credits to
> t_outstanding_credits. This could race with start_this_handle() adding
> another handle to the transaction and thus overbook it. Make
> jbd2_journal_extend() use atomic_add_return() to close the race.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/jbd2/transaction.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index aee40c9..9639e47 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -434,11 +434,13 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
>  	}
>  
>  	spin_lock(&transaction->t_handle_lock);
> -	wanted = atomic_read(&transaction->t_outstanding_credits) + nblocks;
> +	wanted = atomic_add_return(nblocks,
> +				   &transaction->t_outstanding_credits);
>  
>  	if (wanted > journal->j_max_transaction_buffers) {
>  		jbd_debug(3, "denied handle %p %d blocks: "
>  			  "transaction too large\n", handle, nblocks);
> +		atomic_sub(nblocks, &transaction->t_outstanding_credits);
>  		goto unlock;
>  	}
>  
> @@ -446,6 +448,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
>  	    jbd2_log_space_left(journal)) {
>  		jbd_debug(3, "denied handle %p %d blocks: "
>  			  "insufficient log space\n", handle, nblocks);
> +		atomic_sub(nblocks, &transaction->t_outstanding_credits);
>  		goto unlock;
>  	}
>  
> @@ -457,7 +460,6 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
>  
>  	handle->h_buffer_credits += nblocks;
>  	handle->h_requested_credits += nblocks;
> -	atomic_add(nblocks, &transaction->t_outstanding_credits);
>  	result = 0;
>  
>  	jbd_debug(3, "extended handle %p by %d\n", handle, nblocks);
> -- 
> 1.7.1
> 


* Re: [PATCH 11/29] jbd2: Remove unused waitqueues
  2013-04-08 21:32 ` [PATCH 11/29] jbd2: Remove unused waitqueues Jan Kara
@ 2013-05-05  8:41   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-05  8:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:16PM +0200, Jan Kara wrote:
> j_wait_logspace and j_wait_checkpoint are unused. Remove them.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/jbd2/checkpoint.c |    4 ----
>  fs/jbd2/journal.c    |    2 --
>  include/linux/jbd2.h |    8 --------
>  3 files changed, 0 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index a572383..75a15f3 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -625,10 +625,6 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
>  
>  	__jbd2_journal_drop_transaction(journal, transaction);
>  	jbd2_journal_free_transaction(transaction);
> -
> -	/* Just in case anybody was waiting for more transactions to be
> -           checkpointed... */
> -	wake_up(&journal->j_wait_logspace);
>  	ret = 1;
>  out:
>  	return ret;
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e6f14e0..63e2929 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -998,9 +998,7 @@ static journal_t * journal_init_common (void)
>  		return NULL;
>  
>  	init_waitqueue_head(&journal->j_wait_transaction_locked);
> -	init_waitqueue_head(&journal->j_wait_logspace);
>  	init_waitqueue_head(&journal->j_wait_done_commit);
> -	init_waitqueue_head(&journal->j_wait_checkpoint);
>  	init_waitqueue_head(&journal->j_wait_commit);
>  	init_waitqueue_head(&journal->j_wait_updates);
>  	mutex_init(&journal->j_barrier);
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 9197d1b..ad4b3bb 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -686,9 +686,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>   *  waiting for checkpointing
>   * @j_wait_transaction_locked: Wait queue for waiting for a locked transaction
>   *  to start committing, or for a barrier lock to be released
> - * @j_wait_logspace: Wait queue for waiting for checkpointing to complete
>   * @j_wait_done_commit: Wait queue for waiting for commit to complete
> - * @j_wait_checkpoint:  Wait queue to trigger checkpointing
>   * @j_wait_commit: Wait queue to trigger commit
>   * @j_wait_updates: Wait queue to wait for updates to complete
>   * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
> @@ -793,15 +791,9 @@ struct journal_s
>  	 */
>  	wait_queue_head_t	j_wait_transaction_locked;
>  
> -	/* Wait queue for waiting for checkpointing to complete */
> -	wait_queue_head_t	j_wait_logspace;
> -
>  	/* Wait queue for waiting for commit to complete */
>  	wait_queue_head_t	j_wait_done_commit;
>  
> -	/* Wait queue to trigger checkpointing */
> -	wait_queue_head_t	j_wait_checkpoint;
> -
>  	/* Wait queue to trigger commit */
>  	wait_queue_head_t	j_wait_commit;
>  
> -- 
> 1.7.1
> 


* Re: [PATCH 12/29] jbd2: Transaction reservation support
  2013-04-08 21:32 ` [PATCH 12/29] jbd2: Transaction reservation support Jan Kara
@ 2013-05-05  9:39   ` Zheng Liu
  2013-05-06 12:49     ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Zheng Liu @ 2013-05-05  9:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:17PM +0200, Jan Kara wrote:
> In some cases we cannot start a transaction because of locking constraints and
> passing a started transaction into those places is not handy either because we
> could block the transaction commit for too long. Transaction reservation is
> designed to solve these issues. It reserves a handle with a given number of
> credits in the journal and the handle can later be attached to the running
> transaction without blocking on commit or checkpointing. Reserved handles do
> not block transaction commit in any way; they only reduce the maximum size of
> the running transaction (because we must always be prepared to accommodate a
> request to attach a reserved handle).
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Some minor nits below.  Otherwise the patch looks good to me.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

> ---
>  fs/jbd2/commit.c      |    6 +
>  fs/jbd2/journal.c     |    2 +
>  fs/jbd2/transaction.c |  289 +++++++++++++++++++++++++++++++++++++------------
>  include/linux/jbd2.h  |   21 ++++-
>  4 files changed, 245 insertions(+), 73 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 4863f5b..59c572e 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -522,6 +522,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  	 */
>  	jbd2_journal_switch_revoke_table(journal);
>  
> +	/*
> +	 * Reserved credits cannot be claimed anymore, free them
> +	 */
> +	atomic_sub(atomic_read(&journal->j_reserved_credits),
> +		   &commit_transaction->t_outstanding_credits);
> +
>  	trace_jbd2_commit_flushing(journal, commit_transaction);
>  	stats.run.rs_flushing = jiffies;
>  	stats.run.rs_locked = jbd2_time_diff(stats.run.rs_locked,
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 63e2929..04c52ac 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1001,6 +1001,7 @@ static journal_t * journal_init_common (void)
>  	init_waitqueue_head(&journal->j_wait_done_commit);
>  	init_waitqueue_head(&journal->j_wait_commit);
>  	init_waitqueue_head(&journal->j_wait_updates);
> +	init_waitqueue_head(&journal->j_wait_reserved);
>  	mutex_init(&journal->j_barrier);
>  	mutex_init(&journal->j_checkpoint_mutex);
>  	spin_lock_init(&journal->j_revoke_lock);
> @@ -1010,6 +1011,7 @@ static journal_t * journal_init_common (void)
>  	journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE);
>  	journal->j_min_batch_time = 0;
>  	journal->j_max_batch_time = 15000; /* 15ms */
> +	atomic_set(&journal->j_reserved_credits, 0);
>  
>  	/* The journal is marked for error until we succeed with recovery! */
>  	journal->j_flags = JBD2_ABORT;
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 9639e47..036c01c 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -89,7 +89,8 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction)
>  	transaction->t_expires = jiffies + journal->j_commit_interval;
>  	spin_lock_init(&transaction->t_handle_lock);
>  	atomic_set(&transaction->t_updates, 0);
> -	atomic_set(&transaction->t_outstanding_credits, 0);
> +	atomic_set(&transaction->t_outstanding_credits,
> +		   atomic_read(&journal->j_reserved_credits));
>  	atomic_set(&transaction->t_handle_count, 0);
>  	INIT_LIST_HEAD(&transaction->t_inode_list);
>  	INIT_LIST_HEAD(&transaction->t_private_list);
> @@ -141,6 +142,91 @@ static inline void update_t_max_wait(transaction_t *transaction,
>  }
>  
>  /*
> + * Wait until running transaction passes T_LOCKED state. Also starts the commit
> + * if needed. The function expects running transaction to exist and releases
> + * j_state_lock.
> + */
> +static void wait_transaction_locked(journal_t *journal)
> +	__releases(journal->j_state_lock)
> +{
> +	DEFINE_WAIT(wait);
> +	int need_to_start;
> +	tid_t tid = journal->j_running_transaction->t_tid;
> +
> +	prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
> +			TASK_UNINTERRUPTIBLE);
> +	need_to_start = !tid_geq(journal->j_commit_request, tid);
> +	read_unlock(&journal->j_state_lock);
> +	if (need_to_start)
> +		jbd2_log_start_commit(journal, tid);
> +	schedule();
> +	finish_wait(&journal->j_wait_transaction_locked, &wait);
> +}
> +
> +/*
> + * Wait until we can add credits for handle to the running transaction.  Called
> + * with j_state_lock held for reading. Returns 0 if handle joined the running
> + * transaction. Returns 1 if we had to wait, j_state_lock is dropped, and
> + * caller must retry.
> + */
> +static int add_transaction_credits(journal_t *journal, handle_t *handle)
> +{
> +	transaction_t *t = journal->j_running_transaction;
> +	int nblocks = handle->h_buffer_credits;
> +	int needed;
> +
> +	/*
> +	 * If the current transaction is locked down for commit, wait
> +	 * for the lock to be released.
> +	 */
> +	if (t->t_state == T_LOCKED) {
> +		wait_transaction_locked(journal);
> +		return 1;
> +	}
> +
> +	/*
> +	 * If there is not enough space left in the log to write all
> +	 * potential buffers requested by this operation, we need to
> +	 * stall pending a log checkpoint to free some more log space.
> +	 */
> +	needed = atomic_add_return(nblocks, &t->t_outstanding_credits);
> +	if (needed > journal->j_max_transaction_buffers) {
> +		/*
> +		 * If the current transaction is already too large,
> +		 * then start to commit it: we can then go back and
> +		 * attach this handle to a new transaction.
> +		 */
> +		jbd_debug(2, "Handle %p starting new commit...\n", handle);
> +		atomic_sub(nblocks, &t->t_outstanding_credits);
> +		wait_transaction_locked(journal);
> +		return 1;
> +	}
> +
> +	/*
> +	 * The commit code assumes that it can get enough log space
> +	 * without forcing a checkpoint.  This is *critical* for
> +	 * correctness: a checkpoint of a buffer which is also
> +	 * associated with a committing transaction creates a deadlock,
> +	 * so commit simply cannot force through checkpoints.
> +	 *
> +	 * We must therefore ensure the necessary space in the journal
> +	 * *before* starting to dirty potentially checkpointed buffers
> +	 * in the new transaction.
> +	 */
> +	if (jbd2_log_space_left(journal) < jbd2_space_needed(journal)) {
> +		jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle);
> +		atomic_sub(nblocks, &t->t_outstanding_credits);
> +		read_unlock(&journal->j_state_lock);
> +		write_lock(&journal->j_state_lock);
> +		if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
> +			__jbd2_log_wait_for_space(journal);
> +		write_unlock(&journal->j_state_lock);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/*
>   * start_this_handle: Given a handle, deal with any locking or stalling
>   * needed to make sure that there is enough journal space for the handle
>   * to begin.  Attach the handle to a transaction and set up the
> @@ -151,12 +237,14 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
>  			     gfp_t gfp_mask)
>  {
>  	transaction_t	*transaction, *new_transaction = NULL;
> -	tid_t		tid;
> -	int		needed, need_to_start;
>  	int		nblocks = handle->h_buffer_credits;
>  	unsigned long ts = jiffies;
>  
> -	if (nblocks > journal->j_max_transaction_buffers) {
> +	/*
> +	 * 1/2 of transaction can be reserved so we can practically handle
> +	 * only 1/2 of maximum transaction size per operation
> +	 */

Sorry, but I don't understand why we only allow reserving 1/2 of the
maximum transaction size here.

> +	if (nblocks > journal->j_max_transaction_buffers / 2) {
>  		printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
>  		       current->comm, nblocks,
>  		       journal->j_max_transaction_buffers);
> @@ -223,75 +311,18 @@ repeat:
>  
>  	transaction = journal->j_running_transaction;
>  
> -	/*
> -	 * If the current transaction is locked down for commit, wait for the
> -	 * lock to be released.
> -	 */
> -	if (transaction->t_state == T_LOCKED) {
> -		DEFINE_WAIT(wait);
> -
> -		prepare_to_wait(&journal->j_wait_transaction_locked,
> -					&wait, TASK_UNINTERRUPTIBLE);
> -		read_unlock(&journal->j_state_lock);
> -		schedule();
> -		finish_wait(&journal->j_wait_transaction_locked, &wait);
> -		goto repeat;
> -	}
> -
> -	/*
> -	 * If there is not enough space left in the log to write all potential
> -	 * buffers requested by this operation, we need to stall pending a log
> -	 * checkpoint to free some more log space.
> -	 */
> -	needed = atomic_add_return(nblocks,
> -				   &transaction->t_outstanding_credits);
> -
> -	if (needed > journal->j_max_transaction_buffers) {
> +	if (!handle->h_reserved) {

Maybe we need to add a comment here because we release j_state_lock in
add_transaction_credits.

Regards,
                                                - Zheng

> +		if (add_transaction_credits(journal, handle))
> +			goto repeat;
> +	} else {
>  		/*
> -		 * If the current transaction is already too large, then start
> -		 * to commit it: we can then go back and attach this handle to
> -		 * a new transaction.
> +		 * We have handle reserved so we are allowed to join T_LOCKED
> +		 * transaction and we don't have to check for transaction size
> +		 * and journal space.
>  		 */
> -		DEFINE_WAIT(wait);
> -
> -		jbd_debug(2, "Handle %p starting new commit...\n", handle);
> -		atomic_sub(nblocks, &transaction->t_outstanding_credits);
> -		prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
> -				TASK_UNINTERRUPTIBLE);
> -		tid = transaction->t_tid;
> -		need_to_start = !tid_geq(journal->j_commit_request, tid);
> -		read_unlock(&journal->j_state_lock);
> -		if (need_to_start)
> -			jbd2_log_start_commit(journal, tid);
> -		schedule();
> -		finish_wait(&journal->j_wait_transaction_locked, &wait);
> -		goto repeat;
> -	}
> -
> -	/*
> -	 * The commit code assumes that it can get enough log space
> -	 * without forcing a checkpoint.  This is *critical* for
> -	 * correctness: a checkpoint of a buffer which is also
> -	 * associated with a committing transaction creates a deadlock,
> -	 * so commit simply cannot force through checkpoints.
> -	 *
> -	 * We must therefore ensure the necessary space in the journal
> -	 * *before* starting to dirty potentially checkpointed buffers
> -	 * in the new transaction.
> -	 *
> -	 * The worst part is, any transaction currently committing can
> -	 * reduce the free space arbitrarily.  Be careful to account for
> -	 * those buffers when checkpointing.
> -	 */
> -	if (jbd2_log_space_left(journal) < jbd2_space_needed(journal)) {
> -		jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle);
> -		atomic_sub(nblocks, &transaction->t_outstanding_credits);
> -		read_unlock(&journal->j_state_lock);
> -		write_lock(&journal->j_state_lock);
> -		if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
> -			__jbd2_log_wait_for_space(journal);
> -		write_unlock(&journal->j_state_lock);
> -		goto repeat;
> +		atomic_sub(nblocks, &journal->j_reserved_credits);
> +		wake_up(&journal->j_wait_reserved);
> +		handle->h_reserved = 0;
>  	}
>  
>  	/* OK, account for the buffers that this operation expects to
> @@ -390,6 +421,122 @@ handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
>  }
>  EXPORT_SYMBOL(jbd2_journal_start);
>  
> +/**
> + * handle_t *jbd2_journal_reserve(journal_t *journal, int nblocks)
> + * @journal: journal to reserve transaction on.
> + * @nblocks: number of blocks we might modify
> + *
> + * This function reserves transaction with @nblocks blocks in @journal.  The
> + * function waits for enough journal space to be available and possibly also
> + * for some reservations to be converted to real transactions if there are too
> + * many of them. Note that this means that calling this function while having
> + * another transaction started or reserved can cause deadlock. The returned
> + * handle cannot be used for anything until it is started using
> + * jbd2_journal_start_reserved().
> + */
> +handle_t *jbd2_journal_reserve(journal_t *journal, int nblocks,
> +			       unsigned int type, unsigned int line_no)
> +{
> +	handle_t *handle;
> +	unsigned long wanted;
> +
> +	handle = new_handle(nblocks);
> +	if (!handle)
> +		return ERR_PTR(-ENOMEM);
> +	handle->h_journal = journal;
> +	handle->h_reserved = 1;
> +	handle->h_type = type;
> +	handle->h_line_no = line_no;
> +
> +repeat:
> +	/*
> +	 * We need j_state_lock early to avoid transaction creation to race
> +	 * with us and using elevated j_reserved_credits.
> +	 */
> +	read_lock(&journal->j_state_lock);
> +	wanted = atomic_add_return(nblocks, &journal->j_reserved_credits);
> +	/* We allow at most half of a transaction to be reserved */
> +	if (wanted > journal->j_max_transaction_buffers / 2) {
> +		atomic_sub(nblocks, &journal->j_reserved_credits);
> +		read_unlock(&journal->j_state_lock);
> +		wait_event(journal->j_wait_reserved,
> +			   atomic_read(&journal->j_reserved_credits) + nblocks
> +			   <= journal->j_max_transaction_buffers / 2);
> +		goto repeat;
> +	}
> +	if (journal->j_running_transaction) {
> +		transaction_t *t = journal->j_running_transaction;
> +
> +		wanted = atomic_add_return(nblocks,
> +					   &t->t_outstanding_credits);
> +		if (wanted > journal->j_max_transaction_buffers) {
> +			atomic_sub(nblocks, &t->t_outstanding_credits);
> +			atomic_sub(nblocks, &journal->j_reserved_credits);
> +			wait_transaction_locked(journal);
> +			goto repeat;
> +		}
> +	}
> +	read_unlock(&journal->j_state_lock);
> +
> +	return handle;
> +}
> +EXPORT_SYMBOL(jbd2_journal_reserve);
> +
> +void jbd2_journal_free_reserved(handle_t *handle)
> +{
> +	journal_t *journal = handle->h_journal;
> +
> +	atomic_sub(handle->h_buffer_credits, &journal->j_reserved_credits);
> +	wake_up(&journal->j_wait_reserved);
> +	jbd2_free_handle(handle);
> +}
> +EXPORT_SYMBOL(jbd2_journal_free_reserved);
> +
> +/**
> + * int jbd2_journal_start_reserved(handle_t *handle) - start reserved handle
> + * @handle: handle to start
> + *
> + * Start handle that has been previously reserved with jbd2_journal_reserve().
> + * This attaches @handle to the running transaction (or creates one if there's
> + * no transaction running). Unlike jbd2_journal_start() this function cannot
> + * block on journal commit, checkpointing, or similar stuff. It can block on
> + * memory allocation or frozen journal though.
> + *
> + * Return 0 on success, non-zero on error - handle is freed in that case.
> + */
> +int jbd2_journal_start_reserved(handle_t *handle)
> +{
> +	journal_t *journal = handle->h_journal;
> +	int ret = -EIO;
> +
> +	if (WARN_ON(!handle->h_reserved)) {
> +		/* Someone passed in normal handle? Just stop it. */
> +		jbd2_journal_stop(handle);
> +		return ret;
> +	}
> +	/*
> +	 * Usefulness of mixing of reserved and unreserved handles is
> +	 * questionable. So far nobody seems to need it so just error out.
> +	 */
> +	if (WARN_ON(current->journal_info)) {
> +		jbd2_journal_free_reserved(handle);
> +		return ret;
> +	}
> +
> +	handle->h_journal = NULL;
> +	current->journal_info = handle;
> +	/*
> +	 * GFP_NOFS is here because callers are likely from writeback or
> +	 * similarly constrained call sites
> +	 */
> +	ret = start_this_handle(journal, handle, GFP_NOFS);
> +	if (ret < 0) {
> +		current->journal_info = NULL;
> +		jbd2_journal_free_reserved(handle);
> +	}
> +	return ret;
> +}
> +EXPORT_SYMBOL(jbd2_journal_start_reserved);
>  
>  /**
>   * int jbd2_journal_extend() - extend buffer credits.
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index ad4b3bb..b3c1283 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -410,8 +410,12 @@ struct jbd2_revoke_table_s;
>  
>  struct jbd2_journal_handle
>  {
> -	/* Which compound transaction is this update a part of? */
> -	transaction_t		*h_transaction;
> +	union {
> +		/* Which compound transaction is this update a part of? */
> +		transaction_t	*h_transaction;
> +		/* Which journal the handle belongs to - used iff h_reserved set */
> +		journal_t	*h_journal;
> +	};
>  
>  	/* Number of remaining buffers we are allowed to dirty: */
>  	int			h_buffer_credits;
> @@ -426,6 +430,7 @@ struct jbd2_journal_handle
>  	/* Flags [no locking] */
>  	unsigned int	h_sync:		1;	/* sync-on-close */
>  	unsigned int	h_jdata:	1;	/* force data journaling */
> +	unsigned int	h_reserved:	1;	/* handle with reserved credits */
>  	unsigned int	h_aborted:	1;	/* fatal error on handle */
>  	unsigned int	h_type:		8;	/* for handle statistics */
>  	unsigned int	h_line_no:	16;	/* for handle statistics */
> @@ -689,6 +694,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>   * @j_wait_done_commit: Wait queue for waiting for commit to complete
>   * @j_wait_commit: Wait queue to trigger commit
>   * @j_wait_updates: Wait queue to wait for updates to complete
> + * @j_wait_reserved: Wait queue to wait for reserved buffer credits to drop
>   * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
>   * @j_head: Journal head - identifies the first unused block in the journal
>   * @j_tail: Journal tail - identifies the oldest still-used block in the
> @@ -702,6 +708,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>   *     journal
>   * @j_fs_dev: Device which holds the client fs.  For internal journal this will
>   *     be equal to j_dev
> + * @j_reserved_credits: Number of buffers reserved from the running transaction
>   * @j_maxlen: Total maximum capacity of the journal region on disk.
>   * @j_list_lock: Protects the buffer lists and internal buffer state.
>   * @j_inode: Optional inode where we store the journal.  If present, all journal
> @@ -800,6 +807,9 @@ struct journal_s
>  	/* Wait queue to wait for updates to complete */
>  	wait_queue_head_t	j_wait_updates;
>  
> +	/* Wait queue to wait for reserved buffer credits to drop */
> +	wait_queue_head_t	j_wait_reserved;
> +
>  	/* Semaphore for locking against concurrent checkpoints */
>  	struct mutex		j_checkpoint_mutex;
>  
> @@ -854,6 +864,9 @@ struct journal_s
>  	/* Total maximum capacity of the journal region on disk. */
>  	unsigned int		j_maxlen;
>  
> +	/* Number of buffers reserved from the running transaction */
> +	atomic_t		j_reserved_credits;
> +
>  	/*
>  	 * Protects the buffer lists and internal buffer state.
>  	 */
> @@ -1094,6 +1107,10 @@ extern handle_t *jbd2__journal_start(journal_t *, int nblocks, gfp_t gfp_mask,
>  				     unsigned int type, unsigned int line_no);
>  extern int	 jbd2_journal_restart(handle_t *, int nblocks);
>  extern int	 jbd2__journal_restart(handle_t *, int nblocks, gfp_t gfp_mask);
> +extern handle_t *jbd2_journal_reserve(journal_t *, int nblocks,
> +				      unsigned int type, unsigned int line_no);
> +extern int	 jbd2_journal_start_reserved(handle_t *handle);
> +extern void	 jbd2_journal_free_reserved(handle_t *handle);
>  extern int	 jbd2_journal_extend (handle_t *, int nblocks);
>  extern int	 jbd2_journal_get_write_access(handle_t *, struct buffer_head *);
>  extern int	 jbd2_journal_get_create_access (handle_t *, struct buffer_head *);
> -- 
> 1.7.1
> 


* Re: [PATCH 13/29] ext4: Provide wrappers for transaction reservation calls
  2013-04-08 21:32 ` [PATCH 13/29] ext4: Provide wrappers for transaction reservation calls Jan Kara
@ 2013-05-05 11:51   ` Zheng Liu
  2013-05-05 11:58   ` Zheng Liu
  1 sibling, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-05 11:51 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:18PM +0200, Jan Kara wrote:
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/ext4/ext4_jbd2.c         |   71 +++++++++++++++++++++++++++++++++++++-----
>  fs/ext4/ext4_jbd2.h         |   13 ++++++++
>  include/trace/events/ext4.h |   20 +++++++++++-
>  3 files changed, 94 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index 7058975..b3e04bf 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -38,28 +38,40 @@ static void ext4_put_nojournal(handle_t *handle)
>  /*
>   * Wrappers for jbd2_journal_start/end.
>   */
> -handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
> -				  int type, int nblocks)
> +static int ext4_journal_check_start(struct super_block *sb)
>  {
>  	journal_t *journal;
>  
> -	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
>  	if (sb->s_flags & MS_RDONLY)
> -		return ERR_PTR(-EROFS);
> -
> +		return -EROFS;
>  	WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
>  	journal = EXT4_SB(sb)->s_journal;
> -	if (!journal)
> -		return ext4_get_nojournal();
>  	/*
>  	 * Special case here: if the journal has aborted behind our
>  	 * backs (eg. EIO in the commit thread), then we still need to
>  	 * take the FS itself readonly cleanly.
>  	 */
> -	if (is_journal_aborted(journal)) {
> +	if (journal && is_journal_aborted(journal)) {
>  		ext4_abort(sb, "Detected aborted journal");
> -		return ERR_PTR(-EROFS);
> +		return -EROFS;
>  	}
> +	return 0;
> +}
> +
> +handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
> +				  int type, int nblocks)
> +{
> +	journal_t *journal;
> +	int err;
> +
> +	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
> +	err = ext4_journal_check_start(sb);
> +	if (err < 0)
> +		return ERR_PTR(err);
> +
> +	journal = EXT4_SB(sb)->s_journal;
> +	if (!journal)
> +		return ext4_get_nojournal();
>  	return jbd2__journal_start(journal, nblocks, GFP_NOFS, type, line);
>  }
>  
> @@ -84,6 +96,47 @@ int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
>  	return err;
>  }
>  
> +handle_t *__ext4_journal_reserve(struct inode *inode, unsigned int line,
> +				 int type, int nblocks)
> +{
> +	struct super_block *sb = inode->i_sb;
> +	journal_t *journal;
> +	int err;
> +
> +	trace_ext4_journal_reserve(sb, nblocks, _RET_IP_);
> +	err = ext4_journal_check_start(sb);
> +	if (err < 0)
> +		return ERR_PTR(err);
> +
> +	journal = EXT4_SB(sb)->s_journal;
> +	if (!journal)
> +		return (handle_t *)1;	/* Hack to return !NULL */
> +	return jbd2_journal_reserve(journal, nblocks, type, line);
> +}
> +
> +handle_t *ext4_journal_start_reserved(handle_t *handle)
> +{
> +	struct super_block *sb;
> +	int err;
> +
> +	if (!ext4_handle_valid(handle))
> +		return ext4_get_nojournal();
> +
> +	sb = handle->h_journal->j_private;
> +	trace_ext4_journal_start_reserved(sb, handle->h_buffer_credits,
> +					  _RET_IP_);
> +	err = ext4_journal_check_start(sb);
> +	if (err < 0) {
> +		jbd2_journal_free_reserved(handle);
> +		return ERR_PTR(err);
> +	}
> +
> +	err = jbd2_journal_start_reserved(handle);
> +	if (err < 0)
> +		return ERR_PTR(err);
> +	return handle;
> +}
> +
>  void ext4_journal_abort_handle(const char *caller, unsigned int line,
>  			       const char *err_fn, struct buffer_head *bh,
>  			       handle_t *handle, int err)
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 4c216b1..bb17931 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -309,6 +309,19 @@ static inline handle_t *__ext4_journal_start(struct inode *inode,
>  #define ext4_journal_stop(handle) \
>  	__ext4_journal_stop(__func__, __LINE__, (handle))
>  
> +#define ext4_journal_reserve(inode, type, nblocks)			\
> +	__ext4_journal_reserve((inode), __LINE__, (type), (nblocks))
> +
> +handle_t *__ext4_journal_reserve(struct inode *inode, unsigned int line,
> +				 int type, int nblocks);
> +handle_t *ext4_journal_start_reserved(handle_t *handle);
> +
> +static inline void ext4_journal_free_reserved(handle_t *handle)
> +{
> +	if (ext4_handle_valid(handle))
> +		jbd2_journal_free_reserved(handle);
> +}
> +
>  static inline handle_t *ext4_journal_current_handle(void)
>  {
>  	return journal_current_handle();
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 4ee4710..a601bb3 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -1645,7 +1645,7 @@ TRACE_EVENT(ext4_load_inode,
>  		  (unsigned long) __entry->ino)
>  );
>  
> -TRACE_EVENT(ext4_journal_start,
> +DECLARE_EVENT_CLASS(ext4_journal_start_class,
>  	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
>  
>  	TP_ARGS(sb, nblocks, IP),
> @@ -1667,6 +1667,24 @@ TRACE_EVENT(ext4_journal_start,
>  		  __entry->nblocks, (void *)__entry->ip)
>  );
>  
> +DEFINE_EVENT(ext4_journal_start_class, ext4_journal_start,
> +	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
> +
> +	TP_ARGS(sb, nblocks, IP)
> +);
> +
> +DEFINE_EVENT(ext4_journal_start_class, ext4_journal_reserve,
> +	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
> +
> +	TP_ARGS(sb, nblocks, IP)
> +);
> +
> +DEFINE_EVENT(ext4_journal_start_class, ext4_journal_start_reserved,
> +	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
> +
> +	TP_ARGS(sb, nblocks, IP)
> +);
> +
>  DECLARE_EVENT_CLASS(ext4__trim,
>  	TP_PROTO(struct super_block *sb,
>  		 ext4_group_t group,
> -- 
> 1.7.1
> 


* Re: [PATCH 13/29] ext4: Provide wrappers for transaction reservation calls
  2013-04-08 21:32 ` [PATCH 13/29] ext4: Provide wrappers for transaction reservation calls Jan Kara
  2013-05-05 11:51   ` Zheng Liu
@ 2013-05-05 11:58   ` Zheng Liu
  2013-05-06 12:51     ` Jan Kara
  1 sibling, 1 reply; 76+ messages in thread
From: Zheng Liu @ 2013-05-05 11:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:18PM +0200, Jan Kara wrote:
> Signed-off-by: Jan Kara <jack@suse.cz>

Oops, forgot to say, this patch needs to be rebased

Regards,
                                                - Zheng

> ---
>  fs/ext4/ext4_jbd2.c         |   71 +++++++++++++++++++++++++++++++++++++-----
>  fs/ext4/ext4_jbd2.h         |   13 ++++++++
>  include/trace/events/ext4.h |   20 +++++++++++-
>  3 files changed, 94 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index 7058975..b3e04bf 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -38,28 +38,40 @@ static void ext4_put_nojournal(handle_t *handle)
>  /*
>   * Wrappers for jbd2_journal_start/end.
>   */
> -handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
> -				  int type, int nblocks)
> +static int ext4_journal_check_start(struct super_block *sb)
>  {
>  	journal_t *journal;
>  
> -	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
>  	if (sb->s_flags & MS_RDONLY)
> -		return ERR_PTR(-EROFS);
> -
> +		return -EROFS;
>  	WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
>  	journal = EXT4_SB(sb)->s_journal;
> -	if (!journal)
> -		return ext4_get_nojournal();
>  	/*
>  	 * Special case here: if the journal has aborted behind our
>  	 * backs (eg. EIO in the commit thread), then we still need to
>  	 * take the FS itself readonly cleanly.
>  	 */
> -	if (is_journal_aborted(journal)) {
> +	if (journal && is_journal_aborted(journal)) {
>  		ext4_abort(sb, "Detected aborted journal");
> -		return ERR_PTR(-EROFS);
> +		return -EROFS;
>  	}
> +	return 0;
> +}
> +
> +handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
> +				  int type, int nblocks)
> +{
> +	journal_t *journal;
> +	int err;
> +
> +	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
> +	err = ext4_journal_check_start(sb);
> +	if (err < 0)
> +		return ERR_PTR(err);
> +
> +	journal = EXT4_SB(sb)->s_journal;
> +	if (!journal)
> +		return ext4_get_nojournal();
>  	return jbd2__journal_start(journal, nblocks, GFP_NOFS, type, line);
>  }
>  
> @@ -84,6 +96,47 @@ int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
>  	return err;
>  }
>  
> +handle_t *__ext4_journal_reserve(struct inode *inode, unsigned int line,
> +				 int type, int nblocks)
> +{
> +	struct super_block *sb = inode->i_sb;
> +	journal_t *journal;
> +	int err;
> +
> +	trace_ext4_journal_reserve(sb, nblocks, _RET_IP_);
> +	err = ext4_journal_check_start(sb);
> +	if (err < 0)
> +		return ERR_PTR(err);
> +
> +	journal = EXT4_SB(sb)->s_journal;
> +	if (!journal)
> +		return (handle_t *)1;	/* Hack to return !NULL */
> +	return jbd2_journal_reserve(journal, nblocks, type, line);
> +}
> +
> +handle_t *ext4_journal_start_reserved(handle_t *handle)
> +{
> +	struct super_block *sb;
> +	int err;
> +
> +	if (!ext4_handle_valid(handle))
> +		return ext4_get_nojournal();
> +
> +	sb = handle->h_journal->j_private;
> +	trace_ext4_journal_start_reserved(sb, handle->h_buffer_credits,
> +					  _RET_IP_);
> +	err = ext4_journal_check_start(sb);
> +	if (err < 0) {
> +		jbd2_journal_free_reserved(handle);
> +		return ERR_PTR(err);
> +	}
> +
> +	err = jbd2_journal_start_reserved(handle);
> +	if (err < 0)
> +		return ERR_PTR(err);
> +	return handle;
> +}
> +
>  void ext4_journal_abort_handle(const char *caller, unsigned int line,
>  			       const char *err_fn, struct buffer_head *bh,
>  			       handle_t *handle, int err)
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 4c216b1..bb17931 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -309,6 +309,19 @@ static inline handle_t *__ext4_journal_start(struct inode *inode,
>  #define ext4_journal_stop(handle) \
>  	__ext4_journal_stop(__func__, __LINE__, (handle))
>  
> +#define ext4_journal_reserve(inode, type, nblocks)			\
> +	__ext4_journal_reserve((inode), __LINE__, (type), (nblocks))
> +
> +handle_t *__ext4_journal_reserve(struct inode *inode, unsigned int line,
> +				 int type, int nblocks);
> +handle_t *ext4_journal_start_reserved(handle_t *handle);
> +
> +static inline void ext4_journal_free_reserved(handle_t *handle)
> +{
> +	if (ext4_handle_valid(handle))
> +		jbd2_journal_free_reserved(handle);
> +}
> +
>  static inline handle_t *ext4_journal_current_handle(void)
>  {
>  	return journal_current_handle();
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 4ee4710..a601bb3 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -1645,7 +1645,7 @@ TRACE_EVENT(ext4_load_inode,
>  		  (unsigned long) __entry->ino)
>  );
>  
> -TRACE_EVENT(ext4_journal_start,
> +DECLARE_EVENT_CLASS(ext4_journal_start_class,
>  	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
>  
>  	TP_ARGS(sb, nblocks, IP),
> @@ -1667,6 +1667,24 @@ TRACE_EVENT(ext4_journal_start,
>  		  __entry->nblocks, (void *)__entry->ip)
>  );
>  
> +DEFINE_EVENT(ext4_journal_start_class, ext4_journal_start,
> +	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
> +
> +	TP_ARGS(sb, nblocks, IP)
> +);
> +
> +DEFINE_EVENT(ext4_journal_start_class, ext4_journal_reserve,
> +	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
> +
> +	TP_ARGS(sb, nblocks, IP)
> +);
> +
> +DEFINE_EVENT(ext4_journal_start_class, ext4_journal_start_reserved,
> +	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
> +
> +	TP_ARGS(sb, nblocks, IP)
> +);
> +
>  DECLARE_EVENT_CLASS(ext4__trim,
>  	TP_PROTO(struct super_block *sb,
>  		 ext4_group_t group,
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 14/29] ext4: Stop messing with nr_to_write in ext4_da_writepages()
  2013-04-08 21:32 ` [PATCH 14/29] ext4: Stop messing with nr_to_write in ext4_da_writepages() Jan Kara
@ 2013-05-05 12:40   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-05 12:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:19PM +0200, Jan Kara wrote:
> The writeback code got better in how it submits IO, and the number of
> pages requested to be written is now usually higher than the original 1024.
> The number is dynamically computed based on observed throughput and is
> set to be about 0.5 s worth of writeback. E.g. on an ordinary SATA drive
> this ends up somewhere around 10000 pages, as my testing shows. So remove
> the unnecessary smarts from ext4_da_writepages().
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

This patch needs to be rebased against the latest dev branch of the ext4 tree.
Otherwise the patch looks good to me.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

Regards,
                                                - Zheng

> ---
>  fs/ext4/inode.c |   96 -------------------------------------------------------
>  1 files changed, 0 insertions(+), 96 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ba07412..f4dc4a1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -423,66 +423,6 @@ static int __check_block_validity(struct inode *inode, const char *func,
>  	__check_block_validity((inode), __func__, __LINE__, (map))
>  
>  /*
> - * Return the number of contiguous dirty pages in a given inode
> - * starting at page frame idx.
> - */
> -static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
> -				    unsigned int max_pages)
> -{
> -	struct address_space *mapping = inode->i_mapping;
> -	pgoff_t	index;
> -	struct pagevec pvec;
> -	pgoff_t num = 0;
> -	int i, nr_pages, done = 0;
> -
> -	if (max_pages == 0)
> -		return 0;
> -	pagevec_init(&pvec, 0);
> -	while (!done) {
> -		index = idx;
> -		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> -					      PAGECACHE_TAG_DIRTY,
> -					      (pgoff_t)PAGEVEC_SIZE);
> -		if (nr_pages == 0)
> -			break;
> -		for (i = 0; i < nr_pages; i++) {
> -			struct page *page = pvec.pages[i];
> -			struct buffer_head *bh, *head;
> -
> -			lock_page(page);
> -			if (unlikely(page->mapping != mapping) ||
> -			    !PageDirty(page) ||
> -			    PageWriteback(page) ||
> -			    page->index != idx) {
> -				done = 1;
> -				unlock_page(page);
> -				break;
> -			}
> -			if (page_has_buffers(page)) {
> -				bh = head = page_buffers(page);
> -				do {
> -					if (!buffer_delay(bh) &&
> -					    !buffer_unwritten(bh))
> -						done = 1;
> -					bh = bh->b_this_page;
> -				} while (!done && (bh != head));
> -			}
> -			unlock_page(page);
> -			if (done)
> -				break;
> -			idx++;
> -			num++;
> -			if (num >= max_pages) {
> -				done = 1;
> -				break;
> -			}
> -		}
> -		pagevec_release(&pvec);
> -	}
> -	return num;
> -}
> -
> -/*
>   * The ext4_map_blocks() function tries to look up the requested blocks,
>   * and returns if the blocks are already mapped.
>   *
> @@ -2334,10 +2274,8 @@ static int ext4_da_writepages(struct address_space *mapping,
>  	struct mpage_da_data mpd;
>  	struct inode *inode = mapping->host;
>  	int pages_written = 0;
> -	unsigned int max_pages;
>  	int range_cyclic, cycled = 1, io_done = 0;
>  	int needed_blocks, ret = 0;
> -	long desired_nr_to_write, nr_to_writebump = 0;
>  	loff_t range_start = wbc->range_start;
>  	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
>  	pgoff_t done_index = 0;
> @@ -2384,39 +2322,6 @@ static int ext4_da_writepages(struct address_space *mapping,
>  		end = wbc->range_end >> PAGE_CACHE_SHIFT;
>  	}
>  
> -	/*
> -	 * This works around two forms of stupidity.  The first is in
> -	 * the writeback code, which caps the maximum number of pages
> -	 * written to be 1024 pages.  This is wrong on multiple
> -	 * levels; different architectues have a different page size,
> -	 * which changes the maximum amount of data which gets
> -	 * written.  Secondly, 4 megabytes is way too small.  XFS
> -	 * forces this value to be 16 megabytes by multiplying
> -	 * nr_to_write parameter by four, and then relies on its
> -	 * allocator to allocate larger extents to make them
> -	 * contiguous.  Unfortunately this brings us to the second
> -	 * stupidity, which is that ext4's mballoc code only allocates
> -	 * at most 2048 blocks.  So we force contiguous writes up to
> -	 * the number of dirty blocks in the inode, or
> -	 * sbi->max_writeback_mb_bump whichever is smaller.
> -	 */
> -	max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
> -	if (!range_cyclic && range_whole) {
> -		if (wbc->nr_to_write == LONG_MAX)
> -			desired_nr_to_write = wbc->nr_to_write;
> -		else
> -			desired_nr_to_write = wbc->nr_to_write * 8;
> -	} else
> -		desired_nr_to_write = ext4_num_dirty_pages(inode, index,
> -							   max_pages);
> -	if (desired_nr_to_write > max_pages)
> -		desired_nr_to_write = max_pages;
> -
> -	if (wbc->nr_to_write < desired_nr_to_write) {
> -		nr_to_writebump = desired_nr_to_write - wbc->nr_to_write;
> -		wbc->nr_to_write = desired_nr_to_write;
> -	}
> -
>  retry:
>  	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
>  		tag_pages_for_writeback(mapping, index, end);
> @@ -2509,7 +2414,6 @@ retry:
>  		mapping->writeback_index = done_index;
>  
>  out_writepages:
> -	wbc->nr_to_write -= nr_to_writebump;
>  	wbc->range_start = range_start;
>  	trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
>  	return ret;
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 15/29] ext4: Deprecate max_writeback_mb_bump sysfs attribute
  2013-04-08 21:32 ` [PATCH 15/29] ext4: Deprecate max_writeback_mb_bump sysfs attribute Jan Kara
@ 2013-05-05 12:47   ` Zheng Liu
  2013-05-06 12:55     ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Zheng Liu @ 2013-05-05 12:47 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:20PM +0200, Jan Kara wrote:
> This attribute is now unused, so deprecate it. We still show the old
> default value to keep some compatibility, but we no longer allow writing
> to that attribute.

I think we can remove it completely.  IMHO, if an application tries to
set this value, it will get an error anyway, because the attribute can no
longer be set.

Regards,
                                                - Zheng

> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/ext4.h  |    1 -
>  fs/ext4/super.c |   30 ++++++++++++++++++++++++------
>  2 files changed, 24 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index edf9b9e..3575fdb 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1234,7 +1234,6 @@ struct ext4_sb_info {
>  	unsigned int s_mb_stats;
>  	unsigned int s_mb_order2_reqs;
>  	unsigned int s_mb_group_prealloc;
> -	unsigned int s_max_writeback_mb_bump;
>  	unsigned int s_max_dir_size_kb;
>  	/* where last allocation was done - for stream allocation */
>  	unsigned long s_mb_last_group;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 34e8552..09ff724 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2374,7 +2374,10 @@ struct ext4_attr {
>  	ssize_t (*show)(struct ext4_attr *, struct ext4_sb_info *, char *);
>  	ssize_t (*store)(struct ext4_attr *, struct ext4_sb_info *,
>  			 const char *, size_t);
> -	int offset;
> +	union {
> +		int offset;
> +		int deprecated_val;
> +	} u;
>  };
>  
>  static int parse_strtoul(const char *buf,
> @@ -2443,7 +2446,7 @@ static ssize_t inode_readahead_blks_store(struct ext4_attr *a,
>  static ssize_t sbi_ui_show(struct ext4_attr *a,
>  			   struct ext4_sb_info *sbi, char *buf)
>  {
> -	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->offset);
> +	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->u.offset);
>  
>  	return snprintf(buf, PAGE_SIZE, "%u\n", *ui);
>  }
> @@ -2452,7 +2455,7 @@ static ssize_t sbi_ui_store(struct ext4_attr *a,
>  			    struct ext4_sb_info *sbi,
>  			    const char *buf, size_t count)
>  {
> -	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->offset);
> +	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->u.offset);
>  	unsigned long t;
>  
>  	if (parse_strtoul(buf, 0xffffffff, &t))
> @@ -2478,12 +2481,20 @@ static ssize_t trigger_test_error(struct ext4_attr *a,
>  	return count;
>  }
>  
> +static ssize_t sbi_deprecated_show(struct ext4_attr *a,
> +				   struct ext4_sb_info *sbi, char *buf)
> +{
> +	return snprintf(buf, PAGE_SIZE, "%d\n", a->u.deprecated_val);
> +}
> +
>  #define EXT4_ATTR_OFFSET(_name,_mode,_show,_store,_elname) \
>  static struct ext4_attr ext4_attr_##_name = {			\
>  	.attr = {.name = __stringify(_name), .mode = _mode },	\
>  	.show	= _show,					\
>  	.store	= _store,					\
> -	.offset = offsetof(struct ext4_sb_info, _elname),	\
> +	.u = {							\
> +		.offset = offsetof(struct ext4_sb_info, _elname),\
> +	},							\
>  }
>  #define EXT4_ATTR(name, mode, show, store) \
>  static struct ext4_attr ext4_attr_##name = __ATTR(name, mode, show, store)
> @@ -2494,6 +2505,14 @@ static struct ext4_attr ext4_attr_##name = __ATTR(name, mode, show, store)
>  #define EXT4_RW_ATTR_SBI_UI(name, elname)	\
>  	EXT4_ATTR_OFFSET(name, 0644, sbi_ui_show, sbi_ui_store, elname)
>  #define ATTR_LIST(name) &ext4_attr_##name.attr
> +#define EXT4_DEPRECATED_ATTR(_name, _val)	\
> +static struct ext4_attr ext4_attr_##_name = {			\
> +	.attr = {.name = __stringify(_name), .mode = 0444 },	\
> +	.show	= sbi_deprecated_show,				\
> +	.u = {							\
> +		.deprecated_val = _val,				\
> +	},							\
> +}
>  
>  EXT4_RO_ATTR(delayed_allocation_blocks);
>  EXT4_RO_ATTR(session_write_kbytes);
> @@ -2507,7 +2526,7 @@ EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
>  EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
>  EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
>  EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
> -EXT4_RW_ATTR_SBI_UI(max_writeback_mb_bump, s_max_writeback_mb_bump);
> +EXT4_DEPRECATED_ATTR(max_writeback_mb_bump, 128);
>  EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
>  EXT4_ATTR(trigger_fs_error, 0200, NULL, trigger_test_error);
>  
> @@ -3718,7 +3737,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  	}
>  
>  	sbi->s_stripe = ext4_get_stripe_size(sbi);
> -	sbi->s_max_writeback_mb_bump = 128;
>  	sbi->s_extent_max_zeroout_kb = 32;
>  
>  	/* Register extent status tree shrinker */
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/29] jbd2: Transaction reservation support
  2013-05-05  9:39   ` Zheng Liu
@ 2013-05-06 12:49     ` Jan Kara
  2013-05-07  5:22       ` Zheng Liu
  0 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-05-06 12:49 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Ted Tso, linux-ext4

On Sun 05-05-13 17:39:39, Zheng Liu wrote:
> On Mon, Apr 08, 2013 at 11:32:17PM +0200, Jan Kara wrote:
> > In some cases we cannot start a transaction because of locking constraints, and
> > passing a started transaction into those places is not handy either because we
> > could block the transaction commit for too long. Transaction reservation is
> > designed to solve these issues. It reserves a handle with a given number of
> > credits in the journal, and the handle can later be attached to the running
> > transaction without blocking on commit or checkpointing. Reserved handles do
> > not block transaction commit in any way; they only reduce the maximum size of
> > the running transaction (because we always have to be prepared to accommodate
> > a request for attaching a reserved handle).
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Some minor nits below.  Otherwise the patch looks good to me.
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
...
> 
> > +/*
> >   * start_this_handle: Given a handle, deal with any locking or stalling
> >   * needed to make sure that there is enough journal space for the handle
> >   * to begin.  Attach the handle to a transaction and set up the
> > @@ -151,12 +237,14 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
> >  			     gfp_t gfp_mask)
> >  {
> >  	transaction_t	*transaction, *new_transaction = NULL;
> > -	tid_t		tid;
> > -	int		needed, need_to_start;
> >  	int		nblocks = handle->h_buffer_credits;
> >  	unsigned long ts = jiffies;
> >  
> > -	if (nblocks > journal->j_max_transaction_buffers) {
> > +	/*
> > +	 * 1/2 of transaction can be reserved so we can practically handle
> > +	 * only 1/2 of maximum transaction size per operation
> > +	 */
> 
> Sorry, but I don't understand why we only allow reserving 1/2 of the
> maximum transaction size.
  Well, we allow at most 1/2 of the maximum transaction size to be allocated
in already reserved handles. So if someone submitted a request for a handle
with more than 1/2 of the maximum transaction size, we might have to wait
for reserved handles to be freed. That would be a slight complication in the
code, and it could also introduce livelock issues: after a reserved handle
is freed, someone can reserve a new one before the large handle request is
satisfied. Again, this can be solved, but the complications simply don't
seem to be worth it.

> > +	if (nblocks > journal->j_max_transaction_buffers / 2) {
> >  		printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
> >  		       current->comm, nblocks,
> >  		       journal->j_max_transaction_buffers);
> > @@ -223,75 +311,18 @@ repeat:
> >  
> >  	transaction = journal->j_running_transaction;
> >  
> > -	/*
> > -	 * If the current transaction is locked down for commit, wait for the
> > -	 * lock to be released.
> > -	 */
> > -	if (transaction->t_state == T_LOCKED) {
> > -		DEFINE_WAIT(wait);
> > -
> > -		prepare_to_wait(&journal->j_wait_transaction_locked,
> > -					&wait, TASK_UNINTERRUPTIBLE);
> > -		read_unlock(&journal->j_state_lock);
> > -		schedule();
> > -		finish_wait(&journal->j_wait_transaction_locked, &wait);
> > -		goto repeat;
> > -	}
> > -
> > -	/*
> > -	 * If there is not enough space left in the log to write all potential
> > -	 * buffers requested by this operation, we need to stall pending a log
> > -	 * checkpoint to free some more log space.
> > -	 */
> > -	needed = atomic_add_return(nblocks,
> > -				   &transaction->t_outstanding_credits);
> > -
> > -	if (needed > journal->j_max_transaction_buffers) {
> > +	if (!handle->h_reserved) {
> 
> Maybe we need to add a comment here because we release j_state_lock in
> add_transaction_credits.
  OK, I've added a comment regarding that. Thanks for the review!

								Honza

> > +		if (add_transaction_credits(journal, handle))
> > +			goto repeat;
> > +	} else {
> >  		/*
> > -		 * If the current transaction is already too large, then start
> > -		 * to commit it: we can then go back and attach this handle to
> > -		 * a new transaction.
> > +		 * We have handle reserved so we are allowed to join T_LOCKED
> > +		 * transaction and we don't have to check for transaction size
> > +		 * and journal space.
> >  		 */
> > -		DEFINE_WAIT(wait);
> > -
> > -		jbd_debug(2, "Handle %p starting new commit...\n", handle);
> > -		atomic_sub(nblocks, &transaction->t_outstanding_credits);
> > -		prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
> > -				TASK_UNINTERRUPTIBLE);
> > -		tid = transaction->t_tid;
> > -		need_to_start = !tid_geq(journal->j_commit_request, tid);
> > -		read_unlock(&journal->j_state_lock);
> > -		if (need_to_start)
> > -			jbd2_log_start_commit(journal, tid);
> > -		schedule();
> > -		finish_wait(&journal->j_wait_transaction_locked, &wait);
> > -		goto repeat;
> > -	}
> > -
> > -	/*
> > -	 * The commit code assumes that it can get enough log space
> > -	 * without forcing a checkpoint.  This is *critical* for
> > -	 * correctness: a checkpoint of a buffer which is also
> > -	 * associated with a committing transaction creates a deadlock,
> > -	 * so commit simply cannot force through checkpoints.
> > -	 *
> > -	 * We must therefore ensure the necessary space in the journal
> > -	 * *before* starting to dirty potentially checkpointed buffers
> > -	 * in the new transaction.
> > -	 *
> > -	 * The worst part is, any transaction currently committing can
> > -	 * reduce the free space arbitrarily.  Be careful to account for
> > -	 * those buffers when checkpointing.
> > -	 */
> > -	if (jbd2_log_space_left(journal) < jbd2_space_needed(journal)) {
> > -		jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle);
> > -		atomic_sub(nblocks, &transaction->t_outstanding_credits);
> > -		read_unlock(&journal->j_state_lock);
> > -		write_lock(&journal->j_state_lock);
> > -		if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
> > -			__jbd2_log_wait_for_space(journal);
> > -		write_unlock(&journal->j_state_lock);
> > -		goto repeat;
> > +		atomic_sub(nblocks, &journal->j_reserved_credits);
> > +		wake_up(&journal->j_wait_reserved);
> > +		handle->h_reserved = 0;
> >  	}
> >  
> >  	/* OK, account for the buffers that this operation expects to
> > @@ -390,6 +421,122 @@ handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
> >  }
> >  EXPORT_SYMBOL(jbd2_journal_start);
> >  
> > +/**
> > + * handle_t *jbd2_journal_reserve(journal_t *journal, int nblocks)
> > + * @journal: journal to reserve transaction on.
> > + * @nblocks: number of blocks we might modify
> > + *
> > + * This function reserves transaction with @nblocks blocks in @journal.  The
> > + * function waits for enough journal space to be available and possibly also
> > + * for some reservations to be converted to real transactions if there are too
> > + * many of them. Note that this means that calling this function while having
> > + * another transaction started or reserved can cause deadlock. The returned
> > + * handle cannot be used for anything until it is started using
> > + * jbd2_journal_start_reserved().
> > + */
> > +handle_t *jbd2_journal_reserve(journal_t *journal, int nblocks,
> > +			       unsigned int type, unsigned int line_no)
> > +{
> > +	handle_t *handle;
> > +	unsigned long wanted;
> > +
> > +	handle = new_handle(nblocks);
> > +	if (!handle)
> > +		return ERR_PTR(-ENOMEM);
> > +	handle->h_journal = journal;
> > +	handle->h_reserved = 1;
> > +	handle->h_type = type;
> > +	handle->h_line_no = line_no;
> > +
> > +repeat:
> > +	/*
> > +	 * We need j_state_lock early to prevent transaction creation from
> > +	 * racing with us and using elevated j_reserved_credits.
> > +	 */
> > +	read_lock(&journal->j_state_lock);
> > +	wanted = atomic_add_return(nblocks, &journal->j_reserved_credits);
> > +	/* We allow at most half of a transaction to be reserved */
> > +	if (wanted > journal->j_max_transaction_buffers / 2) {
> > +		atomic_sub(nblocks, &journal->j_reserved_credits);
> > +		read_unlock(&journal->j_state_lock);
> > +		wait_event(journal->j_wait_reserved,
> > +			   atomic_read(&journal->j_reserved_credits) + nblocks
> > +			   <= journal->j_max_transaction_buffers / 2);
> > +		goto repeat;
> > +	}
> > +	if (journal->j_running_transaction) {
> > +		transaction_t *t = journal->j_running_transaction;
> > +
> > +		wanted = atomic_add_return(nblocks,
> > +					   &t->t_outstanding_credits);
> > +		if (wanted > journal->j_max_transaction_buffers) {
> > +			atomic_sub(nblocks, &t->t_outstanding_credits);
> > +			atomic_sub(nblocks, &journal->j_reserved_credits);
> > +			wait_transaction_locked(journal);
> > +			goto repeat;
> > +		}
> > +	}
> > +	read_unlock(&journal->j_state_lock);
> > +
> > +	return handle;
> > +}
> > +EXPORT_SYMBOL(jbd2_journal_reserve);
> > +
> > +void jbd2_journal_free_reserved(handle_t *handle)
> > +{
> > +	journal_t *journal = handle->h_journal;
> > +
> > +	atomic_sub(handle->h_buffer_credits, &journal->j_reserved_credits);
> > +	wake_up(&journal->j_wait_reserved);
> > +	jbd2_free_handle(handle);
> > +}
> > +EXPORT_SYMBOL(jbd2_journal_free_reserved);
> > +
> > +/**
> > + * int jbd2_journal_start_reserved(handle_t *handle) - start reserved handle
> > + * @handle: handle to start
> > + *
> > + * Start handle that has been previously reserved with jbd2_journal_reserve().
> > + * This attaches @handle to the running transaction (or creates one if there's
> > + * no transaction running). Unlike jbd2_journal_start() this function cannot
> > + * block on journal commit, checkpointing, or similar stuff. It can block on
> > + * memory allocation or frozen journal though.
> > + *
> > + * Return 0 on success, non-zero on error - handle is freed in that case.
> > + */
> > +int jbd2_journal_start_reserved(handle_t *handle)
> > +{
> > +	journal_t *journal = handle->h_journal;
> > +	int ret = -EIO;
> > +
> > +	if (WARN_ON(!handle->h_reserved)) {
> > +		/* Someone passed in normal handle? Just stop it. */
> > +		jbd2_journal_stop(handle);
> > +		return ret;
> > +	}
> > +	/*
> > +	 * The usefulness of mixing reserved and unreserved handles is
> > +	 * questionable. So far nobody seems to need it, so just error out.
> > +	 */
> > +	if (WARN_ON(current->journal_info)) {
> > +		jbd2_journal_free_reserved(handle);
> > +		return ret;
> > +	}
> > +
> > +	handle->h_journal = NULL;
> > +	current->journal_info = handle;
> > +	/*
> > +	 * GFP_NOFS is here because callers are likely from writeback or
> > +	 * similarly constrained call sites
> > +	 */
> > +	ret = start_this_handle(journal, handle, GFP_NOFS);
> > +	if (ret < 0) {
> > +		current->journal_info = NULL;
> > +		jbd2_journal_free_reserved(handle);
> > +	}
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(jbd2_journal_start_reserved);
> >  
> >  /**
> >   * int jbd2_journal_extend() - extend buffer credits.
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index ad4b3bb..b3c1283 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -410,8 +410,12 @@ struct jbd2_revoke_table_s;
> >  
> >  struct jbd2_journal_handle
> >  {
> > -	/* Which compound transaction is this update a part of? */
> > -	transaction_t		*h_transaction;
> > +	union {
> > +		/* Which compound transaction is this update a part of? */
> > +		transaction_t	*h_transaction;
> > +		/* Which journal the handle belongs to - used iff h_reserved set */
> > +		journal_t	*h_journal;
> > +	};
> >  
> >  	/* Number of remaining buffers we are allowed to dirty: */
> >  	int			h_buffer_credits;
> > @@ -426,6 +430,7 @@ struct jbd2_journal_handle
> >  	/* Flags [no locking] */
> >  	unsigned int	h_sync:		1;	/* sync-on-close */
> >  	unsigned int	h_jdata:	1;	/* force data journaling */
> > +	unsigned int	h_reserved:	1;	/* handle with reserved credits */
> >  	unsigned int	h_aborted:	1;	/* fatal error on handle */
> >  	unsigned int	h_type:		8;	/* for handle statistics */
> >  	unsigned int	h_line_no:	16;	/* for handle statistics */
> > @@ -689,6 +694,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
> >   * @j_wait_done_commit: Wait queue for waiting for commit to complete
> >   * @j_wait_commit: Wait queue to trigger commit
> >   * @j_wait_updates: Wait queue to wait for updates to complete
> > + * @j_wait_reserved: Wait queue to wait for reserved buffer credits to drop
> >   * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
> >   * @j_head: Journal head - identifies the first unused block in the journal
> >   * @j_tail: Journal tail - identifies the oldest still-used block in the
> > @@ -702,6 +708,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
> >   *     journal
> >   * @j_fs_dev: Device which holds the client fs.  For internal journal this will
> >   *     be equal to j_dev
> > + * @j_reserved_credits: Number of buffers reserved from the running transaction
> >   * @j_maxlen: Total maximum capacity of the journal region on disk.
> >   * @j_list_lock: Protects the buffer lists and internal buffer state.
> >   * @j_inode: Optional inode where we store the journal.  If present, all journal
> > @@ -800,6 +807,9 @@ struct journal_s
> >  	/* Wait queue to wait for updates to complete */
> >  	wait_queue_head_t	j_wait_updates;
> >  
> > +	/* Wait queue to wait for reserved buffer credits to drop */
> > +	wait_queue_head_t	j_wait_reserved;
> > +
> >  	/* Semaphore for locking against concurrent checkpoints */
> >  	struct mutex		j_checkpoint_mutex;
> >  
> > @@ -854,6 +864,9 @@ struct journal_s
> >  	/* Total maximum capacity of the journal region on disk. */
> >  	unsigned int		j_maxlen;
> >  
> > +	/* Number of buffers reserved from the running transaction */
> > +	atomic_t		j_reserved_credits;
> > +
> >  	/*
> >  	 * Protects the buffer lists and internal buffer state.
> >  	 */
> > @@ -1094,6 +1107,10 @@ extern handle_t *jbd2__journal_start(journal_t *, int nblocks, gfp_t gfp_mask,
> >  				     unsigned int type, unsigned int line_no);
> >  extern int	 jbd2_journal_restart(handle_t *, int nblocks);
> >  extern int	 jbd2__journal_restart(handle_t *, int nblocks, gfp_t gfp_mask);
> > +extern handle_t *jbd2_journal_reserve(journal_t *, int nblocks,
> > +				      unsigned int type, unsigned int line_no);
> > +extern int	 jbd2_journal_start_reserved(handle_t *handle);
> > +extern void	 jbd2_journal_free_reserved(handle_t *handle);
> >  extern int	 jbd2_journal_extend (handle_t *, int nblocks);
> >  extern int	 jbd2_journal_get_write_access(handle_t *, struct buffer_head *);
> >  extern int	 jbd2_journal_get_create_access (handle_t *, struct buffer_head *);
> > -- 
> > 1.7.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/29] ext4: Provide wrappers for transaction reservation calls
  2013-05-05 11:58   ` Zheng Liu
@ 2013-05-06 12:51     ` Jan Kara
  0 siblings, 0 replies; 76+ messages in thread
From: Jan Kara @ 2013-05-06 12:51 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Ted Tso, linux-ext4

On Sun 05-05-13 19:58:55, Zheng Liu wrote:
> On Mon, Apr 08, 2013 at 11:32:18PM +0200, Jan Kara wrote:
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Oops, forgot to say, this patch needs to be rebased
  Yeah, I'll take Ted's patch queue + current Linus' kernel and rebase all
the patch series on top of that this week.

								Honza

> > ---
> >  fs/ext4/ext4_jbd2.c         |   71 +++++++++++++++++++++++++++++++++++++-----
> >  fs/ext4/ext4_jbd2.h         |   13 ++++++++
> >  include/trace/events/ext4.h |   20 +++++++++++-
> >  3 files changed, 94 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> > index 7058975..b3e04bf 100644
> > --- a/fs/ext4/ext4_jbd2.c
> > +++ b/fs/ext4/ext4_jbd2.c
> > @@ -38,28 +38,40 @@ static void ext4_put_nojournal(handle_t *handle)
> >  /*
> >   * Wrappers for jbd2_journal_start/end.
> >   */
> > -handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
> > -				  int type, int nblocks)
> > +static int ext4_journal_check_start(struct super_block *sb)
> >  {
> >  	journal_t *journal;
> >  
> > -	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
> >  	if (sb->s_flags & MS_RDONLY)
> > -		return ERR_PTR(-EROFS);
> > -
> > +		return -EROFS;
> >  	WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
> >  	journal = EXT4_SB(sb)->s_journal;
> > -	if (!journal)
> > -		return ext4_get_nojournal();
> >  	/*
> >  	 * Special case here: if the journal has aborted behind our
> >  	 * backs (eg. EIO in the commit thread), then we still need to
> >  	 * take the FS itself readonly cleanly.
> >  	 */
> > -	if (is_journal_aborted(journal)) {
> > +	if (journal && is_journal_aborted(journal)) {
> >  		ext4_abort(sb, "Detected aborted journal");
> > -		return ERR_PTR(-EROFS);
> > +		return -EROFS;
> >  	}
> > +	return 0;
> > +}
> > +
> > +handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
> > +				  int type, int nblocks)
> > +{
> > +	journal_t *journal;
> > +	int err;
> > +
> > +	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
> > +	err = ext4_journal_check_start(sb);
> > +	if (err < 0)
> > +		return ERR_PTR(err);
> > +
> > +	journal = EXT4_SB(sb)->s_journal;
> > +	if (!journal)
> > +		return ext4_get_nojournal();
> >  	return jbd2__journal_start(journal, nblocks, GFP_NOFS, type, line);
> >  }
> >  
> > @@ -84,6 +96,47 @@ int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
> >  	return err;
> >  }
> >  
> > +handle_t *__ext4_journal_reserve(struct inode *inode, unsigned int line,
> > +				 int type, int nblocks)
> > +{
> > +	struct super_block *sb = inode->i_sb;
> > +	journal_t *journal;
> > +	int err;
> > +
> > +	trace_ext4_journal_reserve(sb, nblocks, _RET_IP_);
> > +	err = ext4_journal_check_start(sb);
> > +	if (err < 0)
> > +		return ERR_PTR(err);
> > +
> > +	journal = EXT4_SB(sb)->s_journal;
> > +	if (!journal)
> > +		return (handle_t *)1;	/* Hack to return !NULL */
> > +	return jbd2_journal_reserve(journal, nblocks, type, line);
> > +}
> > +
> > +handle_t *ext4_journal_start_reserved(handle_t *handle)
> > +{
> > +	struct super_block *sb;
> > +	int err;
> > +
> > +	if (!ext4_handle_valid(handle))
> > +		return ext4_get_nojournal();
> > +
> > +	sb = handle->h_journal->j_private;
> > +	trace_ext4_journal_start_reserved(sb, handle->h_buffer_credits,
> > +					  _RET_IP_);
> > +	err = ext4_journal_check_start(sb);
> > +	if (err < 0) {
> > +		jbd2_journal_free_reserved(handle);
> > +		return ERR_PTR(err);
> > +	}
> > +
> > +	err = jbd2_journal_start_reserved(handle);
> > +	if (err < 0)
> > +		return ERR_PTR(err);
> > +	return handle;
> > +}
> > +
> >  void ext4_journal_abort_handle(const char *caller, unsigned int line,
> >  			       const char *err_fn, struct buffer_head *bh,
> >  			       handle_t *handle, int err)
> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > index 4c216b1..bb17931 100644
> > --- a/fs/ext4/ext4_jbd2.h
> > +++ b/fs/ext4/ext4_jbd2.h
> > @@ -309,6 +309,19 @@ static inline handle_t *__ext4_journal_start(struct inode *inode,
> >  #define ext4_journal_stop(handle) \
> >  	__ext4_journal_stop(__func__, __LINE__, (handle))
> >  
> > +#define ext4_journal_reserve(inode, type, nblocks)			\
> > +	__ext4_journal_reserve((inode), __LINE__, (type), (nblocks))
> > +
> > +handle_t *__ext4_journal_reserve(struct inode *inode, unsigned int line,
> > +				 int type, int nblocks);
> > +handle_t *ext4_journal_start_reserved(handle_t *handle);
> > +
> > +static inline void ext4_journal_free_reserved(handle_t *handle)
> > +{
> > +	if (ext4_handle_valid(handle))
> > +		jbd2_journal_free_reserved(handle);
> > +}
> > +
> >  static inline handle_t *ext4_journal_current_handle(void)
> >  {
> >  	return journal_current_handle();
> > diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> > index 4ee4710..a601bb3 100644
> > --- a/include/trace/events/ext4.h
> > +++ b/include/trace/events/ext4.h
> > @@ -1645,7 +1645,7 @@ TRACE_EVENT(ext4_load_inode,
> >  		  (unsigned long) __entry->ino)
> >  );
> >  
> > -TRACE_EVENT(ext4_journal_start,
> > +DECLARE_EVENT_CLASS(ext4_journal_start_class,
> >  	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
> >  
> >  	TP_ARGS(sb, nblocks, IP),
> > @@ -1667,6 +1667,24 @@ TRACE_EVENT(ext4_journal_start,
> >  		  __entry->nblocks, (void *)__entry->ip)
> >  );
> >  
> > +DEFINE_EVENT(ext4_journal_start_class, ext4_journal_start,
> > +	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
> > +
> > +	TP_ARGS(sb, nblocks, IP)
> > +);
> > +
> > +DEFINE_EVENT(ext4_journal_start_class, ext4_journal_reserve,
> > +	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
> > +
> > +	TP_ARGS(sb, nblocks, IP)
> > +);
> > +
> > +DEFINE_EVENT(ext4_journal_start_class, ext4_journal_start_reserved,
> > +	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
> > +
> > +	TP_ARGS(sb, nblocks, IP)
> > +);
> > +
> >  DECLARE_EVENT_CLASS(ext4__trim,
> >  	TP_PROTO(struct super_block *sb,
> >  		 ext4_group_t group,
> > -- 
> > 1.7.1
> > 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 15/29] ext4: Deprecate max_writeback_mb_bump sysfs attribute
  2013-05-05 12:47   ` Zheng Liu
@ 2013-05-06 12:55     ` Jan Kara
  0 siblings, 0 replies; 76+ messages in thread
From: Jan Kara @ 2013-05-06 12:55 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Ted Tso, linux-ext4

On Sun 05-05-13 20:47:19, Zheng Liu wrote:
> On Mon, Apr 08, 2013 at 11:32:20PM +0200, Jan Kara wrote:
> > This attribute is now unused so deprecate it. We still show the old
> > default value to keep some compatibility but we don't allow writing to
> > that attribute anymore.
> 
> I think we can remove it completely.  IMHO, if an application tries to
> set this value, it will get an error because now it can't set it.
  We tend to be more careful when removing userspace interfaces. So I kept
the sysfs file there and just made it do nothing. We can remove it after a
couple of kernel releases... I certainly wouldn't mind if the file was
removed right away but I'll leave that decision up to Ted.

								Honza

> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/ext4.h  |    1 -
> >  fs/ext4/super.c |   30 ++++++++++++++++++++++++------
> >  2 files changed, 24 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index edf9b9e..3575fdb 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1234,7 +1234,6 @@ struct ext4_sb_info {
> >  	unsigned int s_mb_stats;
> >  	unsigned int s_mb_order2_reqs;
> >  	unsigned int s_mb_group_prealloc;
> > -	unsigned int s_max_writeback_mb_bump;
> >  	unsigned int s_max_dir_size_kb;
> >  	/* where last allocation was done - for stream allocation */
> >  	unsigned long s_mb_last_group;
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 34e8552..09ff724 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -2374,7 +2374,10 @@ struct ext4_attr {
> >  	ssize_t (*show)(struct ext4_attr *, struct ext4_sb_info *, char *);
> >  	ssize_t (*store)(struct ext4_attr *, struct ext4_sb_info *,
> >  			 const char *, size_t);
> > -	int offset;
> > +	union {
> > +		int offset;
> > +		int deprecated_val;
> > +	} u;
> >  };
> >  
> >  static int parse_strtoul(const char *buf,
> > @@ -2443,7 +2446,7 @@ static ssize_t inode_readahead_blks_store(struct ext4_attr *a,
> >  static ssize_t sbi_ui_show(struct ext4_attr *a,
> >  			   struct ext4_sb_info *sbi, char *buf)
> >  {
> > -	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->offset);
> > +	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->u.offset);
> >  
> >  	return snprintf(buf, PAGE_SIZE, "%u\n", *ui);
> >  }
> > @@ -2452,7 +2455,7 @@ static ssize_t sbi_ui_store(struct ext4_attr *a,
> >  			    struct ext4_sb_info *sbi,
> >  			    const char *buf, size_t count)
> >  {
> > -	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->offset);
> > +	unsigned int *ui = (unsigned int *) (((char *) sbi) + a->u.offset);
> >  	unsigned long t;
> >  
> >  	if (parse_strtoul(buf, 0xffffffff, &t))
> > @@ -2478,12 +2481,20 @@ static ssize_t trigger_test_error(struct ext4_attr *a,
> >  	return count;
> >  }
> >  
> > +static ssize_t sbi_deprecated_show(struct ext4_attr *a,
> > +				   struct ext4_sb_info *sbi, char *buf)
> > +{
> > +	return snprintf(buf, PAGE_SIZE, "%d\n", a->u.deprecated_val);
> > +}
> > +
> >  #define EXT4_ATTR_OFFSET(_name,_mode,_show,_store,_elname) \
> >  static struct ext4_attr ext4_attr_##_name = {			\
> >  	.attr = {.name = __stringify(_name), .mode = _mode },	\
> >  	.show	= _show,					\
> >  	.store	= _store,					\
> > -	.offset = offsetof(struct ext4_sb_info, _elname),	\
> > +	.u = {							\
> > +		.offset = offsetof(struct ext4_sb_info, _elname),\
> > +	},							\
> >  }
> >  #define EXT4_ATTR(name, mode, show, store) \
> >  static struct ext4_attr ext4_attr_##name = __ATTR(name, mode, show, store)
> > @@ -2494,6 +2505,14 @@ static struct ext4_attr ext4_attr_##name = __ATTR(name, mode, show, store)
> >  #define EXT4_RW_ATTR_SBI_UI(name, elname)	\
> >  	EXT4_ATTR_OFFSET(name, 0644, sbi_ui_show, sbi_ui_store, elname)
> >  #define ATTR_LIST(name) &ext4_attr_##name.attr
> > +#define EXT4_DEPRECATED_ATTR(_name, _val)	\
> > +static struct ext4_attr ext4_attr_##_name = {			\
> > +	.attr = {.name = __stringify(_name), .mode = 0444 },	\
> > +	.show	= sbi_deprecated_show,				\
> > +	.u = {							\
> > +		.deprecated_val = _val,				\
> > +	},							\
> > +}
> >  
> >  EXT4_RO_ATTR(delayed_allocation_blocks);
> >  EXT4_RO_ATTR(session_write_kbytes);
> > @@ -2507,7 +2526,7 @@ EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
> >  EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
> >  EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
> >  EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
> > -EXT4_RW_ATTR_SBI_UI(max_writeback_mb_bump, s_max_writeback_mb_bump);
> > +EXT4_DEPRECATED_ATTR(max_writeback_mb_bump, 128);
> >  EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
> >  EXT4_ATTR(trigger_fs_error, 0200, NULL, trigger_test_error);
> >  
> > @@ -3718,7 +3737,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> >  	}
> >  
> >  	sbi->s_stripe = ext4_get_stripe_size(sbi);
> > -	sbi->s_max_writeback_mb_bump = 128;
> >  	sbi->s_extent_max_zeroout_kb = 32;
> >  
> >  	/* Register extent status tree shrinker */
> > -- 
> > 1.7.1
> > 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 12/29] jbd2: Transaction reservation support
  2013-05-06 12:49     ` Jan Kara
@ 2013-05-07  5:22       ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-07  5:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, May 06, 2013 at 02:49:39PM +0200, Jan Kara wrote:
...
> > > +/*
> > >   * start_this_handle: Given a handle, deal with any locking or stalling
> > >   * needed to make sure that there is enough journal space for the handle
> > >   * to begin.  Attach the handle to a transaction and set up the
> > > @@ -151,12 +237,14 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
> > >  			     gfp_t gfp_mask)
> > >  {
> > >  	transaction_t	*transaction, *new_transaction = NULL;
> > > -	tid_t		tid;
> > > -	int		needed, need_to_start;
> > >  	int		nblocks = handle->h_buffer_credits;
> > >  	unsigned long ts = jiffies;
> > >  
> > > -	if (nblocks > journal->j_max_transaction_buffers) {
> > > +	/*
> > > +	 * 1/2 of transaction can be reserved so we can practically handle
> > > +	 * only 1/2 of maximum transaction size per operation
> > > +	 */
> > 
> > Sorry, but I don't understand why we reserve only 1/2 of the maximum
> > transaction size.
>   Well, we allow 1/2 of maximum transaction size to be allocated in already
> reserved handles. So if someone submitted a request for a handle with
> more than 1/2 of maximum transaction size, then we might have to wait for
> reserved handles to be freed. That would be a slight complication in the
> code and it would also possibly introduce livelocking issues - after a
> reserved transaction is freed, someone can reserve a new one before the
> large handle creation request is satisfied. Again, this could be solved, but
> the complications simply don't seem to be worth it.

Fair enough.  Thanks for your explanation.

Regards,
                                                - Zheng


* Re: [PATCH 16/29] ext4: Improve writepage credit estimate for files with indirect blocks
  2013-04-08 21:32 ` [PATCH 16/29] ext4: Improve writepage credit estimate for files with indirect blocks Jan Kara
@ 2013-05-07  5:39   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-07  5:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:21PM +0200, Jan Kara wrote:
> ext4_ind_trans_blocks() wrongly used the 'chunk' argument to decide whether
> the blocks mapped are logically contiguous. That is wrong since the argument
> indicates whether the blocks are physically contiguous. As the blocks
> mapped are always logically contiguous and that's all
> ext4_ind_trans_blocks() cares about, just remove the 'chunk' argument.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/ext4/ext4.h     |    2 +-
>  fs/ext4/indirect.c |   27 +++++++++------------------
>  fs/ext4/inode.c    |    2 +-
>  3 files changed, 11 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3575fdb..d3a54f2 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2093,7 +2093,7 @@ extern ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>  				const struct iovec *iov, loff_t offset,
>  				unsigned long nr_segs);
>  extern int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock);
> -extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks, int chunk);
> +extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks);
>  extern void ext4_ind_truncate(struct inode *inode);
>  extern int ext4_ind_punch_hole(struct file *file, loff_t offset, loff_t length);
>  
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index b505a14..197b202 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -913,27 +913,18 @@ int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock)
>  	return (blk_bits / EXT4_ADDR_PER_BLOCK_BITS(inode->i_sb)) + 1;
>  }
>  
> -int ext4_ind_trans_blocks(struct inode *inode, int nrblocks, int chunk)
> +/*
> + * Calculate number of indirect blocks touched by mapping @nrblocks logically
> + * contiguous blocks
> + */
> +int ext4_ind_trans_blocks(struct inode *inode, int nrblocks)
>  {
> -	int indirects;
> -
> -	/* if nrblocks are contiguous */
> -	if (chunk) {
> -		/*
> -		 * With N contiguous data blocks, we need at most
> -		 * N/EXT4_ADDR_PER_BLOCK(inode->i_sb) + 1 indirect blocks,
> -		 * 2 dindirect blocks, and 1 tindirect block
> -		 */
> -		return DIV_ROUND_UP(nrblocks,
> -				    EXT4_ADDR_PER_BLOCK(inode->i_sb)) + 4;
> -	}
>  	/*
> -	 * if nrblocks are not contiguous, worse case, each block touch
> -	 * a indirect block, and each indirect block touch a double indirect
> -	 * block, plus a triple indirect block
> +	 * With N contiguous data blocks, we need at most
> +	 * N/EXT4_ADDR_PER_BLOCK(inode->i_sb) + 1 indirect blocks,
> +	 * 2 dindirect blocks, and 1 tindirect block
>  	 */
> -	indirects = nrblocks * 2 + 1;
> -	return indirects;
> +	return DIV_ROUND_UP(nrblocks, EXT4_ADDR_PER_BLOCK(inode->i_sb)) + 4;
>  }
>  
>  /*
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f4dc4a1..aa26f4c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4399,7 +4399,7 @@ int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
>  static int ext4_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
>  {
>  	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> -		return ext4_ind_trans_blocks(inode, nrblocks, chunk);
> +		return ext4_ind_trans_blocks(inode, nrblocks);
>  	return ext4_ext_index_trans_blocks(inode, nrblocks, chunk);
>  }
>  
> -- 
> 1.7.1
> 


* Re: [PATCH 17/29] ext4: Better estimate credits needed for ext4_da_writepages()
  2013-04-08 21:32 ` [PATCH 17/29] ext4: Better estimate credits needed for ext4_da_writepages() Jan Kara
@ 2013-05-07  6:33   ` Zheng Liu
  2013-05-07 14:17     ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Zheng Liu @ 2013-05-07  6:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:22PM +0200, Jan Kara wrote:
> We limit the number of blocks written in a single loop of
> ext4_da_writepages() to 64 when inode uses indirect blocks. That is
> unnecessary as the credit estimate for mapping a logically contiguous run of
> blocks is rather low even for an inode with indirect blocks. So just lift
> this limitation and properly calculate the number of necessary credits.
> 
> This better credit estimate will also later allow us to always write at
> least a single page in one iteration.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

A minor comment below.  Otherwise the patch looks good to me.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

> ---
>  fs/ext4/ext4.h    |    3 +-
>  fs/ext4/extents.c |   16 ++++++--------
>  fs/ext4/inode.c   |   58 ++++++++++++++++++++--------------------------------
>  3 files changed, 30 insertions(+), 47 deletions(-)
...
>  static int ext4_da_writepages_trans_blocks(struct inode *inode)
>  {
> -	int max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
> -
> -	/*
> -	 * With non-extent format the journal credit needed to
> -	 * insert nrblocks contiguous block is dependent on
> -	 * number of contiguous block. So we will limit
> -	 * number of contiguous block to a sane value
> -	 */
> -	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) &&
> -	    (max_blocks > EXT4_MAX_TRANS_DATA))
> -		max_blocks = EXT4_MAX_TRANS_DATA;
> +	int bpp = ext4_journal_blocks_per_page(inode);
>  
> -	return ext4_chunk_trans_blocks(inode, max_blocks);
> +	return ext4_meta_trans_blocks(inode,
> +				MAX_WRITEPAGES_EXTENT_LEN + bpp - 1, bpp);

FWIW, MAX_WRITEPAGES_EXTENT_LEN is defined in patch 18/29.  So after
applying this patch, we will get a build error.  It would be great
if MAX_WRITEPAGES_EXTENT_LEN could be defined in this commit.

Regards,
                                                - Zheng

>  }
>  
>  /*
> @@ -4396,11 +4389,12 @@ int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
>  	return 0;
>  }
>  
> -static int ext4_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
> +static int ext4_index_trans_blocks(struct inode *inode, int lblocks,
> +				   int pextents)
>  {
>  	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> -		return ext4_ind_trans_blocks(inode, nrblocks);
> -	return ext4_ext_index_trans_blocks(inode, nrblocks, chunk);
> +		return ext4_ind_trans_blocks(inode, lblocks);
> +	return ext4_ext_index_trans_blocks(inode, pextents);
>  }
>  
>  /*
> @@ -4414,7 +4408,8 @@ static int ext4_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
>   *
>   * Also account for superblock, inode, quota and xattr blocks
>   */
> -static int ext4_meta_trans_blocks(struct inode *inode, int nrblocks, int chunk)
> +static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
> +				  int pextents)
>  {
>  	ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb);
>  	int gdpblocks;
> @@ -4422,14 +4417,10 @@ static int ext4_meta_trans_blocks(struct inode *inode, int nrblocks, int chunk)
>  	int ret = 0;
>  
>  	/*
> -	 * How many index blocks need to touch to modify nrblocks?
> -	 * The "Chunk" flag indicating whether the nrblocks is
> -	 * physically contiguous on disk
> -	 *
> -	 * For Direct IO and fallocate, they calls get_block to allocate
> -	 * one single extent at a time, so they could set the "Chunk" flag
> +	 * How many index blocks need to touch to map @lblocks logical blocks
> +	 * to @pextents physical extents?
>  	 */
> -	idxblocks = ext4_index_trans_blocks(inode, nrblocks, chunk);
> +	idxblocks = ext4_index_trans_blocks(inode, lblocks, pextents);
>  
>  	ret = idxblocks;
>  
> @@ -4437,12 +4428,7 @@ static int ext4_meta_trans_blocks(struct inode *inode, int nrblocks, int chunk)
>  	 * Now let's see how many group bitmaps and group descriptors need
>  	 * to account
>  	 */
> -	groups = idxblocks;
> -	if (chunk)
> -		groups += 1;
> -	else
> -		groups += nrblocks;
> -
> +	groups = idxblocks + pextents;
>  	gdpblocks = groups;
>  	if (groups > ngroups)
>  		groups = ngroups;
> @@ -4473,7 +4459,7 @@ int ext4_writepage_trans_blocks(struct inode *inode)
>  	int bpp = ext4_journal_blocks_per_page(inode);
>  	int ret;
>  
> -	ret = ext4_meta_trans_blocks(inode, bpp, 0);
> +	ret = ext4_meta_trans_blocks(inode, bpp, bpp);
>  
>  	/* Account for data blocks for journalled mode */
>  	if (ext4_should_journal_data(inode))
> -- 
> 1.7.1
> 


* Re: [PATCH 17/29] ext4: Better estimate credits needed for ext4_da_writepages()
  2013-05-07  6:33   ` Zheng Liu
@ 2013-05-07 14:17     ` Jan Kara
  0 siblings, 0 replies; 76+ messages in thread
From: Jan Kara @ 2013-05-07 14:17 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Ted Tso, linux-ext4

On Tue 07-05-13 14:33:47, Zheng Liu wrote:
> On Mon, Apr 08, 2013 at 11:32:22PM +0200, Jan Kara wrote:
> > We limit the number of blocks written in a single loop of
> > ext4_da_writepages() to 64 when inode uses indirect blocks. That is
> > unnecessary as the credit estimate for mapping a logically contiguous run of
> > blocks is rather low even for an inode with indirect blocks. So just lift
> > this limitation and properly calculate the number of necessary credits.
> > 
> > This better credit estimate will also later allow us to always write at
> > least a single page in one iteration.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> A minor comment below.  Otherwise the patch looks good to me.
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> 
> > ---
> >  fs/ext4/ext4.h    |    3 +-
> >  fs/ext4/extents.c |   16 ++++++--------
> >  fs/ext4/inode.c   |   58 ++++++++++++++++++++--------------------------------
> >  3 files changed, 30 insertions(+), 47 deletions(-)
> ...
> >  static int ext4_da_writepages_trans_blocks(struct inode *inode)
> >  {
> > -	int max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
> > -
> > -	/*
> > -	 * With non-extent format the journal credit needed to
> > -	 * insert nrblocks contiguous block is dependent on
> > -	 * number of contiguous block. So we will limit
> > -	 * number of contiguous block to a sane value
> > -	 */
> > -	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) &&
> > -	    (max_blocks > EXT4_MAX_TRANS_DATA))
> > -		max_blocks = EXT4_MAX_TRANS_DATA;
> > +	int bpp = ext4_journal_blocks_per_page(inode);
> >  
> > -	return ext4_chunk_trans_blocks(inode, max_blocks);
> > +	return ext4_meta_trans_blocks(inode,
> > +				MAX_WRITEPAGES_EXTENT_LEN + bpp - 1, bpp);
> 
> FWIW, MAX_WRITEPAGES_EXTENT_LEN is defined in patch 18/29.  So after
> applying this patch, we will get a build error.  It would be great
> if MAX_WRITEPAGES_EXTENT_LEN could be defined in this commit.
  Good catch. Thanks! I've fixed that up.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 18/29] ext4: Restructure writeback path
  2013-04-08 21:32 ` [PATCH 18/29] ext4: Restructure writeback path Jan Kara
@ 2013-05-08  3:48   ` Zheng Liu
  2013-05-08 11:20     ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  3:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:23PM +0200, Jan Kara wrote:
> There are two issues with the current writeback path in ext4. For one we
> don't necessarily map complete pages when blocksize < pagesize and thus
> needn't do any writeback in one iteration. We always map some blocks
> though so we will eventually finish mapping the page. Just if writeback
> races with other operations on the file, forward progress is not really
> guaranteed. The second problem is that current code structure makes it
> hard to associate all the bios to some range of pages with one io_end
> structure so that unwritten extents can be converted after all the bios
> are finished. This will be especially difficult later when io_end will
> be associated with reserved transaction handle.
> 
> We restructure the writeback path to a relatively simple loop which
> first prepares an extent of pages, then maps one or more extents so that
> no page is partially mapped, and once a page is fully mapped it is
> submitted for IO. We keep all the mapping and IO submission information
> in mpage_da_data structure to somewhat reduce stack usage. Resulting
> code is somewhat shorter than the old one and hopefully also easier to
> read.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

One nit below.  Otherwise the patch looks good to me.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

> ---
>  fs/ext4/ext4.h              |   15 -
>  fs/ext4/inode.c             |  978 +++++++++++++++++++++----------------------
>  include/trace/events/ext4.h |   64 ++-
>  3 files changed, 508 insertions(+), 549 deletions(-)
...
>  /*
> - * write_cache_pages_da - walk the list of dirty pages of the given
> - * address space and accumulate pages that need writing, and call
> - * mpage_da_map_and_submit to map a single contiguous memory region
> - * and then write them.
> + * mpage_prepare_extent_to_map - find & lock contiguous range of dirty pages
> + * 				 and underlying extent to map
> + *
> + * @mpd - where to look for pages
> + *
> + * Walk dirty pages in the mapping while they are contiguous and lock them.
> + * While pages are fully mapped submit them for IO. When we find a page which
> + * isn't mapped we start accumulating extent of buffers underlying these pages
> + * that needs mapping (formed by either delayed or unwritten buffers). The
> + * extent found is returned in @mpd structure (starting at mpd->lblk with
> + * length mpd->len blocks).
>   */
> -static int write_cache_pages_da(handle_t *handle,
> -				struct address_space *mapping,
> -				struct writeback_control *wbc,
> -				struct mpage_da_data *mpd,
> -				pgoff_t *done_index)
> +static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
>  {
> -	struct buffer_head	*bh, *head;
> -	struct inode		*inode = mapping->host;
> -	struct pagevec		pvec;
> -	unsigned int		nr_pages;
> -	sector_t		logical;
> -	pgoff_t			index, end;
> -	long			nr_to_write = wbc->nr_to_write;
> -	int			i, tag, ret = 0;
> -
> -	memset(mpd, 0, sizeof(struct mpage_da_data));
> -	mpd->wbc = wbc;
> -	mpd->inode = inode;
> -	pagevec_init(&pvec, 0);
> -	index = wbc->range_start >> PAGE_CACHE_SHIFT;
> -	end = wbc->range_end >> PAGE_CACHE_SHIFT;
> +	struct address_space *mapping = mpd->inode->i_mapping;
> +	struct pagevec pvec;
> +	unsigned int nr_pages;
> +	pgoff_t index = mpd->first_page;
> +	pgoff_t end = mpd->last_page;
> +	bool first_page_found = false;
> +	int tag;
> +	int i, err = 0;
> +	int blkbits = mpd->inode->i_blkbits;
> +	ext4_lblk_t lblk;
> +	struct buffer_head *head;
>  
> -	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
> +	if (mpd->wbc->sync_mode == WB_SYNC_ALL || mpd->wbc->tagged_writepages)
>  		tag = PAGECACHE_TAG_TOWRITE;
>  	else
>  		tag = PAGECACHE_TAG_DIRTY;
>  
> -	*done_index = index;
> +	mpd->map.m_len = 0;
> +	mpd->next_page = index;

Forgot to call pagevec_init(&pvec, 0) here.

Regards,
                                                - Zheng

>  	while (index <= end) {
>  		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
>  			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
>  		if (nr_pages == 0)
> -			return 0;
> +			goto out;
>  
>  		for (i = 0; i < nr_pages; i++) {
>  			struct page *page = pvec.pages[i];
> @@ -2145,31 +2138,21 @@ static int write_cache_pages_da(handle_t *handle,
>  			if (page->index > end)
>  				goto out;
>  
> -			*done_index = page->index + 1;
> -
> -			/*
> -			 * If we can't merge this page, and we have
> -			 * accumulated an contiguous region, write it
> -			 */
> -			if ((mpd->next_page != page->index) &&
> -			    (mpd->next_page != mpd->first_page)) {
> -				mpage_da_map_and_submit(mpd);
> -				goto ret_extent_tail;
> -			}
> +			/* If we can't merge this page, we are done. */
> +			if (first_page_found && mpd->next_page != page->index)
> +				goto out;
>  
>  			lock_page(page);
> -
>  			/*
> -			 * If the page is no longer dirty, or its
> -			 * mapping no longer corresponds to inode we
> -			 * are writing (which means it has been
> -			 * truncated or invalidated), or the page is
> -			 * already under writeback and we are not
> -			 * doing a data integrity writeback, skip the page
> +			 * If the page is no longer dirty, or its mapping no
> +			 * longer corresponds to inode we are writing (which
> +			 * means it has been truncated or invalidated), or the
> +			 * page is already under writeback and we are not doing
> +			 * a data integrity writeback, skip the page
>  			 */
>  			if (!PageDirty(page) ||
>  			    (PageWriteback(page) &&
> -			     (wbc->sync_mode == WB_SYNC_NONE)) ||
> +			     (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
>  			    unlikely(page->mapping != mapping)) {
>  				unlock_page(page);
>  				continue;
> @@ -2178,101 +2161,60 @@ static int write_cache_pages_da(handle_t *handle,
>  			wait_on_page_writeback(page);
>  			BUG_ON(PageWriteback(page));
>  
> -			/*
> -			 * If we have inline data and arrive here, it means that
> -			 * we will soon create the block for the 1st page, so
> -			 * we'd better clear the inline data here.
> -			 */
> -			if (ext4_has_inline_data(inode)) {
> -				BUG_ON(ext4_test_inode_state(inode,
> -						EXT4_STATE_MAY_INLINE_DATA));
> -				ext4_destroy_inline_data(handle, inode);
> -			}
> -
> -			if (mpd->next_page != page->index)
> +			if (!first_page_found) {
>  				mpd->first_page = page->index;
> +				first_page_found = true;
> +			}
>  			mpd->next_page = page->index + 1;
> -			logical = (sector_t) page->index <<
> -				(PAGE_CACHE_SHIFT - inode->i_blkbits);
> +			lblk = ((ext4_lblk_t)page->index) <<
> +				(PAGE_CACHE_SHIFT - blkbits);
>  
>  			/* Add all dirty buffers to mpd */
>  			head = page_buffers(page);
> -			bh = head;
> -			do {
> -				BUG_ON(buffer_locked(bh));
> -				/*
> -				 * We need to try to allocate unmapped blocks
> -				 * in the same page.  Otherwise we won't make
> -				 * progress with the page in ext4_writepage
> -				 */
> -				if (ext4_bh_delay_or_unwritten(NULL, bh)) {
> -					mpage_add_bh_to_extent(mpd, logical,
> -							       bh->b_state);
> -					if (mpd->io_done)
> -						goto ret_extent_tail;
> -				} else if (buffer_dirty(bh) &&
> -					   buffer_mapped(bh)) {
> -					/*
> -					 * mapped dirty buffer. We need to
> -					 * update the b_state because we look
> -					 * at b_state in mpage_da_map_blocks.
> -					 * We don't update b_size because if we
> -					 * find an unmapped buffer_head later
> -					 * we need to use the b_state flag of
> -					 * that buffer_head.
> -					 */
> -					if (mpd->b_size == 0)
> -						mpd->b_state =
> -							bh->b_state & BH_FLAGS;
> -				}
> -				logical++;
> -			} while ((bh = bh->b_this_page) != head);
> -
> -			if (nr_to_write > 0) {
> -				nr_to_write--;
> -				if (nr_to_write == 0 &&
> -				    wbc->sync_mode == WB_SYNC_NONE)
> -					/*
> -					 * We stop writing back only if we are
> -					 * not doing integrity sync. In case of
> -					 * integrity sync we have to keep going
> -					 * because someone may be concurrently
> -					 * dirtying pages, and we might have
> -					 * synced a lot of newly appeared dirty
> -					 * pages, but have not synced all of the
> -					 * old dirty pages.
> -					 */
> +			if (!add_page_bufs_to_extent(mpd, head, head, lblk))
> +				goto out;
> +			/* So far everything mapped? Submit the page for IO. */
> +			if (mpd->map.m_len == 0) {
> +				err = mpage_submit_page(mpd, page);
> +				if (err < 0)
>  					goto out;
>  			}
> +
> +			/*
> +			 * Accumulated enough dirty pages? This doesn't apply
> +			 * to WB_SYNC_ALL mode. For integrity sync we have to
> +			 * keep going because someone may be concurrently
> +			 * dirtying pages, and we might have synced a lot of
> +			 * newly appeared dirty pages, but have not synced all
> +			 * of the old dirty pages.
> +			 */
> +			if (mpd->wbc->sync_mode == WB_SYNC_NONE &&
> +			    mpd->next_page - mpd->first_page >=
> +							mpd->wbc->nr_to_write)
> +				goto out;
>  		}
>  		pagevec_release(&pvec);
>  		cond_resched();
>  	}
>  	return 0;
> -ret_extent_tail:
> -	ret = MPAGE_DA_EXTENT_TAIL;
>  out:
>  	pagevec_release(&pvec);
> -	cond_resched();
> -	return ret;
> +	return err;
>  }
>  
> -
>  static int ext4_da_writepages(struct address_space *mapping,
>  			      struct writeback_control *wbc)
>  {
> -	pgoff_t	index;
> +	pgoff_t	writeback_index = 0;
> +	long nr_to_write = wbc->nr_to_write;
>  	int range_whole = 0;
> +	int cycled = 1;
>  	handle_t *handle = NULL;
>  	struct mpage_da_data mpd;
>  	struct inode *inode = mapping->host;
> -	int pages_written = 0;
> -	int range_cyclic, cycled = 1, io_done = 0;
>  	int needed_blocks, ret = 0;
> -	loff_t range_start = wbc->range_start;
>  	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
> -	pgoff_t done_index = 0;
> -	pgoff_t end;
> +	bool done;
>  	struct blk_plug plug;
>  
>  	trace_ext4_da_writepages(inode, wbc);
> @@ -2298,40 +2240,65 @@ static int ext4_da_writepages(struct address_space *mapping,
>  	if (unlikely(sbi->s_mount_flags & EXT4_MF_FS_ABORTED))
>  		return -EROFS;
>  
> +	/*
> +	 * If we have inline data and arrive here, it means that
> +	 * we will soon create the block for the 1st page, so
> +	 * we'd better clear the inline data here.
> +	 */
> +	if (ext4_has_inline_data(inode)) {
> +		/* Just inode will be modified... */
> +		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
> +		if (IS_ERR(handle)) {
> +			ret = PTR_ERR(handle);
> +			goto out_writepages;
> +		}
> +		BUG_ON(ext4_test_inode_state(inode,
> +				EXT4_STATE_MAY_INLINE_DATA));
> +		ext4_destroy_inline_data(handle, inode);
> +		ext4_journal_stop(handle);
> +	}
> +
>  	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
>  		range_whole = 1;
>  
> -	range_cyclic = wbc->range_cyclic;
>  	if (wbc->range_cyclic) {
> -		index = mapping->writeback_index;
> -		if (index)
> +		writeback_index = mapping->writeback_index;
> +		if (writeback_index)
>  			cycled = 0;
> -		wbc->range_start = index << PAGE_CACHE_SHIFT;
> -		wbc->range_end  = LLONG_MAX;
> -		wbc->range_cyclic = 0;
> -		end = -1;
> +		mpd.first_page = writeback_index;
> +		mpd.last_page = -1;
>  	} else {
> -		index = wbc->range_start >> PAGE_CACHE_SHIFT;
> -		end = wbc->range_end >> PAGE_CACHE_SHIFT;
> +		mpd.first_page = wbc->range_start >> PAGE_CACHE_SHIFT;
> +		mpd.last_page = wbc->range_end >> PAGE_CACHE_SHIFT;
>  	}
>  
> +	mpd.inode = inode;
> +	mpd.wbc = wbc;
> +	ext4_io_submit_init(&mpd.io_submit, wbc);
>  retry:
>  	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
> -		tag_pages_for_writeback(mapping, index, end);
> -
> +		tag_pages_for_writeback(mapping, mpd.first_page, mpd.last_page);
> +	done = false;
>  	blk_start_plug(&plug);
> -	while (!ret && wbc->nr_to_write > 0) {
> +	while (!done && mpd.first_page <= mpd.last_page) {
> +		/* For each extent of pages we use new io_end */
> +		mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
> +		if (!mpd.io_submit.io_end) {
> +			ret = -ENOMEM;
> +			break;
> +		}
>  
>  		/*
> -		 * we  insert one extent at a time. So we need
> -		 * credit needed for single extent allocation.
> -		 * journalled mode is currently not supported
> -		 * by delalloc
> +		 * We have two constraints: We find one extent to map and we
> +		 * must always write out a whole page (makes a difference when
> +		 * blocksize < pagesize) so that we don't block on IO when we
> +		 * try to write out the rest of the page. Journalled mode is
> +		 * not supported by delalloc.
>  		 */
>  		BUG_ON(ext4_should_journal_data(inode));
>  		needed_blocks = ext4_da_writepages_trans_blocks(inode);
>  
> -		/* start a new transaction*/
> +		/* start a new transaction */
>  		handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
>  					    needed_blocks);
>  		if (IS_ERR(handle)) {
> @@ -2339,76 +2306,67 @@ retry:
>  			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
>  			       "%ld pages, ino %lu; err %d", __func__,
>  				wbc->nr_to_write, inode->i_ino, ret);
> -			blk_finish_plug(&plug);
> -			goto out_writepages;
> +			/* Release allocated io_end */
> +			ext4_put_io_end(mpd.io_submit.io_end);
> +			break;
>  		}
>  
> -		/*
> -		 * Now call write_cache_pages_da() to find the next
> -		 * contiguous region of logical blocks that need
> -		 * blocks to be allocated by ext4 and submit them.
> -		 */
> -		ret = write_cache_pages_da(handle, mapping,
> -					   wbc, &mpd, &done_index);
> -		/*
> -		 * If we have a contiguous extent of pages and we
> -		 * haven't done the I/O yet, map the blocks and submit
> -		 * them for I/O.
> -		 */
> -		if (!mpd.io_done && mpd.next_page != mpd.first_page) {
> -			mpage_da_map_and_submit(&mpd);
> -			ret = MPAGE_DA_EXTENT_TAIL;
> +		trace_ext4_da_write_pages(inode, mpd.first_page, mpd.wbc);
> +		ret = mpage_prepare_extent_to_map(&mpd);
> +		if (!ret) {
> +			if (mpd.map.m_len)
> +				ret = mpage_map_and_submit_extent(handle, &mpd);
> +			else {
> +				/*
> +				 * We scanned the whole range (or exhausted
> +				 * nr_to_write), submitted what was mapped and
> +				 * didn't find anything needing mapping. We are
> +				 * done.
> +				 */
> +				done = true;
> +			}
>  		}
> -		trace_ext4_da_write_pages(inode, &mpd);
> -		wbc->nr_to_write -= mpd.pages_written;
> -
>  		ext4_journal_stop(handle);
> -
> -		if ((mpd.retval == -ENOSPC) && sbi->s_journal) {
> -			/* commit the transaction which would
> +		/* Submit prepared bio */
> +		ext4_io_submit(&mpd.io_submit);
> +		/* Unlock pages we didn't use */
> +		mpage_release_unused_pages(&mpd, false);
> +		/* Drop our io_end reference we got from init */
> +		ext4_put_io_end(mpd.io_submit.io_end);
> +
> +		if (ret == -ENOSPC && sbi->s_journal) {
> +			/*
> +			 * Commit the transaction which would
>  			 * free blocks released in the transaction
>  			 * and try again
>  			 */
>  			jbd2_journal_force_commit_nested(sbi->s_journal);
>  			ret = 0;
> -		} else if (ret == MPAGE_DA_EXTENT_TAIL) {
> -			/*
> -			 * Got one extent now try with rest of the pages.
> -			 * If mpd.retval is set -EIO, journal is aborted.
> -			 * So we don't need to write any more.
> -			 */
> -			pages_written += mpd.pages_written;
> -			ret = mpd.retval;
> -			io_done = 1;
> -		} else if (wbc->nr_to_write)
> -			/*
> -			 * There is no more writeout needed
> -			 * or we requested for a noblocking writeout
> -			 * and we found the device congested
> -			 */
> +			continue;
> +		}
> +		/* Fatal error - ENOMEM, EIO... */
> +		if (ret)
>  			break;
>  	}
>  	blk_finish_plug(&plug);
> -	if (!io_done && !cycled) {
> +	if (!ret && !cycled) {
>  		cycled = 1;
> -		index = 0;
> -		wbc->range_start = index << PAGE_CACHE_SHIFT;
> -		wbc->range_end  = mapping->writeback_index - 1;
> +		mpd.last_page = writeback_index - 1;
> +		mpd.first_page = 0;
>  		goto retry;
>  	}
>  
>  	/* Update index */
> -	wbc->range_cyclic = range_cyclic;
>  	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
>  		/*
> -		 * set the writeback_index so that range_cyclic
> +		 * Set the writeback_index so that range_cyclic
>  		 * mode will write it back later
>  		 */
> -		mapping->writeback_index = done_index;
> +		mapping->writeback_index = mpd.first_page;
>  
>  out_writepages:
> -	wbc->range_start = range_start;
> -	trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
> +	trace_ext4_da_writepages_result(inode, wbc, ret,
> +					nr_to_write - wbc->nr_to_write);
>  	return ret;
>  }
>  
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index a601bb3..203dcd5 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -332,43 +332,59 @@ TRACE_EVENT(ext4_da_writepages,
>  );
>  
>  TRACE_EVENT(ext4_da_write_pages,
> -	TP_PROTO(struct inode *inode, struct mpage_da_data *mpd),
> +	TP_PROTO(struct inode *inode, pgoff_t first_page,
> +		 struct writeback_control *wbc),
>  
> -	TP_ARGS(inode, mpd),
> +	TP_ARGS(inode, first_page, wbc),
>  
>  	TP_STRUCT__entry(
>  		__field(	dev_t,	dev			)
>  		__field(	ino_t,	ino			)
> -		__field(	__u64,	b_blocknr		)
> -		__field(	__u32,	b_size			)
> -		__field(	__u32,	b_state			)
> -		__field(	unsigned long,	first_page	)
> -		__field(	int,	io_done			)
> -		__field(	int,	pages_written		)
> -		__field(	int,	sync_mode		)
> +		__field(      pgoff_t,	first_page		)
> +		__field(	 long,	nr_to_write		)
> +		__field(	  int,	sync_mode		)
>  	),
>  
>  	TP_fast_assign(
>  		__entry->dev		= inode->i_sb->s_dev;
>  		__entry->ino		= inode->i_ino;
> -		__entry->b_blocknr	= mpd->b_blocknr;
> -		__entry->b_size		= mpd->b_size;
> -		__entry->b_state	= mpd->b_state;
> -		__entry->first_page	= mpd->first_page;
> -		__entry->io_done	= mpd->io_done;
> -		__entry->pages_written	= mpd->pages_written;
> -		__entry->sync_mode	= mpd->wbc->sync_mode;
> +		__entry->first_page	= first_page;
> +		__entry->nr_to_write	= wbc->nr_to_write;
> +		__entry->sync_mode	= wbc->sync_mode;
>  	),
>  
> -	TP_printk("dev %d,%d ino %lu b_blocknr %llu b_size %u b_state 0x%04x "
> -		  "first_page %lu io_done %d pages_written %d sync_mode %d",
> +	TP_printk("dev %d,%d ino %lu first_page %lu nr_to_write %ld "
> +		  "sync_mode %d",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> -		  (unsigned long) __entry->ino,
> -		  __entry->b_blocknr, __entry->b_size,
> -		  __entry->b_state, __entry->first_page,
> -		  __entry->io_done, __entry->pages_written,
> -		  __entry->sync_mode
> -                  )
> +		  (unsigned long) __entry->ino, __entry->first_page,
> +		  __entry->nr_to_write, __entry->sync_mode)
> +);
> +
> +TRACE_EVENT(ext4_da_write_pages_extent,
> +	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map),
> +
> +	TP_ARGS(inode, map),
> +
> +	TP_STRUCT__entry(
> +		__field(	dev_t,	dev			)
> +		__field(	ino_t,	ino			)
> +		__field(	__u64,	lblk			)
> +		__field(	__u32,	len			)
> +		__field(	__u32,	flags			)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= inode->i_sb->s_dev;
> +		__entry->ino		= inode->i_ino;
> +		__entry->lblk		= map->m_lblk;
> +		__entry->len		= map->m_len;
> +		__entry->flags		= map->m_flags;
> +	),
> +
> +	TP_printk("dev %d,%d ino %lu lblk %llu len %u flags 0x%04x",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  (unsigned long) __entry->ino, __entry->lblk, __entry->len,
> +		  __entry->flags)
>  );
>  
>  TRACE_EVENT(ext4_da_writepages_result,
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 19/29] ext4: Remove buffer_uninit handling
  2013-04-08 21:32 ` [PATCH 19/29] ext4: Remove buffer_uninit handling Jan Kara
@ 2013-05-08  6:56   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  6:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:24PM +0200, Jan Kara wrote:
> There isn't any need for setting BH_Uninit on buffers anymore. It was
> only used to signal that we need to mark the io_end as needing extent
> conversion in add_bh_to_extent(), but now we can mark the io_end directly
> when mapping the extent.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/ext4/ext4.h    |   15 ++++++---------
>  fs/ext4/inode.c   |    4 ++--
>  fs/ext4/page-io.c |    2 --
>  3 files changed, 8 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index cb1ba1c..3c3827a 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2606,20 +2606,17 @@ extern void ext4_mmp_csum_set(struct super_block *sb, struct mmp_struct *mmp);
>  extern int ext4_mmp_csum_verify(struct super_block *sb,
>  				struct mmp_struct *mmp);
>  
> -/* BH_Uninit flag: blocks are allocated but uninitialized on disk */
> +/*
> + * Note that these flags will never ever appear in a buffer_head's state flag.
> + * See EXT4_MAP_... to see where this is used.
> + */
>  enum ext4_state_bits {
>  	BH_Uninit	/* blocks are allocated but uninitialized on disk */
> -	  = BH_JBDPrivateStart,
> +	 = BH_JBDPrivateStart,
>  	BH_AllocFromCluster,	/* allocated blocks were part of already
> -				 * allocated cluster. Note that this flag will
> -				 * never, ever appear in a buffer_head's state
> -				 * flag. See EXT4_MAP_FROM_CLUSTER to see where
> -				 * this is used. */
> +				 * allocated cluster. */
>  };
>  
> -BUFFER_FNS(Uninit, uninit)
> -TAS_BUFFER_FNS(Uninit, uninit)
> -
>  /*
>   * Add new method to test whether block and inode bitmaps are properly
>   * initialized. With uninit_bg reading the block from disk is not enough
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5c191a3..0602a09 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1924,8 +1924,6 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
>  					clear_buffer_delay(bh);
>  					bh->b_blocknr = pblock++;
>  				}
> -				if (mpd->map.m_flags & EXT4_MAP_UNINIT)
> -					set_buffer_uninit(bh);
>  				clear_buffer_unwritten(bh);
>  			} while (lblk++, (bh = bh->b_this_page) != head);
>  
> @@ -1975,6 +1973,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>  	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
>  	if (err < 0)
>  		return err;
> +	if (map->m_flags & EXT4_MAP_UNINIT)
> +		ext4_set_io_unwritten_flag(inode, mpd->io_submit.io_end);
>  
>  	BUG_ON(map->m_len == 0);
>  	if (map->m_flags & EXT4_MAP_NEW) {
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index efdf0a5..cc59cd9 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -377,8 +377,6 @@ submit_and_retry:
>  	if (ret != bh->b_size)
>  		goto submit_and_retry;
>  	io_end = io->io_end;
> -	if (test_clear_buffer_uninit(bh))
> -		ext4_set_io_unwritten_flag(inode, io_end);
>  	io_end->size += bh->b_size;
>  	io->io_next_block++;
>  	return 0;
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 20/29] ext4: Use transaction reservation for extent conversion in ext4_end_io
  2013-04-08 21:32 ` [PATCH 20/29] ext4: Use transaction reservation for extent conversion in ext4_end_io Jan Kara
@ 2013-05-08  6:57   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  6:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:25PM +0200, Jan Kara wrote:
> Later we would like to clear the PageWriteback bit only after extent
> conversion from unwritten to written extents is performed. However, it is
> not possible to start a transaction after PageWriteback is set because
> that violates lock ordering (and can easily deadlock). So we have to
> reserve a transaction before locking pages and sending them for IO, and
> later we use that transaction for extent conversion from ext4_end_io().
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/ext4/ext4.h      |   12 +++++++++---
>  fs/ext4/ext4_jbd2.h |    3 ++-
>  fs/ext4/extents.c   |   39 ++++++++++++++++++++++++++++-----------
>  fs/ext4/inode.c     |   32 ++++++++++++++++++++++++++++++--
>  fs/ext4/page-io.c   |   11 ++++++++---
>  5 files changed, 77 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3c3827a..65adf0d 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -182,10 +182,13 @@ struct ext4_map_blocks {
>  #define EXT4_IO_END_DIRECT	0x0004
>  
>  /*
> - * For converting uninitialized extents on a work queue.
> + * For converting uninitialized extents on a work queue. 'handle' is used for
> + * buffered writeback.
>   */
>  typedef struct ext4_io_end {
>  	struct list_head	list;		/* per-file finished IO list */
> +	handle_t		*handle;	/* handle reserved for extent
> +						 * conversion */
>  	struct inode		*inode;		/* file being written to */
>  	unsigned int		flag;		/* unwritten or not */
>  	loff_t			offset;		/* offset in the file */
> @@ -1314,6 +1317,9 @@ static inline void ext4_set_io_unwritten_flag(struct inode *inode,
>  					      struct ext4_io_end *io_end)
>  {
>  	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> +		/* Writeback has to have conversion transaction reserved */
> +		WARN_ON(!io_end->handle &&
> +			!(io_end->flag & EXT4_IO_END_DIRECT));
>  		io_end->flag |= EXT4_IO_END_UNWRITTEN;
>  		atomic_inc(&EXT4_I(inode)->i_unwritten);
>  	}
> @@ -2550,8 +2556,8 @@ extern void ext4_ext_init(struct super_block *);
>  extern void ext4_ext_release(struct super_block *);
>  extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
>  			  loff_t len);
> -extern int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
> -			  ssize_t len);
> +extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
> +					  loff_t offset, ssize_t len);
>  extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  			   struct ext4_map_blocks *map, int flags);
>  extern int ext4_ext_calc_metadata_amount(struct inode *inode,
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index bb17931..88e95d7 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -132,7 +132,8 @@ static inline int ext4_jbd2_credits_xattr(struct inode *inode)
>  #define EXT4_HT_MIGRATE          8
>  #define EXT4_HT_MOVE_EXTENTS     9
>  #define EXT4_HT_XATTR           10
> -#define EXT4_HT_MAX             11
> +#define EXT4_HT_EXT_CONVERT     11
> +#define EXT4_HT_MAX             12
>  
>  /**
>   *   struct ext4_journal_cb_entry - Base structure for callback information.
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 8064b71..ae22735 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4484,10 +4484,9 @@ retry:
>   * function, to convert the fallocated extents after IO is completed.
>   * Returns 0 on success.
>   */
> -int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
> -				    ssize_t len)
> +int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
> +				   loff_t offset, ssize_t len)
>  {
> -	handle_t *handle;
>  	unsigned int max_blocks;
>  	int ret = 0;
>  	int ret2 = 0;
> @@ -4502,16 +4501,31 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
>  	max_blocks = ((EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) -
>  		      map.m_lblk);
>  	/*
> -	 * credits to insert 1 extent into extent tree
> +	 * This is somewhat ugly but the idea is clear: When a transaction is
> +	 * reserved, everything goes into it. Otherwise we'd rather start
> +	 * several smaller transactions for converting each extent separately.
>  	 */
> -	credits = ext4_chunk_trans_blocks(inode, max_blocks);
> +	if (handle) {
> +		handle = ext4_journal_start_reserved(handle);
> +		if (IS_ERR(handle))
> +			return PTR_ERR(handle);
> +		credits = 0;
> +	} else {
> +		/*
> +		 * credits to insert 1 extent into extent tree
> +		 */
> +		credits = ext4_chunk_trans_blocks(inode, max_blocks);
> +	}
>  	while (ret >= 0 && ret < max_blocks) {
>  		map.m_lblk += ret;
>  		map.m_len = (max_blocks -= ret);
> -		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
> -		if (IS_ERR(handle)) {
> -			ret = PTR_ERR(handle);
> -			break;
> +		if (credits) {
> +			handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> +						    credits);
> +			if (IS_ERR(handle)) {
> +				ret = PTR_ERR(handle);
> +				break;
> +			}
>  		}
>  		ret = ext4_map_blocks(handle, inode, &map,
>  				      EXT4_GET_BLOCKS_IO_CONVERT_EXT);
> @@ -4522,10 +4536,13 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
>  				     inode->i_ino, map.m_lblk,
>  				     map.m_len, ret);
>  		ext4_mark_inode_dirty(handle, inode);
> -		ret2 = ext4_journal_stop(handle);
> -		if (ret <= 0 || ret2 )
> +		if (credits)
> +			ret2 = ext4_journal_stop(handle);
> +		if (ret <= 0 || ret2)
>  			break;
>  	}
> +	if (!credits)
> +		ret2 = ext4_journal_stop(handle);
>  	return ret > 0 ? ret2 : ret;
>  }
>  
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0602a09..f8e78ce 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1327,6 +1327,8 @@ static void ext4_da_page_release_reservation(struct page *page,
>  struct mpage_da_data {
>  	struct inode *inode;
>  	struct writeback_control *wbc;
> +	handle_t *reserved_handle;	/* Handle reserved for conversion */
> +
>  	pgoff_t first_page;	/* The first page to write */
>  	pgoff_t next_page;	/* Current page to examine */
>  	pgoff_t last_page;	/* Last page to examine */
> @@ -1973,8 +1975,13 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>  	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
>  	if (err < 0)
>  		return err;
> -	if (map->m_flags & EXT4_MAP_UNINIT)
> +	if (map->m_flags & EXT4_MAP_UNINIT) {
> +		if (!mpd->io_submit.io_end->handle) {
> +			mpd->io_submit.io_end->handle = mpd->reserved_handle;
> +			mpd->reserved_handle = NULL;
> +		}
>  		ext4_set_io_unwritten_flag(inode, mpd->io_submit.io_end);
> +	}
>  
>  	BUG_ON(map->m_len == 0);
>  	if (map->m_flags & EXT4_MAP_NEW) {
> @@ -2274,6 +2281,7 @@ static int ext4_da_writepages(struct address_space *mapping,
>  
>  	mpd.inode = inode;
>  	mpd.wbc = wbc;
> +	mpd.reserved_handle = NULL;
>  	ext4_io_submit_init(&mpd.io_submit, wbc);
>  retry:
>  	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
> @@ -2288,6 +2296,23 @@ retry:
>  			break;
>  		}
>  
> +		/* Reserve handle if it may be needed for extent conversion */
> +		if (ext4_should_dioread_nolock(inode) && !mpd.reserved_handle) {
> +			/*
> +			 * We may need to convert up to one extent per block in
> +			 * the page and we may dirty the inode.
> +			 */
> +			mpd.reserved_handle = ext4_journal_reserve(inode,
> +				EXT4_HT_EXT_CONVERT,
> +				1 + (PAGE_CACHE_SIZE >> inode->i_blkbits));
> +			if (IS_ERR(mpd.reserved_handle)) {
> +				ret = PTR_ERR(mpd.reserved_handle);
> +				mpd.reserved_handle = NULL;
> +				ext4_put_io_end(mpd.io_submit.io_end);
> +				break;
> +			}
> +		}
> +
>  		/*
>  		 * We have two constraints: We find one extent to map and we
>  		 * must always write out whole page (makes a difference when
> @@ -2364,6 +2389,9 @@ retry:
>  		 */
>  		mapping->writeback_index = mpd.first_page;
>  
> +	if (mpd.reserved_handle)
> +		ext4_journal_free_reserved(mpd.reserved_handle);
> +
>  out_writepages:
>  	trace_ext4_da_writepages_result(inode, wbc, ret,
>  					nr_to_write - wbc->nr_to_write);
> @@ -2977,7 +3005,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  		 * for non AIO case, since the IO is already
>  		 * completed, we could do the conversion right here
>  		 */
> -		err = ext4_convert_unwritten_extents(inode,
> +		err = ext4_convert_unwritten_extents(NULL, inode,
>  						     offset, ret);
>  		if (err < 0)
>  			ret = err;
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index cc59cd9..e8ee4da 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -55,6 +55,7 @@ static void ext4_release_io_end(ext4_io_end_t *io_end)
>  {
>  	BUG_ON(!list_empty(&io_end->list));
>  	BUG_ON(io_end->flag & EXT4_IO_END_UNWRITTEN);
> +	WARN_ON(io_end->handle);
>  
>  	if (atomic_dec_and_test(&EXT4_I(io_end->inode)->i_ioend_count))
>  		wake_up_all(ext4_ioend_wq(io_end->inode));
> @@ -81,13 +82,15 @@ static int ext4_end_io(ext4_io_end_t *io)
>  	struct inode *inode = io->inode;
>  	loff_t offset = io->offset;
>  	ssize_t size = io->size;
> +	handle_t *handle = io->handle;
>  	int ret = 0;
>  
>  	ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
>  		   "list->prev 0x%p\n",
>  		   io, inode->i_ino, io->list.next, io->list.prev);
>  
> -	ret = ext4_convert_unwritten_extents(inode, offset, size);
> +	io->handle = NULL;	/* Following call will use up the handle */
> +	ret = ext4_convert_unwritten_extents(handle, inode, offset, size);
>  	if (ret < 0) {
>  		ext4_msg(inode->i_sb, KERN_EMERG,
>  			 "failed to convert unwritten extents to written "
> @@ -217,8 +220,10 @@ int ext4_put_io_end(ext4_io_end_t *io_end)
>  
>  	if (atomic_dec_and_test(&io_end->count)) {
>  		if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
> -			err = ext4_convert_unwritten_extents(io_end->inode,
> -						io_end->offset, io_end->size);
> +			err = ext4_convert_unwritten_extents(io_end->handle,
> +						io_end->inode, io_end->offset,
> +						io_end->size);
> +			io_end->handle = NULL;
>  			ext4_clear_io_unwritten_flag(io_end);
>  		}
>  		ext4_release_io_end(io_end);
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 21/29] ext4: Split extent conversion lists to reserved & unreserved parts
  2013-04-08 21:32 ` [PATCH 21/29] ext4: Split extent conversion lists to reserved & unreserved parts Jan Kara
@ 2013-05-08  7:03   ` Zheng Liu
  2013-05-08 11:23     ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  7:03 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:26PM +0200, Jan Kara wrote:
> Now that we have extent conversions with a reserved transaction, we have
> to prevent extent conversions without a reserved transaction (from the
> DIO code) from blocking them (as that would effectively void any
> transaction reservation we did). So split the lists, work items, and work
> queues into reserved and unreserved parts.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

I got a build error that looks like this.

 fs/ext4/page-io.c: In function ‘ext4_ioend_shutdown’:
 fs/ext4/page-io.c:60: error: ‘struct ext4_inode_info’ has no member
 named ‘i_unwritten_work’

I guess the reason is that ext4_ioend_shutdown() hadn't been added yet
when this patch set was sent out.  So please add code like the following.
Otherwise the patch looks good to me.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

Regards,
                                                - Zheng

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 3fea79e..f9ecc4f 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -57,8 +57,10 @@ void ext4_ioend_shutdown(struct inode *inode)
         * We need to make sure the work structure is finished being
         * used before we let the inode get destroyed.
         */
-       if (work_pending(&EXT4_I(inode)->i_unwritten_work))
-               cancel_work_sync(&EXT4_I(inode)->i_unwritten_work);
+       if (work_pending(&EXT4_I(inode)->i_rsv_conversion_work))
+               cancel_work_sync(&EXT4_I(inode)->i_rsv_conversion_work);
+       if (work_pending(&EXT4_I(inode)->i_unrsv_conversion_work))
+               cancel_work_sync(&EXT4_I(inode)->i_unrsv_conversion_work);
 }
 
 static void ext4_release_io_end(ext4_io_end_t *io_end)

> ---
>  fs/ext4/ext4.h    |   25 +++++++++++++++++-----
>  fs/ext4/page-io.c |   59 ++++++++++++++++++++++++++++++++++------------------
>  fs/ext4/super.c   |   38 ++++++++++++++++++++++++---------
>  3 files changed, 84 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 65adf0d..a594a94 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -889,12 +889,22 @@ struct ext4_inode_info {
>  	qsize_t i_reserved_quota;
>  #endif
>  
> -	/* completed IOs that might need unwritten extents handling */
> -	struct list_head i_completed_io_list;
> +	/* Lock protecting lists below */
>  	spinlock_t i_completed_io_lock;
> +	/*
> +	 * Completed IOs that need unwritten extents handling and have
> +	 * transaction reserved
> +	 */
> +	struct list_head i_rsv_conversion_list;
> +	/*
> +	 * Completed IOs that need unwritten extents handling and don't have
> +	 * transaction reserved
> +	 */
> +	struct list_head i_unrsv_conversion_list;
>  	atomic_t i_ioend_count;	/* Number of outstanding io_end structs */
>  	atomic_t i_unwritten; /* Nr. of inflight conversions pending */
> -	struct work_struct i_unwritten_work;	/* deferred extent conversion */
> +	struct work_struct i_rsv_conversion_work;
> +	struct work_struct i_unrsv_conversion_work;
>  
>  	spinlock_t i_block_reservation_lock;
>  
> @@ -1257,8 +1267,10 @@ struct ext4_sb_info {
>  	struct flex_groups *s_flex_groups;
>  	ext4_group_t s_flex_groups_allocated;
>  
> -	/* workqueue for dio unwritten */
> -	struct workqueue_struct *dio_unwritten_wq;
> +	/* workqueue for unreserved extent conversions (dio) */
> +	struct workqueue_struct *unrsv_conversion_wq;
> +	/* workqueue for reserved extent conversions (buffered io) */
> +	struct workqueue_struct *rsv_conversion_wq;
>  
>  	/* timer for periodic error stats printing */
>  	struct timer_list s_err_report;
> @@ -2599,7 +2611,8 @@ extern int ext4_put_io_end(ext4_io_end_t *io_end);
>  extern void ext4_put_io_end_defer(ext4_io_end_t *io_end);
>  extern void ext4_io_submit_init(struct ext4_io_submit *io,
>  				struct writeback_control *wbc);
> -extern void ext4_end_io_work(struct work_struct *work);
> +extern void ext4_end_io_rsv_work(struct work_struct *work);
> +extern void ext4_end_io_unrsv_work(struct work_struct *work);
>  extern void ext4_io_submit(struct ext4_io_submit *io);
>  extern int ext4_bio_write_page(struct ext4_io_submit *io,
>  			       struct page *page,
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index e8ee4da..8bff3b3 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -103,20 +103,17 @@ static int ext4_end_io(ext4_io_end_t *io)
>  	return ret;
>  }
>  
> -static void dump_completed_IO(struct inode *inode)
> +static void dump_completed_IO(struct inode *inode, struct list_head *head)
>  {
>  #ifdef	EXT4FS_DEBUG
>  	struct list_head *cur, *before, *after;
>  	ext4_io_end_t *io, *io0, *io1;
>  
> -	if (list_empty(&EXT4_I(inode)->i_completed_io_list)) {
> -		ext4_debug("inode %lu completed_io list is empty\n",
> -			   inode->i_ino);
> +	if (list_empty(head))
>  		return;
> -	}
>  
> -	ext4_debug("Dump inode %lu completed_io list\n", inode->i_ino);
> -	list_for_each_entry(io, &EXT4_I(inode)->i_completed_io_list, list) {
> +	ext4_debug("Dump inode %lu completed io list\n", inode->i_ino);
> +	list_for_each_entry(io, head, list) {
>  		cur = &io->list;
>  		before = cur->prev;
>  		io0 = container_of(before, ext4_io_end_t, list);
> @@ -137,16 +134,23 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
>  	unsigned long flags;
>  
>  	BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> -	wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
> -
>  	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> -	if (list_empty(&ei->i_completed_io_list))
> -		queue_work(wq, &ei->i_unwritten_work);
> -	list_add_tail(&io_end->list, &ei->i_completed_io_list);
> +	if (io_end->handle) {
> +		wq = EXT4_SB(io_end->inode->i_sb)->rsv_conversion_wq;
> +		if (list_empty(&ei->i_rsv_conversion_list))
> +			queue_work(wq, &ei->i_rsv_conversion_work);
> +		list_add_tail(&io_end->list, &ei->i_rsv_conversion_list);
> +	} else {
> +		wq = EXT4_SB(io_end->inode->i_sb)->unrsv_conversion_wq;
> +		if (list_empty(&ei->i_unrsv_conversion_list))
> +			queue_work(wq, &ei->i_unrsv_conversion_work);
> +		list_add_tail(&io_end->list, &ei->i_unrsv_conversion_list);
> +	}
>  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
>  }
>  
> -static int ext4_do_flush_completed_IO(struct inode *inode)
> +static int ext4_do_flush_completed_IO(struct inode *inode,
> +				      struct list_head *head)
>  {
>  	ext4_io_end_t *io;
>  	struct list_head unwritten;
> @@ -155,8 +159,8 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
>  	int err, ret = 0;
>  
>  	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> -	dump_completed_IO(inode);
> -	list_replace_init(&ei->i_completed_io_list, &unwritten);
> +	dump_completed_IO(inode, head);
> +	list_replace_init(head, &unwritten);
>  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
>  
>  	while (!list_empty(&unwritten)) {
> @@ -172,21 +176,34 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
>  }
>  
>  /*
> - * work on completed aio dio IO, to convert unwritten extents to extents
> + * work on completed IO, to convert unwritten extents to extents
>   */
> -void ext4_end_io_work(struct work_struct *work)
> +void ext4_end_io_rsv_work(struct work_struct *work)
>  {
>  	struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
> -						  i_unwritten_work);
> -	ext4_do_flush_completed_IO(&ei->vfs_inode);
> +						  i_rsv_conversion_work);
> +	ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_rsv_conversion_list);
> +}
> +
> +void ext4_end_io_unrsv_work(struct work_struct *work)
> +{
> +	struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
> +						  i_unrsv_conversion_work);
> +	ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_unrsv_conversion_list);
>  }
>  
>  int ext4_flush_unwritten_io(struct inode *inode)
>  {
> -	int ret;
> +	int ret, err;
> +
>  	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex) &&
>  		     !(inode->i_state & I_FREEING));
> -	ret = ext4_do_flush_completed_IO(inode);
> +	ret = ext4_do_flush_completed_IO(inode,
> +					 &EXT4_I(inode)->i_rsv_conversion_list);
> +	err = ext4_do_flush_completed_IO(inode,
> +					 &EXT4_I(inode)->i_unrsv_conversion_list);
> +	if (!ret)
> +		ret = err;
>  	ext4_unwritten_wait(inode);
>  	return ret;
>  }
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 09ff724..916c4fb 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -747,8 +747,10 @@ static void ext4_put_super(struct super_block *sb)
>  	ext4_unregister_li_request(sb);
>  	dquot_disable(sb, -1, DQUOT_USAGE_ENABLED | DQUOT_LIMITS_ENABLED);
>  
> -	flush_workqueue(sbi->dio_unwritten_wq);
> -	destroy_workqueue(sbi->dio_unwritten_wq);
> +	flush_workqueue(sbi->unrsv_conversion_wq);
> +	flush_workqueue(sbi->rsv_conversion_wq);
> +	destroy_workqueue(sbi->unrsv_conversion_wq);
> +	destroy_workqueue(sbi->rsv_conversion_wq);
>  
>  	if (sbi->s_journal) {
>  		err = jbd2_journal_destroy(sbi->s_journal);
> @@ -856,13 +858,15 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
>  	ei->i_reserved_quota = 0;
>  #endif
>  	ei->jinode = NULL;
> -	INIT_LIST_HEAD(&ei->i_completed_io_list);
> +	INIT_LIST_HEAD(&ei->i_rsv_conversion_list);
> +	INIT_LIST_HEAD(&ei->i_unrsv_conversion_list);
>  	spin_lock_init(&ei->i_completed_io_lock);
>  	ei->i_sync_tid = 0;
>  	ei->i_datasync_tid = 0;
>  	atomic_set(&ei->i_ioend_count, 0);
>  	atomic_set(&ei->i_unwritten, 0);
> -	INIT_WORK(&ei->i_unwritten_work, ext4_end_io_work);
> +	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
> +	INIT_WORK(&ei->i_unrsv_conversion_work, ext4_end_io_unrsv_work);
>  
>  	return &ei->vfs_inode;
>  }
> @@ -3867,12 +3871,20 @@ no_journal:
>  	 * The maximum number of concurrent works can be high and
>  	 * concurrency isn't really necessary.  Limit it to 1.
>  	 */
> -	EXT4_SB(sb)->dio_unwritten_wq =
> -		alloc_workqueue("ext4-dio-unwritten", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> -	if (!EXT4_SB(sb)->dio_unwritten_wq) {
> -		printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n");
> +	EXT4_SB(sb)->rsv_conversion_wq =
> +		alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> +	if (!EXT4_SB(sb)->rsv_conversion_wq) {
> +		printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
>  		ret = -ENOMEM;
> -		goto failed_mount_wq;
> +		goto failed_mount4;
> +	}
> +
> +	EXT4_SB(sb)->unrsv_conversion_wq =
> +		alloc_workqueue("ext4-unrsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> +	if (!EXT4_SB(sb)->unrsv_conversion_wq) {
> +		printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
> +		ret = -ENOMEM;
> +		goto failed_mount4;
>  	}
>  
>  	/*
> @@ -4019,7 +4031,10 @@ failed_mount4a:
>  	sb->s_root = NULL;
>  failed_mount4:
>  	ext4_msg(sb, KERN_ERR, "mount failed");
> -	destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq);
> +	if (EXT4_SB(sb)->rsv_conversion_wq)
> +		destroy_workqueue(EXT4_SB(sb)->rsv_conversion_wq);
> +	if (EXT4_SB(sb)->unrsv_conversion_wq)
> +		destroy_workqueue(EXT4_SB(sb)->unrsv_conversion_wq);
>  failed_mount_wq:
>  	if (sbi->s_journal) {
>  		jbd2_journal_destroy(sbi->s_journal);
> @@ -4464,7 +4479,8 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
>  
>  	trace_ext4_sync_fs(sb, wait);
> -	flush_workqueue(sbi->dio_unwritten_wq);
> +	flush_workqueue(sbi->rsv_conversion_wq);
> +	flush_workqueue(sbi->unrsv_conversion_wq);
>  	/*
>  	 * Writeback quota in non-journalled quota case - journalled quota has
>  	 * no dirty dquots
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 22/29] ext4: Defer clearing of PageWriteback after extent conversion
  2013-04-08 21:32 ` [PATCH 22/29] ext4: Defer clearing of PageWriteback after extent conversion Jan Kara
@ 2013-05-08  7:08   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  7:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:27PM +0200, Jan Kara wrote:
> Currently PageWriteback bit gets cleared from put_io_page() called from
> ext4_end_bio(). This is somewhat inconvenient as extent tree is not
> fully updated at that time (unwritten extents are not marked as written)
> so we cannot read the data back yet. This design was dictated by lock
> ordering as we cannot start a transaction while PageWriteback bit is set
> (we could easily deadlock with ext4_da_writepages()). But now that we
> use transaction reservation for extent conversion, locking issues are
> solved and we can move PageWriteback bit clearing after extent
> conversion is done. As a result we can remove the wait for unwritten extent
> conversion from ext4_sync_file() because it already implicitly happens
> through wait_on_page_writeback().
> 
> We implement deferring of PageWriteback clearing by queueing completed
> bios to appropriate io_end and processing all the pages when io_end is
> going to be freed instead of at the moment ext4_io_end() is called.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/ext4/ext4.h    |    2 +
>  fs/ext4/fsync.c   |    4 --
>  fs/ext4/page-io.c |  132 +++++++++++++++++++++++++++++------------------------
>  3 files changed, 74 insertions(+), 64 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index a594a94..2b0dd9a 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -190,6 +190,8 @@ typedef struct ext4_io_end {
>  	handle_t		*handle;	/* handle reserved for extent
>  						 * conversion */
>  	struct inode		*inode;		/* file being written to */
> +	struct bio		*bio;		/* Linked list of completed
> +						 * bios covering the extent */
>  	unsigned int		flag;		/* unwritten or not */
>  	loff_t			offset;		/* offset in the file */
>  	ssize_t			size;		/* size of the extent */
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 3278e64..e02ba1b 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -132,10 +132,6 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	if (inode->i_sb->s_flags & MS_RDONLY)
>  		goto out;
>  
> -	ret = ext4_flush_unwritten_io(inode);
> -	if (ret < 0)
> -		goto out;
> -
>  	if (!journal) {
>  		ret = __sync_inode(inode, datasync);
>  		if (!ret && !hlist_empty(&inode->i_dentry))
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 8bff3b3..2967794 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -51,14 +51,83 @@ void ext4_ioend_wait(struct inode *inode)
>  	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
>  }
>  
> +/*
> + * Print an buffer I/O error compatible with the fs/buffer.c.  This
> + * provides compatibility with dmesg scrapers that look for a specific
> + * buffer I/O error message.  We really need a unified error reporting
> + * structure to userspace ala Digital Unix's uerf system, but it's
> + * probably not going to happen in my lifetime, due to LKML politics...
> + */
> +static void buffer_io_error(struct buffer_head *bh)
> +{
> +	char b[BDEVNAME_SIZE];
> +	printk(KERN_ERR "Buffer I/O error on device %s, logical block %llu\n",
> +			bdevname(bh->b_bdev, b),
> +			(unsigned long long)bh->b_blocknr);
> +}
> +
> +static void ext4_finish_bio(struct bio *bio)
> +{
> +	int i;
> +	int error = !test_bit(BIO_UPTODATE, &bio->bi_flags);
> +
> +	for (i = 0; i < bio->bi_vcnt; i++) {
> +		struct bio_vec *bvec = &bio->bi_io_vec[i];
> +		struct page *page = bvec->bv_page;
> +		struct buffer_head *bh, *head;
> +		unsigned bio_start = bvec->bv_offset;
> +		unsigned bio_end = bio_start + bvec->bv_len;
> +		unsigned under_io = 0;
> +		unsigned long flags;
> +
> +		if (!page)
> +			continue;
> +
> +		if (error) {
> +			SetPageError(page);
> +			set_bit(AS_EIO, &page->mapping->flags);
> +		}
> +		bh = head = page_buffers(page);
> +		/*
> +		 * We check all buffers in the page under BH_Uptodate_Lock
> +		 * to avoid races with other end io clearing async_write flags
> +		 */
> +		local_irq_save(flags);
> +		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
> +		do {
> +			if (bh_offset(bh) < bio_start ||
> +			    bh_offset(bh) + bh->b_size > bio_end) {
> +				if (buffer_async_write(bh))
> +					under_io++;
> +				continue;
> +			}
> +			clear_buffer_async_write(bh);
> +			if (error)
> +				buffer_io_error(bh);
> +		} while ((bh = bh->b_this_page) != head);
> +		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> +		local_irq_restore(flags);
> +		if (!under_io)
> +			end_page_writeback(page);
> +	}
> +}
> +
>  static void ext4_release_io_end(ext4_io_end_t *io_end)
>  {
> +	struct bio *bio, *next_bio;
> +
>  	BUG_ON(!list_empty(&io_end->list));
>  	BUG_ON(io_end->flag & EXT4_IO_END_UNWRITTEN);
>  	WARN_ON(io_end->handle);
>  
>  	if (atomic_dec_and_test(&EXT4_I(io_end->inode)->i_ioend_count))
>  		wake_up_all(ext4_ioend_wq(io_end->inode));
> +
> +	for (bio = io_end->bio; bio; bio = next_bio) {
> +		next_bio = bio->bi_private;
> +		ext4_finish_bio(bio);
> +		bio_put(bio);
> +	}
>  	if (io_end->flag & EXT4_IO_END_DIRECT)
>  		inode_dio_done(io_end->inode);
>  	if (io_end->iocb)
> @@ -254,76 +323,20 @@ ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end)
>  	return io_end;
>  }
>  
> -/*
> - * Print an buffer I/O error compatible with the fs/buffer.c.  This
> - * provides compatibility with dmesg scrapers that look for a specific
> - * buffer I/O error message.  We really need a unified error reporting
> - * structure to userspace ala Digital Unix's uerf system, but it's
> - * probably not going to happen in my lifetime, due to LKML politics...
> - */
> -static void buffer_io_error(struct buffer_head *bh)
> -{
> -	char b[BDEVNAME_SIZE];
> -	printk(KERN_ERR "Buffer I/O error on device %s, logical block %llu\n",
> -			bdevname(bh->b_bdev, b),
> -			(unsigned long long)bh->b_blocknr);
> -}
> -
>  static void ext4_end_bio(struct bio *bio, int error)
>  {
>  	ext4_io_end_t *io_end = bio->bi_private;
>  	struct inode *inode;
> -	int i;
> -	int blocksize;
>  	sector_t bi_sector = bio->bi_sector;
>  
>  	BUG_ON(!io_end);
>  	inode = io_end->inode;
> -	blocksize = 1 << inode->i_blkbits;
> -	bio->bi_private = NULL;
>  	bio->bi_end_io = NULL;
> +	/* Link bio into list hanging from io_end */
> +	bio->bi_private = io_end->bio;
> +	io_end->bio = bio;
>  	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
>  		error = 0;
> -	for (i = 0; i < bio->bi_vcnt; i++) {
> -		struct bio_vec *bvec = &bio->bi_io_vec[i];
> -		struct page *page = bvec->bv_page;
> -		struct buffer_head *bh, *head;
> -		unsigned bio_start = bvec->bv_offset;
> -		unsigned bio_end = bio_start + bvec->bv_len;
> -		unsigned under_io = 0;
> -		unsigned long flags;
> -
> -		if (!page)
> -			continue;
> -
> -		if (error) {
> -			SetPageError(page);
> -			set_bit(AS_EIO, &page->mapping->flags);
> -		}
> -		bh = head = page_buffers(page);
> -		/*
> -		 * We check all buffers in the page under BH_Uptodate_Lock
> -		 * to avoid races with other end io clearing async_write flags
> -		 */
> -		local_irq_save(flags);
> -		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
> -		do {
> -			if (bh_offset(bh) < bio_start ||
> -			    bh_offset(bh) + blocksize > bio_end) {
> -				if (buffer_async_write(bh))
> -					under_io++;
> -				continue;
> -			}
> -			clear_buffer_async_write(bh);
> -			if (error)
> -				buffer_io_error(bh);
> -		} while ((bh = bh->b_this_page) != head);
> -		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> -		local_irq_restore(flags);
> -		if (!under_io)
> -			end_page_writeback(page);
> -	}
> -	bio_put(bio);
>  
>  	if (error) {
>  		io_end->flag |= EXT4_IO_END_ERROR;
> @@ -335,7 +348,6 @@ static void ext4_end_bio(struct bio *bio, int error)
>  			     (unsigned long long)
>  			     bi_sector >> (inode->i_blkbits - 9));
>  	}
> -
>  	ext4_put_io_end_defer(io_end);
>  }
>  
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 23/29] ext4: Protect extent conversion after DIO with i_dio_count
  2013-04-08 21:32 ` [PATCH 23/29] ext4: Protect extent conversion after DIO with i_dio_count Jan Kara
@ 2013-05-08  7:08   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  7:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:28PM +0200, Jan Kara wrote:
> Make sure extent conversion after DIO happens while i_dio_count is still
> elevated so that inode_dio_wait() waits until extent conversion is done.
> This removes the need for explicit waiting for extent conversion in some
> cases.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/ext4/inode.c |   12 ++++++++++--
>  1 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f8e78ce..f493ec2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2914,11 +2914,18 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  
>  	BUG_ON(iocb->private == NULL);
>  
> +	/*
> +	 * Make all waiters for direct IO properly wait also for extent
> +	 * conversion. This also disallows race between truncate() and
> +	 * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
> +	 */
> +	if (rw == WRITE)
> +		atomic_inc(&inode->i_dio_count);
> +
>  	/* If we do a overwrite dio, i_mutex locking can be released */
>  	overwrite = *((int *)iocb->private);
>  
>  	if (overwrite) {
> -		atomic_inc(&inode->i_dio_count);
>  		down_read(&EXT4_I(inode)->i_data_sem);
>  		mutex_unlock(&inode->i_mutex);
>  	}
> @@ -3013,9 +3020,10 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  	}
>  
>  retake_lock:
> +	if (rw == WRITE)
> +		inode_dio_done(inode);
>  	/* take i_mutex locking again if we do a ovewrite dio */
>  	if (overwrite) {
> -		inode_dio_done(inode);
>  		up_read(&EXT4_I(inode)->i_data_sem);
>  		mutex_lock(&inode->i_mutex);
>  	}
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 24/29] ext4: Remove wait for unwritten extent conversion from ext4_ext_truncate()
  2013-04-08 21:32 ` [PATCH 24/29] ext4: Remove wait for unwritten extent conversion from ext4_ext_truncate() Jan Kara
@ 2013-05-08  7:35   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  7:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:29PM +0200, Jan Kara wrote:
> Since PageWriteback bit is now cleared after extents are converted from
> unwritten to written ones, we have full exclusion of writeback path from
> truncate (truncate_inode_pages() waits for PageWriteback bits to get cleared
> on all invalidated pages). Exclusion from DIO path is achieved by
> inode_dio_wait() call in ext4_setattr(). So there's no need to wait for
> extent conversion in ext4_ext_truncate() anymore.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/ext4/extents.c |    6 ------
>  fs/ext4/page-io.c |    9 ++++++++-
>  2 files changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index ae22735..ca4ff71 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4256,12 +4256,6 @@ void ext4_ext_truncate(struct inode *inode)
>  	int err = 0;
>  
>  	/*
> -	 * finish any pending end_io work so we won't run the risk of
> -	 * converting any truncated blocks to initialized later
> -	 */
> -	ext4_flush_unwritten_io(inode);
> -
> -	/*
>  	 * probably first extent we're gonna free will be last in block
>  	 */
>  	err = ext4_writepage_trans_blocks(inode);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 2967794..2f0b943 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -145,7 +145,14 @@ static void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
>  		wake_up_all(ext4_ioend_wq(inode));
>  }
>  
> -/* check a range of space and convert unwritten extents to written. */
> +/*
> + * Check a range of space and convert unwritten extents to written. Note that
> + * we are protected from truncate touching same part of extent tree by the
> + * fact that truncate code waits for all DIO to finish (thus exclusion from
> + * direct IO is achieved) and also waits for PageWriteback bits. Thus we
> + * cannot get to ext4_ext_truncate() before all IOs overlapping that range are
> + * completed (happens from ext4_free_ioend()).
> + */
>  static int ext4_end_io(ext4_io_end_t *io)
>  {
>  	struct inode *inode = io->inode;
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 25/29] ext4: Use generic_file_fsync() in ext4_file_fsync() in nojournal mode
  2013-04-08 21:32 ` [PATCH 25/29] ext4: Use generic_file_fsync() in ext4_file_fsync() in nojournal mode Jan Kara
@ 2013-05-08  7:37   ` Zheng Liu
  2013-05-08 11:29     ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  7:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:30PM +0200, Jan Kara wrote:
> Just use the generic function instead of duplicating it. We only need
> to reshuffle the read-only check a bit (which is there to prevent
> writing to a filesystem which has been remounted read-only after error
> I assume).
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

A minor nit below.  Otherwise the patch looks good to me.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

> ---
>  fs/ext4/fsync.c |   45 ++++++++++-----------------------------------
>  1 files changed, 10 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index e02ba1b..1c08780 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -73,32 +73,6 @@ static int ext4_sync_parent(struct inode *inode)
>  	return ret;
>  }
>  
> -/**
> - * __sync_file - generic_file_fsync without the locking and filemap_write
> - * @inode:	inode to sync
> - * @datasync:	only sync essential metadata if true
> - *
> - * This is just generic_file_fsync without the locking.  This is needed for
> - * nojournal mode to make sure this inodes data/metadata makes it to disk
> - * properly.  The i_mutex should be held already.
> - */
> -static int __sync_inode(struct inode *inode, int datasync)
> -{
> -	int err;
> -	int ret;
> -
> -	ret = sync_mapping_buffers(inode->i_mapping);
> -	if (!(inode->i_state & I_DIRTY))
> -		return ret;
> -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> -		return ret;
> -
> -	err = sync_inode_metadata(inode, 1);
> -	if (ret == 0)
> -		ret = err;
> -	return ret;
> -}
> -
>  /*
>   * akpm: A new design for ext4_sync_file().
>   *
> @@ -116,7 +90,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	struct inode *inode = file->f_mapping->host;
>  	struct ext4_inode_info *ei = EXT4_I(inode);
>  	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> -	int ret, err;
> +	int ret = 0, err;
>  	tid_t commit_tid;
>  	bool needs_barrier = false;
>  
> @@ -124,21 +98,21 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  
>  	trace_ext4_sync_file_enter(file, datasync);
>  
> -	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> -	if (ret)
> -		return ret;
> -	mutex_lock(&inode->i_mutex);
> -
>  	if (inode->i_sb->s_flags & MS_RDONLY)
> -		goto out;
> +		goto out_trace;
>  
>  	if (!journal) {
> -		ret = __sync_inode(inode, datasync);
> +		ret = generic_file_fsync(file, start, end, datasync);
>  		if (!ret && !hlist_empty(&inode->i_dentry))
>  			ret = ext4_sync_parent(inode);
>  		goto out;
                     ^^^^
                goto out_trace;
Otherwise we will unlock an i_mutex that we haven't taken.

Regards
                                                - Zheng

>  	}
>  
> +	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> +	if (ret)
> +		return ret;
> +	mutex_lock(&inode->i_mutex);
> +
>  	/*
>  	 * data=writeback,ordered:
>  	 *  The caller's filemap_fdatawrite()/wait will sync the data.
> @@ -169,8 +143,9 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  		if (!ret)
>  			ret = err;
>  	}
> - out:
> +out:
>  	mutex_unlock(&inode->i_mutex);
> +out_trace:
>  	trace_ext4_sync_file_exit(inode, ret);
>  	return ret;
>  }
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 26/29] ext4: Remove i_mutex from ext4_file_sync()
  2013-04-08 21:32 ` [PATCH 26/29] ext4: Remove i_mutex from ext4_file_sync() Jan Kara
@ 2013-05-08  7:41   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  7:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:31PM +0200, Jan Kara wrote:
> After removal of ext4_flush_unwritten_io() call, ext4_file_sync()
> doesn't need i_mutex anymore. Forcing of transaction commits doesn't
> need i_mutex as there's nothing inode specific in that code apart from
> grabbing transaction ids from the inode. So remove the lock.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Due to the comment in patch 25/29, here we need to change that statement
back again.

                ret = generic_file_fsync(file, start, end, datasync);
                if (!ret && !hlist_empty(&inode->i_dentry))
                        ret = ext4_sync_parent(inode);
-               goto out_trace;
+               goto out;
        }
 
        ret = filemap_write_and_wait_range(inode->i_mapping, start, end);

Otherwise the patch looks good to me.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

Regards,
                                                - Zheng

> ---
>  fs/ext4/fsync.c |    6 +-----
>  1 files changed, 1 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 1c08780..9040faa 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -99,7 +99,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	trace_ext4_sync_file_enter(file, datasync);
>  
>  	if (inode->i_sb->s_flags & MS_RDONLY)
> -		goto out_trace;
> +		goto out;
>  
>  	if (!journal) {
>  		ret = generic_file_fsync(file, start, end, datasync);
> @@ -111,8 +111,6 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
>  	if (ret)
>  		return ret;
> -	mutex_lock(&inode->i_mutex);
> -
>  	/*
>  	 * data=writeback,ordered:
>  	 *  The caller's filemap_fdatawrite()/wait will sync the data.
> @@ -144,8 +142,6 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  			ret = err;
>  	}
>  out:
> -	mutex_unlock(&inode->i_mutex);
> -out_trace:
>  	trace_ext4_sync_file_exit(inode, ret);
>  	return ret;
>  }
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 27/29] ext4: Remove wait for unwritten extents in ext4_ind_direct_IO()
  2013-04-08 21:32 ` [PATCH 27/29] ext4: Remove wait for unwritten extents in ext4_ind_direct_IO() Jan Kara
@ 2013-05-08  7:55   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  7:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:32PM +0200, Jan Kara wrote:
> We don't have to wait for unwritten extent conversion in
> ext4_ind_direct_IO() as all writes that happened before DIO are flushed
> by the generic code and extent conversion has happened before we cleared
> PageWriteback bit.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
                                                - Zheng

> ---
>  fs/ext4/indirect.c |    5 -----
>  1 files changed, 0 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 197b202..d1c6be4 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -809,11 +809,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>  
>  retry:
>  	if (rw == READ && ext4_should_dioread_nolock(inode)) {
> -		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
> -			mutex_lock(&inode->i_mutex);
> -			ext4_flush_unwritten_io(inode);
> -			mutex_unlock(&inode->i_mutex);
> -		}
>  		/*
>  		 * Nolock dioread optimization may be dynamically disabled
>  		 * via ext4_inode_block_unlocked_dio(). Check inode's state
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 28/29] ext4: Don't wait for extent conversion in ext4_ext_punch_hole()
  2013-04-08 21:32 ` [PATCH 28/29] ext4: Don't wait for extent conversion in ext4_ext_punch_hole() Jan Kara
@ 2013-05-08  7:56   ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  7:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:33PM +0200, Jan Kara wrote:
> We don't have to wait for extent conversion in ext4_ext_punch_hole() as
> buffered IO for the punched range has been flushed and waited upon (thus
> all extent conversions for that range have completed). Also we wait for
> all DIO to finish using inode_dio_wait() so there cannot be any extent
> conversions pending due to direct IO.
> 
> Also remove ext4_flush_unwritten_io() since it's unused now.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Needs to be rebased.  Otherwise the patch looks good to me.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

Regards,
                                                - Zheng

> ---
>  fs/ext4/ext4.h    |    1 -
>  fs/ext4/extents.c |    3 ---
>  fs/ext4/page-io.c |   16 ----------------
>  3 files changed, 0 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 2b0dd9a..859f235 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1986,7 +1986,6 @@ static inline  unsigned char get_dtype(struct super_block *sb, int filetype)
>  
>  /* fsync.c */
>  extern int ext4_sync_file(struct file *, loff_t, loff_t, int);
> -extern int ext4_flush_unwritten_io(struct inode *);
>  
>  /* hash.c */
>  extern int ext4fs_dirhash(const char *name, int len, struct
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index ca4ff71..96c4855 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4697,9 +4697,6 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
>  
>  	/* Wait all existing dio workers, newcomers will block on i_mutex */
>  	ext4_inode_block_unlocked_dio(inode);
> -	err = ext4_flush_unwritten_io(inode);
> -	if (err)
> -		goto out_dio;
>  	inode_dio_wait(inode);
>  
>  	credits = ext4_writepage_trans_blocks(inode);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 2f0b943..1156b9f 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -268,22 +268,6 @@ void ext4_end_io_unrsv_work(struct work_struct *work)
>  	ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_unrsv_conversion_list);
>  }
>  
> -int ext4_flush_unwritten_io(struct inode *inode)
> -{
> -	int ret, err;
> -
> -	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex) &&
> -		     !(inode->i_state & I_FREEING));
> -	ret = ext4_do_flush_completed_IO(inode,
> -					 &EXT4_I(inode)->i_rsv_conversion_list);
> -	err = ext4_do_flush_completed_IO(inode,
> -					 &EXT4_I(inode)->i_unrsv_conversion_list);
> -	if (!ret)
> -		ret = err;
> -	ext4_unwritten_wait(inode);
> -	return ret;
> -}
> -
>  ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
>  {
>  	ext4_io_end_t *io = kmem_cache_zalloc(io_end_cachep, flags);
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 29/29] ext4: Remove ext4_ioend_wait()
  2013-04-08 21:32 ` [PATCH 29/29] ext4: Remove ext4_ioend_wait() Jan Kara
@ 2013-05-08  7:57   ` Zheng Liu
  2013-05-08 11:32     ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Zheng Liu @ 2013-05-08  7:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Apr 08, 2013 at 11:32:34PM +0200, Jan Kara wrote:
> Now that we clear PageWriteback after extent conversion, there's no need
> to wait for io_end processing in ext4_evict_inode(). Running AIO/DIO
> keeps a file reference until aio_complete() is called, so ext4_evict_inode()
> cannot be called in the meantime. For io_end structures resulting from
> buffered IO, the waiting happens implicitly because we wait for PageWriteback
> in truncate_inode_pages().
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Needs to be rebased.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

Regards,
                                                - Zheng
> ---
>  fs/ext4/ext4.h    |    1 -
>  fs/ext4/inode.c   |    5 +++--
>  fs/ext4/page-io.c |    7 -------
>  3 files changed, 3 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 859f235..b359aef 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2605,7 +2605,6 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>  /* page-io.c */
>  extern int __init ext4_init_pageio(void);
>  extern void ext4_exit_pageio(void);
> -extern void ext4_ioend_wait(struct inode *);
>  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
>  extern ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end);
>  extern int ext4_put_io_end(ext4_io_end_t *io_end);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f493ec2..1f88941 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -188,8 +188,6 @@ void ext4_evict_inode(struct inode *inode)
>  
>  	trace_ext4_evict_inode(inode);
>  
> -	ext4_ioend_wait(inode);
> -
>  	if (inode->i_nlink) {
>  		/*
>  		 * When journalling data dirty buffers are tracked only in the
> @@ -219,6 +217,8 @@ void ext4_evict_inode(struct inode *inode)
>  			filemap_write_and_wait(&inode->i_data);
>  		}
>  		truncate_inode_pages(&inode->i_data, 0);
> +
> +		WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
>  		goto no_delete;
>  	}
>  
> @@ -229,6 +229,7 @@ void ext4_evict_inode(struct inode *inode)
>  		ext4_begin_ordered_truncate(inode, 0);
>  	truncate_inode_pages(&inode->i_data, 0);
>  
> +	WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
>  	if (is_bad_inode(inode))
>  		goto no_delete;
>  
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 1156b9f..e720d4e 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -44,13 +44,6 @@ void ext4_exit_pageio(void)
>  	kmem_cache_destroy(io_end_cachep);
>  }
>  
> -void ext4_ioend_wait(struct inode *inode)
> -{
> -	wait_queue_head_t *wq = ext4_ioend_wq(inode);
> -
> -	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
> -}
> -
>  /*
>   * Print an buffer I/O error compatible with the fs/buffer.c.  This
>   * provides compatibility with dmesg scrapers that look for a specific
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 18/29] ext4: Restructure writeback path
  2013-05-08  3:48   ` Zheng Liu
@ 2013-05-08 11:20     ` Jan Kara
  0 siblings, 0 replies; 76+ messages in thread
From: Jan Kara @ 2013-05-08 11:20 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Ted Tso, linux-ext4

On Wed 08-05-13 11:48:57, Zheng Liu wrote:
> On Mon, Apr 08, 2013 at 11:32:23PM +0200, Jan Kara wrote:
> > There are two issues with the current writeback path in ext4. For one, we
> > don't necessarily map complete pages when blocksize < pagesize, and thus
> > may not do any writeback in a given iteration. We always map some blocks,
> > though, so we will eventually finish mapping the page; but if writeback
> > races with other operations on the file, forward progress is not really
> > guaranteed. The second problem is that the current code structure makes it
> > hard to associate all the bios for some range of pages with one io_end
> > structure so that unwritten extents can be converted after all the bios
> > are finished. This will be especially difficult later, when each io_end
> > will be associated with a reserved transaction handle.
> > 
> > We restructure the writeback path into a relatively simple loop which
> > first prepares an extent of pages, then maps one or more extents so that
> > no page is left partially mapped, and once a page is fully mapped it is
> > submitted for IO. We keep all the mapping and IO submission information
> > in the mpage_da_data structure to somewhat reduce stack usage. The
> > resulting code is somewhat shorter than the old one and hopefully also
> > easier to read.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> One nit below.  Otherwise the patch looks good to me.
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
  Thanks for review!

> > ---
> >  fs/ext4/ext4.h              |   15 -
> >  fs/ext4/inode.c             |  978 +++++++++++++++++++++----------------------
> >  include/trace/events/ext4.h |   64 ++-
> >  3 files changed, 508 insertions(+), 549 deletions(-)
> ...
> >  /*
> > - * write_cache_pages_da - walk the list of dirty pages of the given
> > - * address space and accumulate pages that need writing, and call
> > - * mpage_da_map_and_submit to map a single contiguous memory region
> > - * and then write them.
> > + * mpage_prepare_extent_to_map - find & lock contiguous range of dirty pages
> > + * 				 and underlying extent to map
> > + *
> > + * @mpd - where to look for pages
> > + *
> > + * Walk dirty pages in the mapping while they are contiguous and lock them.
> > + * While pages are fully mapped submit them for IO. When we find a page which
> > + * isn't mapped we start accumulating extent of buffers underlying these pages
> > + * that needs mapping (formed by either delayed or unwritten buffers). The
> > + * extent found is returned in @mpd structure (starting at mpd->lblk with
> > + * length mpd->len blocks).
> >   */
> > -static int write_cache_pages_da(handle_t *handle,
> > -				struct address_space *mapping,
> > -				struct writeback_control *wbc,
> > -				struct mpage_da_data *mpd,
> > -				pgoff_t *done_index)
> > +static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> >  {
> > -	struct buffer_head	*bh, *head;
> > -	struct inode		*inode = mapping->host;
> > -	struct pagevec		pvec;
> > -	unsigned int		nr_pages;
> > -	sector_t		logical;
> > -	pgoff_t			index, end;
> > -	long			nr_to_write = wbc->nr_to_write;
> > -	int			i, tag, ret = 0;
> > -
> > -	memset(mpd, 0, sizeof(struct mpage_da_data));
> > -	mpd->wbc = wbc;
> > -	mpd->inode = inode;
> > -	pagevec_init(&pvec, 0);
> > -	index = wbc->range_start >> PAGE_CACHE_SHIFT;
> > -	end = wbc->range_end >> PAGE_CACHE_SHIFT;
> > +	struct address_space *mapping = mpd->inode->i_mapping;
> > +	struct pagevec pvec;
> > +	unsigned int nr_pages;
> > +	pgoff_t index = mpd->first_page;
> > +	pgoff_t end = mpd->last_page;
> > +	bool first_page_found = false;
> > +	int tag;
> > +	int i, err = 0;
> > +	int blkbits = mpd->inode->i_blkbits;
> > +	ext4_lblk_t lblk;
> > +	struct buffer_head *head;
> >  
> > -	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
> > +	if (mpd->wbc->sync_mode == WB_SYNC_ALL || mpd->wbc->tagged_writepages)
> >  		tag = PAGECACHE_TAG_TOWRITE;
> >  	else
> >  		tag = PAGECACHE_TAG_DIRTY;
> >  
> > -	*done_index = index;
> > +	mpd->map.m_len = 0;
> > +	mpd->next_page = index;
> 
> Forgot to call pagevec_init(&pvec, 0) here.
  I actually don't think it can cause any problems here (pagevec_lookup()
simply overwrites the pvec, and we call pagevec_release() only after
pagevec_lookup()), but it's certainly good practice, so I added it.

								Honza

> >  	while (index <= end) {
> >  		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
> >  			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
> >  		if (nr_pages == 0)
> > -			return 0;
> > +			goto out;
> >  
> >  		for (i = 0; i < nr_pages; i++) {
> >  			struct page *page = pvec.pages[i];
> > @@ -2145,31 +2138,21 @@ static int write_cache_pages_da(handle_t *handle,
> >  			if (page->index > end)
> >  				goto out;
> >  
> > -			*done_index = page->index + 1;
> > -
> > -			/*
> > -			 * If we can't merge this page, and we have
> > -			 * accumulated an contiguous region, write it
> > -			 */
> > -			if ((mpd->next_page != page->index) &&
> > -			    (mpd->next_page != mpd->first_page)) {
> > -				mpage_da_map_and_submit(mpd);
> > -				goto ret_extent_tail;
> > -			}
> > +			/* If we can't merge this page, we are done. */
> > +			if (first_page_found && mpd->next_page != page->index)
> > +				goto out;
> >  
> >  			lock_page(page);
> > -
> >  			/*
> > -			 * If the page is no longer dirty, or its
> > -			 * mapping no longer corresponds to inode we
> > -			 * are writing (which means it has been
> > -			 * truncated or invalidated), or the page is
> > -			 * already under writeback and we are not
> > -			 * doing a data integrity writeback, skip the page
> > +			 * If the page is no longer dirty, or its mapping no
> > +			 * longer corresponds to inode we are writing (which
> > +			 * means it has been truncated or invalidated), or the
> > +			 * page is already under writeback and we are not doing
> > +			 * a data integrity writeback, skip the page
> >  			 */
> >  			if (!PageDirty(page) ||
> >  			    (PageWriteback(page) &&
> > -			     (wbc->sync_mode == WB_SYNC_NONE)) ||
> > +			     (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
> >  			    unlikely(page->mapping != mapping)) {
> >  				unlock_page(page);
> >  				continue;
> > @@ -2178,101 +2161,60 @@ static int write_cache_pages_da(handle_t *handle,
> >  			wait_on_page_writeback(page);
> >  			BUG_ON(PageWriteback(page));
> >  
> > -			/*
> > -			 * If we have inline data and arrive here, it means that
> > -			 * we will soon create the block for the 1st page, so
> > -			 * we'd better clear the inline data here.
> > -			 */
> > -			if (ext4_has_inline_data(inode)) {
> > -				BUG_ON(ext4_test_inode_state(inode,
> > -						EXT4_STATE_MAY_INLINE_DATA));
> > -				ext4_destroy_inline_data(handle, inode);
> > -			}
> > -
> > -			if (mpd->next_page != page->index)
> > +			if (!first_page_found) {
> >  				mpd->first_page = page->index;
> > +				first_page_found = true;
> > +			}
> >  			mpd->next_page = page->index + 1;
> > -			logical = (sector_t) page->index <<
> > -				(PAGE_CACHE_SHIFT - inode->i_blkbits);
> > +			lblk = ((ext4_lblk_t)page->index) <<
> > +				(PAGE_CACHE_SHIFT - blkbits);
> >  
> >  			/* Add all dirty buffers to mpd */
> >  			head = page_buffers(page);
> > -			bh = head;
> > -			do {
> > -				BUG_ON(buffer_locked(bh));
> > -				/*
> > -				 * We need to try to allocate unmapped blocks
> > -				 * in the same page.  Otherwise we won't make
> > -				 * progress with the page in ext4_writepage
> > -				 */
> > -				if (ext4_bh_delay_or_unwritten(NULL, bh)) {
> > -					mpage_add_bh_to_extent(mpd, logical,
> > -							       bh->b_state);
> > -					if (mpd->io_done)
> > -						goto ret_extent_tail;
> > -				} else if (buffer_dirty(bh) &&
> > -					   buffer_mapped(bh)) {
> > -					/*
> > -					 * mapped dirty buffer. We need to
> > -					 * update the b_state because we look
> > -					 * at b_state in mpage_da_map_blocks.
> > -					 * We don't update b_size because if we
> > -					 * find an unmapped buffer_head later
> > -					 * we need to use the b_state flag of
> > -					 * that buffer_head.
> > -					 */
> > -					if (mpd->b_size == 0)
> > -						mpd->b_state =
> > -							bh->b_state & BH_FLAGS;
> > -				}
> > -				logical++;
> > -			} while ((bh = bh->b_this_page) != head);
> > -
> > -			if (nr_to_write > 0) {
> > -				nr_to_write--;
> > -				if (nr_to_write == 0 &&
> > -				    wbc->sync_mode == WB_SYNC_NONE)
> > -					/*
> > -					 * We stop writing back only if we are
> > -					 * not doing integrity sync. In case of
> > -					 * integrity sync we have to keep going
> > -					 * because someone may be concurrently
> > -					 * dirtying pages, and we might have
> > -					 * synced a lot of newly appeared dirty
> > -					 * pages, but have not synced all of the
> > -					 * old dirty pages.
> > -					 */
> > +			if (!add_page_bufs_to_extent(mpd, head, head, lblk))
> > +				goto out;
> > +			/* So far everything mapped? Submit the page for IO. */
> > +			if (mpd->map.m_len == 0) {
> > +				err = mpage_submit_page(mpd, page);
> > +				if (err < 0)
> >  					goto out;
> >  			}
> > +
> > +			/*
> > +			 * Accumulated enough dirty pages? This doesn't apply
> > +			 * to WB_SYNC_ALL mode. For integrity sync we have to
> > +			 * keep going because someone may be concurrently
> > +			 * dirtying pages, and we might have synced a lot of
> > +			 * newly appeared dirty pages, but have not synced all
> > +			 * of the old dirty pages.
> > +			 */
> > +			if (mpd->wbc->sync_mode == WB_SYNC_NONE &&
> > +			    mpd->next_page - mpd->first_page >=
> > +							mpd->wbc->nr_to_write)
> > +				goto out;
> >  		}
> >  		pagevec_release(&pvec);
> >  		cond_resched();
> >  	}
> >  	return 0;
> > -ret_extent_tail:
> > -	ret = MPAGE_DA_EXTENT_TAIL;
> >  out:
> >  	pagevec_release(&pvec);
> > -	cond_resched();
> > -	return ret;
> > +	return err;
> >  }
> >  
> > -
> >  static int ext4_da_writepages(struct address_space *mapping,
> >  			      struct writeback_control *wbc)
> >  {
> > -	pgoff_t	index;
> > +	pgoff_t	writeback_index = 0;
> > +	long nr_to_write = wbc->nr_to_write;
> >  	int range_whole = 0;
> > +	int cycled = 1;
> >  	handle_t *handle = NULL;
> >  	struct mpage_da_data mpd;
> >  	struct inode *inode = mapping->host;
> > -	int pages_written = 0;
> > -	int range_cyclic, cycled = 1, io_done = 0;
> >  	int needed_blocks, ret = 0;
> > -	loff_t range_start = wbc->range_start;
> >  	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
> > -	pgoff_t done_index = 0;
> > -	pgoff_t end;
> > +	bool done;
> >  	struct blk_plug plug;
> >  
> >  	trace_ext4_da_writepages(inode, wbc);
> > @@ -2298,40 +2240,65 @@ static int ext4_da_writepages(struct address_space *mapping,
> >  	if (unlikely(sbi->s_mount_flags & EXT4_MF_FS_ABORTED))
> >  		return -EROFS;
> >  
> > +	/*
> > +	 * If we have inline data and arrive here, it means that
> > +	 * we will soon create the block for the 1st page, so
> > +	 * we'd better clear the inline data here.
> > +	 */
> > +	if (ext4_has_inline_data(inode)) {
> > +		/* Just inode will be modified... */
> > +		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
> > +		if (IS_ERR(handle)) {
> > +			ret = PTR_ERR(handle);
> > +			goto out_writepages;
> > +		}
> > +		BUG_ON(ext4_test_inode_state(inode,
> > +				EXT4_STATE_MAY_INLINE_DATA));
> > +		ext4_destroy_inline_data(handle, inode);
> > +		ext4_journal_stop(handle);
> > +	}
> > +
> >  	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
> >  		range_whole = 1;
> >  
> > -	range_cyclic = wbc->range_cyclic;
> >  	if (wbc->range_cyclic) {
> > -		index = mapping->writeback_index;
> > -		if (index)
> > +		writeback_index = mapping->writeback_index;
> > +		if (writeback_index)
> >  			cycled = 0;
> > -		wbc->range_start = index << PAGE_CACHE_SHIFT;
> > -		wbc->range_end  = LLONG_MAX;
> > -		wbc->range_cyclic = 0;
> > -		end = -1;
> > +		mpd.first_page = writeback_index;
> > +		mpd.last_page = -1;
> >  	} else {
> > -		index = wbc->range_start >> PAGE_CACHE_SHIFT;
> > -		end = wbc->range_end >> PAGE_CACHE_SHIFT;
> > +		mpd.first_page = wbc->range_start >> PAGE_CACHE_SHIFT;
> > +		mpd.last_page = wbc->range_end >> PAGE_CACHE_SHIFT;
> >  	}
> >  
> > +	mpd.inode = inode;
> > +	mpd.wbc = wbc;
> > +	ext4_io_submit_init(&mpd.io_submit, wbc);
> >  retry:
> >  	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
> > -		tag_pages_for_writeback(mapping, index, end);
> > -
> > +		tag_pages_for_writeback(mapping, mpd.first_page, mpd.last_page);
> > +	done = false;
> >  	blk_start_plug(&plug);
> > -	while (!ret && wbc->nr_to_write > 0) {
> > +	while (!done && mpd.first_page <= mpd.last_page) {
> > +		/* For each extent of pages we use new io_end */
> > +		mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
> > +		if (!mpd.io_submit.io_end) {
> > +			ret = -ENOMEM;
> > +			break;
> > +		}
> >  
> >  		/*
> > -		 * we  insert one extent at a time. So we need
> > -		 * credit needed for single extent allocation.
> > -		 * journalled mode is currently not supported
> > -		 * by delalloc
> > +		 * We have two constraints: We find one extent to map and we
> > +		 * must always write out whole page (makes a difference when
> > +		 * blocksize < pagesize) so that we don't block on IO when we
> > +		 * try to write out the rest of the page. Journalled mode is
> > +		 * not supported by delalloc.
> >  		 */
> >  		BUG_ON(ext4_should_journal_data(inode));
> >  		needed_blocks = ext4_da_writepages_trans_blocks(inode);
> >  
> > -		/* start a new transaction*/
> > +		/* start a new transaction */
> >  		handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
> >  					    needed_blocks);
> >  		if (IS_ERR(handle)) {
> > @@ -2339,76 +2306,67 @@ retry:
> >  			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
> >  			       "%ld pages, ino %lu; err %d", __func__,
> >  				wbc->nr_to_write, inode->i_ino, ret);
> > -			blk_finish_plug(&plug);
> > -			goto out_writepages;
> > +			/* Release allocated io_end */
> > +			ext4_put_io_end(mpd.io_submit.io_end);
> > +			break;
> >  		}
> >  
> > -		/*
> > -		 * Now call write_cache_pages_da() to find the next
> > -		 * contiguous region of logical blocks that need
> > -		 * blocks to be allocated by ext4 and submit them.
> > -		 */
> > -		ret = write_cache_pages_da(handle, mapping,
> > -					   wbc, &mpd, &done_index);
> > -		/*
> > -		 * If we have a contiguous extent of pages and we
> > -		 * haven't done the I/O yet, map the blocks and submit
> > -		 * them for I/O.
> > -		 */
> > -		if (!mpd.io_done && mpd.next_page != mpd.first_page) {
> > -			mpage_da_map_and_submit(&mpd);
> > -			ret = MPAGE_DA_EXTENT_TAIL;
> > +		trace_ext4_da_write_pages(inode, mpd.first_page, mpd.wbc);
> > +		ret = mpage_prepare_extent_to_map(&mpd);
> > +		if (!ret) {
> > +			if (mpd.map.m_len)
> > +				ret = mpage_map_and_submit_extent(handle, &mpd);
> > +			else {
> > +				/*
> > +				 * We scanned the whole range (or exhausted
> > +				 * nr_to_write), submitted what was mapped and
> > +				 * didn't find anything needing mapping. We are
> > +				 * done.
> > +				 */
> > +				done = true;
> > +			}
> >  		}
> > -		trace_ext4_da_write_pages(inode, &mpd);
> > -		wbc->nr_to_write -= mpd.pages_written;
> > -
> >  		ext4_journal_stop(handle);
> > -
> > -		if ((mpd.retval == -ENOSPC) && sbi->s_journal) {
> > -			/* commit the transaction which would
> > +		/* Submit prepared bio */
> > +		ext4_io_submit(&mpd.io_submit);
> > +		/* Unlock pages we didn't use */
> > +		mpage_release_unused_pages(&mpd, false);
> > +		/* Drop our io_end reference we got from init */
> > +		ext4_put_io_end(mpd.io_submit.io_end);
> > +
> > +		if (ret == -ENOSPC && sbi->s_journal) {
> > +			/*
> > +			 * Commit the transaction which would
> >  			 * free blocks released in the transaction
> >  			 * and try again
> >  			 */
> >  			jbd2_journal_force_commit_nested(sbi->s_journal);
> >  			ret = 0;
> > -		} else if (ret == MPAGE_DA_EXTENT_TAIL) {
> > -			/*
> > -			 * Got one extent now try with rest of the pages.
> > -			 * If mpd.retval is set -EIO, journal is aborted.
> > -			 * So we don't need to write any more.
> > -			 */
> > -			pages_written += mpd.pages_written;
> > -			ret = mpd.retval;
> > -			io_done = 1;
> > -		} else if (wbc->nr_to_write)
> > -			/*
> > -			 * There is no more writeout needed
> > -			 * or we requested for a noblocking writeout
> > -			 * and we found the device congested
> > -			 */
> > +			continue;
> > +		}
> > +		/* Fatal error - ENOMEM, EIO... */
> > +		if (ret)
> >  			break;
> >  	}
> >  	blk_finish_plug(&plug);
> > -	if (!io_done && !cycled) {
> > +	if (!ret && !cycled) {
> >  		cycled = 1;
> > -		index = 0;
> > -		wbc->range_start = index << PAGE_CACHE_SHIFT;
> > -		wbc->range_end  = mapping->writeback_index - 1;
> > +		mpd.last_page = writeback_index - 1;
> > +		mpd.first_page = 0;
> >  		goto retry;
> >  	}
> >  
> >  	/* Update index */
> > -	wbc->range_cyclic = range_cyclic;
> >  	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
> >  		/*
> > -		 * set the writeback_index so that range_cyclic
> > +		 * Set the writeback_index so that range_cyclic
> >  		 * mode will write it back later
> >  		 */
> > -		mapping->writeback_index = done_index;
> > +		mapping->writeback_index = mpd.first_page;
> >  
> >  out_writepages:
> > -	wbc->range_start = range_start;
> > -	trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
> > +	trace_ext4_da_writepages_result(inode, wbc, ret,
> > +					nr_to_write - wbc->nr_to_write);
> >  	return ret;
> >  }
> >  
> > diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> > index a601bb3..203dcd5 100644
> > --- a/include/trace/events/ext4.h
> > +++ b/include/trace/events/ext4.h
> > @@ -332,43 +332,59 @@ TRACE_EVENT(ext4_da_writepages,
> >  );
> >  
> >  TRACE_EVENT(ext4_da_write_pages,
> > -	TP_PROTO(struct inode *inode, struct mpage_da_data *mpd),
> > +	TP_PROTO(struct inode *inode, pgoff_t first_page,
> > +		 struct writeback_control *wbc),
> >  
> > -	TP_ARGS(inode, mpd),
> > +	TP_ARGS(inode, first_page, wbc),
> >  
> >  	TP_STRUCT__entry(
> >  		__field(	dev_t,	dev			)
> >  		__field(	ino_t,	ino			)
> > -		__field(	__u64,	b_blocknr		)
> > -		__field(	__u32,	b_size			)
> > -		__field(	__u32,	b_state			)
> > -		__field(	unsigned long,	first_page	)
> > -		__field(	int,	io_done			)
> > -		__field(	int,	pages_written		)
> > -		__field(	int,	sync_mode		)
> > +		__field(      pgoff_t,	first_page		)
> > +		__field(	 long,	nr_to_write		)
> > +		__field(	  int,	sync_mode		)
> >  	),
> >  
> >  	TP_fast_assign(
> >  		__entry->dev		= inode->i_sb->s_dev;
> >  		__entry->ino		= inode->i_ino;
> > -		__entry->b_blocknr	= mpd->b_blocknr;
> > -		__entry->b_size		= mpd->b_size;
> > -		__entry->b_state	= mpd->b_state;
> > -		__entry->first_page	= mpd->first_page;
> > -		__entry->io_done	= mpd->io_done;
> > -		__entry->pages_written	= mpd->pages_written;
> > -		__entry->sync_mode	= mpd->wbc->sync_mode;
> > +		__entry->first_page	= first_page;
> > +		__entry->nr_to_write	= wbc->nr_to_write;
> > +		__entry->sync_mode	= wbc->sync_mode;
> >  	),
> >  
> > -	TP_printk("dev %d,%d ino %lu b_blocknr %llu b_size %u b_state 0x%04x "
> > -		  "first_page %lu io_done %d pages_written %d sync_mode %d",
> > +	TP_printk("dev %d,%d ino %lu first_page %lu nr_to_write %ld "
> > +		  "sync_mode %d",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > -		  (unsigned long) __entry->ino,
> > -		  __entry->b_blocknr, __entry->b_size,
> > -		  __entry->b_state, __entry->first_page,
> > -		  __entry->io_done, __entry->pages_written,
> > -		  __entry->sync_mode
> > -                  )
> > +		  (unsigned long) __entry->ino, __entry->first_page,
> > +		  __entry->nr_to_write, __entry->sync_mode)
> > +);
> > +
> > +TRACE_EVENT(ext4_da_write_pages_extent,
> > +	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map),
> > +
> > +	TP_ARGS(inode, map),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	dev_t,	dev			)
> > +		__field(	ino_t,	ino			)
> > +		__field(	__u64,	lblk			)
> > +		__field(	__u32,	len			)
> > +		__field(	__u32,	flags			)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->dev		= inode->i_sb->s_dev;
> > +		__entry->ino		= inode->i_ino;
> > +		__entry->lblk		= map->m_lblk;
> > +		__entry->len		= map->m_len;
> > +		__entry->flags		= map->m_flags;
> > +	),
> > +
> > +	TP_printk("dev %d,%d ino %lu lblk %llu len %u flags 0x%04x",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > +		  (unsigned long) __entry->ino, __entry->lblk, __entry->len,
> > +		  __entry->flags)
> >  );
> >  
> >  TRACE_EVENT(ext4_da_writepages_result,
> > -- 
> > 1.7.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 21/29] ext4: Split extent conversion lists to reserved & unreserved parts
  2013-05-08  7:03   ` Zheng Liu
@ 2013-05-08 11:23     ` Jan Kara
  2013-05-08 11:49       ` Zheng Liu
  0 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2013-05-08 11:23 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Ted Tso, linux-ext4

On Wed 08-05-13 15:03:35, Zheng Liu wrote:
> On Mon, Apr 08, 2013 at 11:32:26PM +0200, Jan Kara wrote:
> > Now that we have extent conversions with a reserved transaction, we have
> > to prevent extent conversions without a reserved transaction (from the
> > DIO code) from blocking them (as that would effectively void any
> > transaction reservation we made). So split the lists, work items, and
> > work queues into reserved and unreserved parts.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> I got a build error that looks like this.
> 
>  fs/ext4/page-io.c: In function ‘ext4_ioend_shutdown’:
>  fs/ext4/page-io.c:60: error: ‘struct ext4_inode_info’ has no member
>  named ‘i_unwritten_work’
> 
> I guess the reason is that when this patch set was sent out,
> ext4_ioend_shutdown() hadn't been added yet.  So please add code
> like this.  Otherwise the patch looks good to me.
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
  Yeah, I've already rebased the series on top of current Linus's tree and
I've noticed this problem as well. It should be fixed by now. I didn't post
the rebased series yet because I'm looking into some xfstests failures I
hit when testing it...

								Honza
> 
> Regards,
>                                                 - Zheng
> 
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 3fea79e..f9ecc4f 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -57,8 +57,10 @@ void ext4_ioend_shutdown(struct inode *inode)
>          * We need to make sure the work structure is finished being
>          * used before we let the inode get destroyed.
>          */
> -       if (work_pending(&EXT4_I(inode)->i_unwritten_work))
> -               cancel_work_sync(&EXT4_I(inode)->i_unwritten_work);
> +       if (work_pending(&EXT4_I(inode)->i_rsv_conversion_work))
> +               cancel_work_sync(&EXT4_I(inode)->i_rsv_conversion_work);
> +       if (work_pending(&EXT4_I(inode)->i_unrsv_conversion_work))
> +               cancel_work_sync(&EXT4_I(inode)->i_unrsv_conversion_work);
>  }
>  
>  static void ext4_release_io_end(ext4_io_end_t *io_end)
> 
> > ---
> >  fs/ext4/ext4.h    |   25 +++++++++++++++++-----
> >  fs/ext4/page-io.c |   59 ++++++++++++++++++++++++++++++++++------------------
> >  fs/ext4/super.c   |   38 ++++++++++++++++++++++++---------
> >  3 files changed, 84 insertions(+), 38 deletions(-)
> > 
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 65adf0d..a594a94 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -889,12 +889,22 @@ struct ext4_inode_info {
> >  	qsize_t i_reserved_quota;
> >  #endif
> >  
> > -	/* completed IOs that might need unwritten extents handling */
> > -	struct list_head i_completed_io_list;
> > +	/* Lock protecting lists below */
> >  	spinlock_t i_completed_io_lock;
> > +	/*
> > +	 * Completed IOs that need unwritten extents handling and have
> > +	 * transaction reserved
> > +	 */
> > +	struct list_head i_rsv_conversion_list;
> > +	/*
> > +	 * Completed IOs that need unwritten extents handling and don't have
> > +	 * transaction reserved
> > +	 */
> > +	struct list_head i_unrsv_conversion_list;
> >  	atomic_t i_ioend_count;	/* Number of outstanding io_end structs */
> >  	atomic_t i_unwritten; /* Nr. of inflight conversions pending */
> > -	struct work_struct i_unwritten_work;	/* deferred extent conversion */
> > +	struct work_struct i_rsv_conversion_work;
> > +	struct work_struct i_unrsv_conversion_work;
> >  
> >  	spinlock_t i_block_reservation_lock;
> >  
> > @@ -1257,8 +1267,10 @@ struct ext4_sb_info {
> >  	struct flex_groups *s_flex_groups;
> >  	ext4_group_t s_flex_groups_allocated;
> >  
> > -	/* workqueue for dio unwritten */
> > -	struct workqueue_struct *dio_unwritten_wq;
> > +	/* workqueue for unreserved extent conversions (dio) */
> > +	struct workqueue_struct *unrsv_conversion_wq;
> > +	/* workqueue for reserved extent conversions (buffered io) */
> > +	struct workqueue_struct *rsv_conversion_wq;
> >  
> >  	/* timer for periodic error stats printing */
> >  	struct timer_list s_err_report;
> > @@ -2599,7 +2611,8 @@ extern int ext4_put_io_end(ext4_io_end_t *io_end);
> >  extern void ext4_put_io_end_defer(ext4_io_end_t *io_end);
> >  extern void ext4_io_submit_init(struct ext4_io_submit *io,
> >  				struct writeback_control *wbc);
> > -extern void ext4_end_io_work(struct work_struct *work);
> > +extern void ext4_end_io_rsv_work(struct work_struct *work);
> > +extern void ext4_end_io_unrsv_work(struct work_struct *work);
> >  extern void ext4_io_submit(struct ext4_io_submit *io);
> >  extern int ext4_bio_write_page(struct ext4_io_submit *io,
> >  			       struct page *page,
> > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> > index e8ee4da..8bff3b3 100644
> > --- a/fs/ext4/page-io.c
> > +++ b/fs/ext4/page-io.c
> > @@ -103,20 +103,17 @@ static int ext4_end_io(ext4_io_end_t *io)
> >  	return ret;
> >  }
> >  
> > -static void dump_completed_IO(struct inode *inode)
> > +static void dump_completed_IO(struct inode *inode, struct list_head *head)
> >  {
> >  #ifdef	EXT4FS_DEBUG
> >  	struct list_head *cur, *before, *after;
> >  	ext4_io_end_t *io, *io0, *io1;
> >  
> > -	if (list_empty(&EXT4_I(inode)->i_completed_io_list)) {
> > -		ext4_debug("inode %lu completed_io list is empty\n",
> > -			   inode->i_ino);
> > +	if (list_empty(head))
> >  		return;
> > -	}
> >  
> > -	ext4_debug("Dump inode %lu completed_io list\n", inode->i_ino);
> > -	list_for_each_entry(io, &EXT4_I(inode)->i_completed_io_list, list) {
> > +	ext4_debug("Dump inode %lu completed io list\n", inode->i_ino);
> > +	list_for_each_entry(io, head, list) {
> >  		cur = &io->list;
> >  		before = cur->prev;
> >  		io0 = container_of(before, ext4_io_end_t, list);
> > @@ -137,16 +134,23 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
> >  	unsigned long flags;
> >  
> >  	BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> > -	wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
> > -
> >  	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > -	if (list_empty(&ei->i_completed_io_list))
> > -		queue_work(wq, &ei->i_unwritten_work);
> > -	list_add_tail(&io_end->list, &ei->i_completed_io_list);
> > +	if (io_end->handle) {
> > +		wq = EXT4_SB(io_end->inode->i_sb)->rsv_conversion_wq;
> > +		if (list_empty(&ei->i_rsv_conversion_list))
> > +			queue_work(wq, &ei->i_rsv_conversion_work);
> > +		list_add_tail(&io_end->list, &ei->i_rsv_conversion_list);
> > +	} else {
> > +		wq = EXT4_SB(io_end->inode->i_sb)->unrsv_conversion_wq;
> > +		if (list_empty(&ei->i_unrsv_conversion_list))
> > +			queue_work(wq, &ei->i_unrsv_conversion_work);
> > +		list_add_tail(&io_end->list, &ei->i_unrsv_conversion_list);
> > +	}
> >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> >  }
> >  
> > -static int ext4_do_flush_completed_IO(struct inode *inode)
> > +static int ext4_do_flush_completed_IO(struct inode *inode,
> > +				      struct list_head *head)
> >  {
> >  	ext4_io_end_t *io;
> >  	struct list_head unwritten;
> > @@ -155,8 +159,8 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
> >  	int err, ret = 0;
> >  
> >  	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > -	dump_completed_IO(inode);
> > -	list_replace_init(&ei->i_completed_io_list, &unwritten);
> > +	dump_completed_IO(inode, head);
> > +	list_replace_init(head, &unwritten);
> >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> >  
> >  	while (!list_empty(&unwritten)) {
> > @@ -172,21 +176,34 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
> >  }
> >  
> >  /*
> > - * work on completed aio dio IO, to convert unwritten extents to extents
> > + * work on completed IO, to convert unwritten extents to extents
> >   */
> > -void ext4_end_io_work(struct work_struct *work)
> > +void ext4_end_io_rsv_work(struct work_struct *work)
> >  {
> >  	struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
> > -						  i_unwritten_work);
> > -	ext4_do_flush_completed_IO(&ei->vfs_inode);
> > +						  i_rsv_conversion_work);
> > +	ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_rsv_conversion_list);
> > +}
> > +
> > +void ext4_end_io_unrsv_work(struct work_struct *work)
> > +{
> > +	struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
> > +						  i_unrsv_conversion_work);
> > +	ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_unrsv_conversion_list);
> >  }
> >  
> >  int ext4_flush_unwritten_io(struct inode *inode)
> >  {
> > -	int ret;
> > +	int ret, err;
> > +
> >  	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex) &&
> >  		     !(inode->i_state & I_FREEING));
> > -	ret = ext4_do_flush_completed_IO(inode);
> > +	ret = ext4_do_flush_completed_IO(inode,
> > +					 &EXT4_I(inode)->i_rsv_conversion_list);
> > +	err = ext4_do_flush_completed_IO(inode,
> > +					 &EXT4_I(inode)->i_unrsv_conversion_list);
> > +	if (!ret)
> > +		ret = err;
> >  	ext4_unwritten_wait(inode);
> >  	return ret;
> >  }
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 09ff724..916c4fb 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -747,8 +747,10 @@ static void ext4_put_super(struct super_block *sb)
> >  	ext4_unregister_li_request(sb);
> >  	dquot_disable(sb, -1, DQUOT_USAGE_ENABLED | DQUOT_LIMITS_ENABLED);
> >  
> > -	flush_workqueue(sbi->dio_unwritten_wq);
> > -	destroy_workqueue(sbi->dio_unwritten_wq);
> > +	flush_workqueue(sbi->unrsv_conversion_wq);
> > +	flush_workqueue(sbi->rsv_conversion_wq);
> > +	destroy_workqueue(sbi->unrsv_conversion_wq);
> > +	destroy_workqueue(sbi->rsv_conversion_wq);
> >  
> >  	if (sbi->s_journal) {
> >  		err = jbd2_journal_destroy(sbi->s_journal);
> > @@ -856,13 +858,15 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
> >  	ei->i_reserved_quota = 0;
> >  #endif
> >  	ei->jinode = NULL;
> > -	INIT_LIST_HEAD(&ei->i_completed_io_list);
> > +	INIT_LIST_HEAD(&ei->i_rsv_conversion_list);
> > +	INIT_LIST_HEAD(&ei->i_unrsv_conversion_list);
> >  	spin_lock_init(&ei->i_completed_io_lock);
> >  	ei->i_sync_tid = 0;
> >  	ei->i_datasync_tid = 0;
> >  	atomic_set(&ei->i_ioend_count, 0);
> >  	atomic_set(&ei->i_unwritten, 0);
> > -	INIT_WORK(&ei->i_unwritten_work, ext4_end_io_work);
> > +	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
> > +	INIT_WORK(&ei->i_unrsv_conversion_work, ext4_end_io_unrsv_work);
> >  
> >  	return &ei->vfs_inode;
> >  }
> > @@ -3867,12 +3871,20 @@ no_journal:
> >  	 * The maximum number of concurrent works can be high and
> >  	 * concurrency isn't really necessary.  Limit it to 1.
> >  	 */
> > -	EXT4_SB(sb)->dio_unwritten_wq =
> > -		alloc_workqueue("ext4-dio-unwritten", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> > -	if (!EXT4_SB(sb)->dio_unwritten_wq) {
> > -		printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n");
> > +	EXT4_SB(sb)->rsv_conversion_wq =
> > +		alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> > +	if (!EXT4_SB(sb)->rsv_conversion_wq) {
> > +		printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
> >  		ret = -ENOMEM;
> > -		goto failed_mount_wq;
> > +		goto failed_mount4;
> > +	}
> > +
> > +	EXT4_SB(sb)->unrsv_conversion_wq =
> > +		alloc_workqueue("ext4-unrsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> > +	if (!EXT4_SB(sb)->unrsv_conversion_wq) {
> > +		printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
> > +		ret = -ENOMEM;
> > +		goto failed_mount4;
> >  	}
> >  
> >  	/*
> > @@ -4019,7 +4031,10 @@ failed_mount4a:
> >  	sb->s_root = NULL;
> >  failed_mount4:
> >  	ext4_msg(sb, KERN_ERR, "mount failed");
> > -	destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq);
> > +	if (EXT4_SB(sb)->rsv_conversion_wq)
> > +		destroy_workqueue(EXT4_SB(sb)->rsv_conversion_wq);
> > +	if (EXT4_SB(sb)->unrsv_conversion_wq)
> > +		destroy_workqueue(EXT4_SB(sb)->unrsv_conversion_wq);
> >  failed_mount_wq:
> >  	if (sbi->s_journal) {
> >  		jbd2_journal_destroy(sbi->s_journal);
> > @@ -4464,7 +4479,8 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
> >  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> >  
> >  	trace_ext4_sync_fs(sb, wait);
> > -	flush_workqueue(sbi->dio_unwritten_wq);
> > +	flush_workqueue(sbi->rsv_conversion_wq);
> > +	flush_workqueue(sbi->unrsv_conversion_wq);
> >  	/*
> >  	 * Writeback quota in non-journalled quota case - journalled quota has
> >  	 * no dirty dquots
> > -- 
> > 1.7.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 25/29] ext4: Use generic_file_fsync() in ext4_file_fsync() in nojournal mode
  2013-05-08  7:37   ` Zheng Liu
@ 2013-05-08 11:29     ` Jan Kara
  0 siblings, 0 replies; 76+ messages in thread
From: Jan Kara @ 2013-05-08 11:29 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Ted Tso, linux-ext4

On Wed 08-05-13 15:37:34, Zheng Liu wrote:
> On Mon, Apr 08, 2013 at 11:32:30PM +0200, Jan Kara wrote:
> > Just use the generic function instead of duplicating it. We only need
> > to reshuffle the read-only check a bit (which is there to prevent
> > writing to a filesystem which has been remounted read-only after error
> > I assume).
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> A minor nit below.  Otherwise the patch looks good to me.
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> 
> > ---
> >  fs/ext4/fsync.c |   45 ++++++++++-----------------------------------
> >  1 files changed, 10 insertions(+), 35 deletions(-)
> > 
> > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> > index e02ba1b..1c08780 100644
> > --- a/fs/ext4/fsync.c
> > +++ b/fs/ext4/fsync.c
> > @@ -73,32 +73,6 @@ static int ext4_sync_parent(struct inode *inode)
> >  	return ret;
> >  }
> >  
> > -/**
> > - * __sync_file - generic_file_fsync without the locking and filemap_write
> > - * @inode:	inode to sync
> > - * @datasync:	only sync essential metadata if true
> > - *
> > - * This is just generic_file_fsync without the locking.  This is needed for
> > - * nojournal mode to make sure this inodes data/metadata makes it to disk
> > - * properly.  The i_mutex should be held already.
> > - */
> > -static int __sync_inode(struct inode *inode, int datasync)
> > -{
> > -	int err;
> > -	int ret;
> > -
> > -	ret = sync_mapping_buffers(inode->i_mapping);
> > -	if (!(inode->i_state & I_DIRTY))
> > -		return ret;
> > -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > -		return ret;
> > -
> > -	err = sync_inode_metadata(inode, 1);
> > -	if (ret == 0)
> > -		ret = err;
> > -	return ret;
> > -}
> > -
> >  /*
> >   * akpm: A new design for ext4_sync_file().
> >   *
> > @@ -116,7 +90,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> >  	struct inode *inode = file->f_mapping->host;
> >  	struct ext4_inode_info *ei = EXT4_I(inode);
> >  	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> > -	int ret, err;
> > +	int ret = 0, err;
> >  	tid_t commit_tid;
> >  	bool needs_barrier = false;
> >  
> > @@ -124,21 +98,21 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> >  
> >  	trace_ext4_sync_file_enter(file, datasync);
> >  
> > -	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > -	if (ret)
> > -		return ret;
> > -	mutex_lock(&inode->i_mutex);
> > -
> >  	if (inode->i_sb->s_flags & MS_RDONLY)
> > -		goto out;
> > +		goto out_trace;
> >  
> >  	if (!journal) {
> > -		ret = __sync_inode(inode, datasync);
> > +		ret = generic_file_fsync(file, start, end, datasync);
> >  		if (!ret && !hlist_empty(&inode->i_dentry))
> >  			ret = ext4_sync_parent(inode);
> >  		goto out;
>                      ^^^^
>                 goto out_trace;
> Otherwise we will unlock i_mutex that we haven't taken.
  Ha, good catch! A following patch will fix this anyway, but it's still
worth fixing the breakage here. Thanks for the review.

								Honza

> >  	}
> >  
> > +	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > +	if (ret)
> > +		return ret;
> > +	mutex_lock(&inode->i_mutex);
> > +
> >  	/*
> >  	 * data=writeback,ordered:
> >  	 *  The caller's filemap_fdatawrite()/wait will sync the data.
> > @@ -169,8 +143,9 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> >  		if (!ret)
> >  			ret = err;
> >  	}
> > - out:
> > +out:
> >  	mutex_unlock(&inode->i_mutex);
> > +out_trace:
> >  	trace_ext4_sync_file_exit(inode, ret);
> >  	return ret;
> >  }
> > -- 
> > 1.7.1
> > 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 29/29] ext4: Remove ext4_ioend_wait()
  2013-05-08  7:57   ` Zheng Liu
@ 2013-05-08 11:32     ` Jan Kara
  0 siblings, 0 replies; 76+ messages in thread
From: Jan Kara @ 2013-05-08 11:32 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Ted Tso, linux-ext4

On Wed 08-05-13 15:57:12, Zheng Liu wrote:
> On Mon, Apr 08, 2013 at 11:32:34PM +0200, Jan Kara wrote:
> > Now that we clear PageWriteback after extent conversion, there's no need
> > to wait for io_end processing in ext4_evict_inode(). Running AIO/DIO
> > keeps a file reference until aio_complete() is called, so ext4_evict_inode()
> > cannot run. For io_end structures resulting from buffered IO, the
> > waiting happens because we wait for PageWriteback in
> > truncate_inode_pages().
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Needs to be rebased.
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
  Thanks for your effort in reviewing all the patches in this big series!
I really appreciate it.

								Honza
> > ---
> >  fs/ext4/ext4.h    |    1 -
> >  fs/ext4/inode.c   |    5 +++--
> >  fs/ext4/page-io.c |    7 -------
> >  3 files changed, 3 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 859f235..b359aef 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -2605,7 +2605,6 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
> >  /* page-io.c */
> >  extern int __init ext4_init_pageio(void);
> >  extern void ext4_exit_pageio(void);
> > -extern void ext4_ioend_wait(struct inode *);
> >  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> >  extern ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end);
> >  extern int ext4_put_io_end(ext4_io_end_t *io_end);
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index f493ec2..1f88941 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -188,8 +188,6 @@ void ext4_evict_inode(struct inode *inode)
> >  
> >  	trace_ext4_evict_inode(inode);
> >  
> > -	ext4_ioend_wait(inode);
> > -
> >  	if (inode->i_nlink) {
> >  		/*
> >  		 * When journalling data dirty buffers are tracked only in the
> > @@ -219,6 +217,8 @@ void ext4_evict_inode(struct inode *inode)
> >  			filemap_write_and_wait(&inode->i_data);
> >  		}
> >  		truncate_inode_pages(&inode->i_data, 0);
> > +
> > +		WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
> >  		goto no_delete;
> >  	}
> >  
> > @@ -229,6 +229,7 @@ void ext4_evict_inode(struct inode *inode)
> >  		ext4_begin_ordered_truncate(inode, 0);
> >  	truncate_inode_pages(&inode->i_data, 0);
> >  
> > +	WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
> >  	if (is_bad_inode(inode))
> >  		goto no_delete;
> >  
> > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> > index 1156b9f..e720d4e 100644
> > --- a/fs/ext4/page-io.c
> > +++ b/fs/ext4/page-io.c
> > @@ -44,13 +44,6 @@ void ext4_exit_pageio(void)
> >  	kmem_cache_destroy(io_end_cachep);
> >  }
> >  
> > -void ext4_ioend_wait(struct inode *inode)
> > -{
> > -	wait_queue_head_t *wq = ext4_ioend_wq(inode);
> > -
> > -	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
> > -}
> > -
> >  /*
> >   * Print an buffer I/O error compatible with the fs/buffer.c.  This
> >   * provides compatibility with dmesg scrapers that look for a specific
> > -- 
> > 1.7.1
> > 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 21/29] ext4: Split extent conversion lists to reserved & unreserved parts
  2013-05-08 11:23     ` Jan Kara
@ 2013-05-08 11:49       ` Zheng Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Zheng Liu @ 2013-05-08 11:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Wed, May 08, 2013 at 01:23:55PM +0200, Jan Kara wrote:
> On Wed 08-05-13 15:03:35, Zheng Liu wrote:
> > On Mon, Apr 08, 2013 at 11:32:26PM +0200, Jan Kara wrote:
> > > Now that we have extent conversions with a reserved transaction, we
> > > have to prevent extent conversions without a reserved transaction
> > > (from the DIO code) from blocking them (as that would effectively void
> > > any transaction reservation we did). So split the lists, work items,
> > > and work queues into reserved and unreserved parts.
> > > 
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > 
> > I got a build error that looks like this.
> > 
> >  fs/ext4/page-io.c: In function ‘ext4_ioend_shutdown’:
> >  fs/ext4/page-io.c:60: error: ‘struct ext4_inode_info’ has no member
> >  named ‘i_unwritten_work’
> > 
> > I guess the reason is that when this patch set was sent out,
> > ext4_ioend_shutdown() hadn't been added yet.  So please add code
> > like the following.  Otherwise the patch looks good to me.
> > Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
>   Yeah, I've already rebased the series on top of the current Linus tree and
> I've noticed this problem as well. It should be fixed by now. I didn't post
> the rebased series yet because I'm looking into some xfstests failures I
> hit when testing it...

Thanks for your excellent work.  Yes, I am running xfstests against your
patch set and I got a failure in test case #091 when dioread_nolock is
enabled.  It is pretty easy to trigger.  Just letting you know.

Regards,
                                                - Zheng

> > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> > index 3fea79e..f9ecc4f 100644
> > --- a/fs/ext4/page-io.c
> > +++ b/fs/ext4/page-io.c
> > @@ -57,8 +57,10 @@ void ext4_ioend_shutdown(struct inode *inode)
> >          * We need to make sure the work structure is finished being
> >          * used before we let the inode get destroyed.
> >          */
> > -       if (work_pending(&EXT4_I(inode)->i_unwritten_work))
> > -               cancel_work_sync(&EXT4_I(inode)->i_unwritten_work);
> > +       if (work_pending(&EXT4_I(inode)->i_rsv_conversion_work))
> > +               cancel_work_sync(&EXT4_I(inode)->i_rsv_conversion_work);
> > +       if (work_pending(&EXT4_I(inode)->i_unrsv_conversion_work))
> > +               cancel_work_sync(&EXT4_I(inode)->i_unrsv_conversion_work);
> >  }
> >  
> >  static void ext4_release_io_end(ext4_io_end_t *io_end)
> > 
> > > ---
> > >  fs/ext4/ext4.h    |   25 +++++++++++++++++-----
> > >  fs/ext4/page-io.c |   59 ++++++++++++++++++++++++++++++++++------------------
> > >  fs/ext4/super.c   |   38 ++++++++++++++++++++++++---------
> > >  3 files changed, 84 insertions(+), 38 deletions(-)
> > > 
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 65adf0d..a594a94 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -889,12 +889,22 @@ struct ext4_inode_info {
> > >  	qsize_t i_reserved_quota;
> > >  #endif
> > >  
> > > -	/* completed IOs that might need unwritten extents handling */
> > > -	struct list_head i_completed_io_list;
> > > +	/* Lock protecting lists below */
> > >  	spinlock_t i_completed_io_lock;
> > > +	/*
> > > +	 * Completed IOs that need unwritten extents handling and have
> > > +	 * transaction reserved
> > > +	 */
> > > +	struct list_head i_rsv_conversion_list;
> > > +	/*
> > > +	 * Completed IOs that need unwritten extents handling and don't have
> > > +	 * transaction reserved
> > > +	 */
> > > +	struct list_head i_unrsv_conversion_list;
> > >  	atomic_t i_ioend_count;	/* Number of outstanding io_end structs */
> > >  	atomic_t i_unwritten; /* Nr. of inflight conversions pending */
> > > -	struct work_struct i_unwritten_work;	/* deferred extent conversion */
> > > +	struct work_struct i_rsv_conversion_work;
> > > +	struct work_struct i_unrsv_conversion_work;
> > >  
> > >  	spinlock_t i_block_reservation_lock;
> > >  
> > > @@ -1257,8 +1267,10 @@ struct ext4_sb_info {
> > >  	struct flex_groups *s_flex_groups;
> > >  	ext4_group_t s_flex_groups_allocated;
> > >  
> > > -	/* workqueue for dio unwritten */
> > > -	struct workqueue_struct *dio_unwritten_wq;
> > > +	/* workqueue for unreserved extent conversions (dio) */
> > > +	struct workqueue_struct *unrsv_conversion_wq;
> > > +	/* workqueue for reserved extent conversions (buffered io) */
> > > +	struct workqueue_struct *rsv_conversion_wq;
> > >  
> > >  	/* timer for periodic error stats printing */
> > >  	struct timer_list s_err_report;
> > > @@ -2599,7 +2611,8 @@ extern int ext4_put_io_end(ext4_io_end_t *io_end);
> > >  extern void ext4_put_io_end_defer(ext4_io_end_t *io_end);
> > >  extern void ext4_io_submit_init(struct ext4_io_submit *io,
> > >  				struct writeback_control *wbc);
> > > -extern void ext4_end_io_work(struct work_struct *work);
> > > +extern void ext4_end_io_rsv_work(struct work_struct *work);
> > > +extern void ext4_end_io_unrsv_work(struct work_struct *work);
> > >  extern void ext4_io_submit(struct ext4_io_submit *io);
> > >  extern int ext4_bio_write_page(struct ext4_io_submit *io,
> > >  			       struct page *page,
> > > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> > > index e8ee4da..8bff3b3 100644
> > > --- a/fs/ext4/page-io.c
> > > +++ b/fs/ext4/page-io.c
> > > @@ -103,20 +103,17 @@ static int ext4_end_io(ext4_io_end_t *io)
> > >  	return ret;
> > >  }
> > >  
> > > -static void dump_completed_IO(struct inode *inode)
> > > +static void dump_completed_IO(struct inode *inode, struct list_head *head)
> > >  {
> > >  #ifdef	EXT4FS_DEBUG
> > >  	struct list_head *cur, *before, *after;
> > >  	ext4_io_end_t *io, *io0, *io1;
> > >  
> > > -	if (list_empty(&EXT4_I(inode)->i_completed_io_list)) {
> > > -		ext4_debug("inode %lu completed_io list is empty\n",
> > > -			   inode->i_ino);
> > > +	if (list_empty(head))
> > >  		return;
> > > -	}
> > >  
> > > -	ext4_debug("Dump inode %lu completed_io list\n", inode->i_ino);
> > > -	list_for_each_entry(io, &EXT4_I(inode)->i_completed_io_list, list) {
> > > +	ext4_debug("Dump inode %lu completed io list\n", inode->i_ino);
> > > +	list_for_each_entry(io, head, list) {
> > >  		cur = &io->list;
> > >  		before = cur->prev;
> > >  		io0 = container_of(before, ext4_io_end_t, list);
> > > @@ -137,16 +134,23 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
> > >  	unsigned long flags;
> > >  
> > >  	BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> > > -	wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
> > > -
> > >  	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > -	if (list_empty(&ei->i_completed_io_list))
> > > -		queue_work(wq, &ei->i_unwritten_work);
> > > -	list_add_tail(&io_end->list, &ei->i_completed_io_list);
> > > +	if (io_end->handle) {
> > > +		wq = EXT4_SB(io_end->inode->i_sb)->rsv_conversion_wq;
> > > +		if (list_empty(&ei->i_rsv_conversion_list))
> > > +			queue_work(wq, &ei->i_rsv_conversion_work);
> > > +		list_add_tail(&io_end->list, &ei->i_rsv_conversion_list);
> > > +	} else {
> > > +		wq = EXT4_SB(io_end->inode->i_sb)->unrsv_conversion_wq;
> > > +		if (list_empty(&ei->i_unrsv_conversion_list))
> > > +			queue_work(wq, &ei->i_unrsv_conversion_work);
> > > +		list_add_tail(&io_end->list, &ei->i_unrsv_conversion_list);
> > > +	}
> > >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > >  }
> > >  
> > > -static int ext4_do_flush_completed_IO(struct inode *inode)
> > > +static int ext4_do_flush_completed_IO(struct inode *inode,
> > > +				      struct list_head *head)
> > >  {
> > >  	ext4_io_end_t *io;
> > >  	struct list_head unwritten;
> > > @@ -155,8 +159,8 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
> > >  	int err, ret = 0;
> > >  
> > >  	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > -	dump_completed_IO(inode);
> > > -	list_replace_init(&ei->i_completed_io_list, &unwritten);
> > > +	dump_completed_IO(inode, head);
> > > +	list_replace_init(head, &unwritten);
> > >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > >  
> > >  	while (!list_empty(&unwritten)) {
> > > @@ -172,21 +176,34 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
> > >  }
> > >  
> > >  /*
> > > - * work on completed aio dio IO, to convert unwritten extents to extents
> > > + * work on completed IO, to convert unwritten extents to extents
> > >   */
> > > -void ext4_end_io_work(struct work_struct *work)
> > > +void ext4_end_io_rsv_work(struct work_struct *work)
> > >  {
> > >  	struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
> > > -						  i_unwritten_work);
> > > -	ext4_do_flush_completed_IO(&ei->vfs_inode);
> > > +						  i_rsv_conversion_work);
> > > +	ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_rsv_conversion_list);
> > > +}
> > > +
> > > +void ext4_end_io_unrsv_work(struct work_struct *work)
> > > +{
> > > +	struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
> > > +						  i_unrsv_conversion_work);
> > > +	ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_unrsv_conversion_list);
> > >  }
> > >  
> > >  int ext4_flush_unwritten_io(struct inode *inode)
> > >  {
> > > -	int ret;
> > > +	int ret, err;
> > > +
> > >  	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex) &&
> > >  		     !(inode->i_state & I_FREEING));
> > > -	ret = ext4_do_flush_completed_IO(inode);
> > > +	ret = ext4_do_flush_completed_IO(inode,
> > > +					 &EXT4_I(inode)->i_rsv_conversion_list);
> > > +	err = ext4_do_flush_completed_IO(inode,
> > > +					 &EXT4_I(inode)->i_unrsv_conversion_list);
> > > +	if (!ret)
> > > +		ret = err;
> > >  	ext4_unwritten_wait(inode);
> > >  	return ret;
> > >  }
> > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > index 09ff724..916c4fb 100644
> > > --- a/fs/ext4/super.c
> > > +++ b/fs/ext4/super.c
> > > @@ -747,8 +747,10 @@ static void ext4_put_super(struct super_block *sb)
> > >  	ext4_unregister_li_request(sb);
> > >  	dquot_disable(sb, -1, DQUOT_USAGE_ENABLED | DQUOT_LIMITS_ENABLED);
> > >  
> > > -	flush_workqueue(sbi->dio_unwritten_wq);
> > > -	destroy_workqueue(sbi->dio_unwritten_wq);
> > > +	flush_workqueue(sbi->unrsv_conversion_wq);
> > > +	flush_workqueue(sbi->rsv_conversion_wq);
> > > +	destroy_workqueue(sbi->unrsv_conversion_wq);
> > > +	destroy_workqueue(sbi->rsv_conversion_wq);
> > >  
> > >  	if (sbi->s_journal) {
> > >  		err = jbd2_journal_destroy(sbi->s_journal);
> > > @@ -856,13 +858,15 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
> > >  	ei->i_reserved_quota = 0;
> > >  #endif
> > >  	ei->jinode = NULL;
> > > -	INIT_LIST_HEAD(&ei->i_completed_io_list);
> > > +	INIT_LIST_HEAD(&ei->i_rsv_conversion_list);
> > > +	INIT_LIST_HEAD(&ei->i_unrsv_conversion_list);
> > >  	spin_lock_init(&ei->i_completed_io_lock);
> > >  	ei->i_sync_tid = 0;
> > >  	ei->i_datasync_tid = 0;
> > >  	atomic_set(&ei->i_ioend_count, 0);
> > >  	atomic_set(&ei->i_unwritten, 0);
> > > -	INIT_WORK(&ei->i_unwritten_work, ext4_end_io_work);
> > > +	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
> > > +	INIT_WORK(&ei->i_unrsv_conversion_work, ext4_end_io_unrsv_work);
> > >  
> > >  	return &ei->vfs_inode;
> > >  }
> > > @@ -3867,12 +3871,20 @@ no_journal:
> > >  	 * The maximum number of concurrent works can be high and
> > >  	 * concurrency isn't really necessary.  Limit it to 1.
> > >  	 */
> > > -	EXT4_SB(sb)->dio_unwritten_wq =
> > > -		alloc_workqueue("ext4-dio-unwritten", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> > > -	if (!EXT4_SB(sb)->dio_unwritten_wq) {
> > > -		printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n");
> > > +	EXT4_SB(sb)->rsv_conversion_wq =
> > > +		alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> > > +	if (!EXT4_SB(sb)->rsv_conversion_wq) {
> > > +		printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
> > >  		ret = -ENOMEM;
> > > -		goto failed_mount_wq;
> > > +		goto failed_mount4;
> > > +	}
> > > +
> > > +	EXT4_SB(sb)->unrsv_conversion_wq =
> > > +		alloc_workqueue("ext4-unrsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> > > +	if (!EXT4_SB(sb)->unrsv_conversion_wq) {
> > > +		printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
> > > +		ret = -ENOMEM;
> > > +		goto failed_mount4;
> > >  	}
> > >  
> > >  	/*
> > > @@ -4019,7 +4031,10 @@ failed_mount4a:
> > >  	sb->s_root = NULL;
> > >  failed_mount4:
> > >  	ext4_msg(sb, KERN_ERR, "mount failed");
> > > -	destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq);
> > > +	if (EXT4_SB(sb)->rsv_conversion_wq)
> > > +		destroy_workqueue(EXT4_SB(sb)->rsv_conversion_wq);
> > > +	if (EXT4_SB(sb)->unrsv_conversion_wq)
> > > +		destroy_workqueue(EXT4_SB(sb)->unrsv_conversion_wq);
> > >  failed_mount_wq:
> > >  	if (sbi->s_journal) {
> > >  		jbd2_journal_destroy(sbi->s_journal);
> > > @@ -4464,7 +4479,8 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
> > >  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> > >  
> > >  	trace_ext4_sync_fs(sb, wait);
> > > -	flush_workqueue(sbi->dio_unwritten_wq);
> > > +	flush_workqueue(sbi->rsv_conversion_wq);
> > > +	flush_workqueue(sbi->unrsv_conversion_wq);
> > >  	/*
> > >  	 * Writeback quota in non-journalled quota case - journalled quota has
> > >  	 * no dirty dquots
> > > -- 
> > > 1.7.1
> > > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2013-05-08 11:32 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-08 21:32 [PATCH 00/22 v1] Fixes and improvements in ext4 writeback path Jan Kara
2013-04-08 21:32 ` [PATCH 01/29] ext4: Make ext4_bio_write_page() use BH_Async_Write flags instead page pointers from ext4_io_end Jan Kara
2013-04-10 18:05   ` Dmitry Monakhov
2013-04-11 13:38   ` Zheng Liu
2013-04-12  3:50   ` Theodore Ts'o
2013-04-08 21:32 ` [PATCH 02/29] ext4: Use io_end for multiple bios Jan Kara
2013-04-11  5:10   ` Dmitry Monakhov
2013-04-11 14:04   ` Zheng Liu
2013-04-12  3:55   ` Theodore Ts'o
2013-04-08 21:32 ` [PATCH 03/29] ext4: Clear buffer_uninit flag when submitting IO Jan Kara
2013-04-11 14:08   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 04/29] jbd2: Reduce journal_head size Jan Kara
2013-04-11 14:10   ` Zheng Liu
2013-04-12  4:04   ` Theodore Ts'o
2013-04-08 21:32 ` [PATCH 05/29] jbd2: Don't create journal_head for temporary journal buffers Jan Kara
2013-04-12  8:01   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 06/29] jbd2: Remove journal_head from descriptor buffers Jan Kara
2013-04-12  8:10   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 07/29] jbd2: Refine waiting for shadow buffers Jan Kara
2013-05-03 14:16   ` Zheng Liu
2013-05-03 20:44     ` Jan Kara
2013-04-08 21:32 ` [PATCH 08/29] jbd2: Remove outdated comment Jan Kara
2013-05-03 14:20   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 09/29] jbd2: Cleanup needed free block estimates when starting a transaction Jan Kara
2013-05-05  8:17   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 10/29] jbd2: Fix race in t_outstanding_credits update in jbd2_journal_extend() Jan Kara
2013-05-05  8:37   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 11/29] jbd2: Remove unused waitqueues Jan Kara
2013-05-05  8:41   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 12/29] jbd2: Transaction reservation support Jan Kara
2013-05-05  9:39   ` Zheng Liu
2013-05-06 12:49     ` Jan Kara
2013-05-07  5:22       ` Zheng Liu
2013-04-08 21:32 ` [PATCH 13/29] ext4: Provide wrappers for transaction reservation calls Jan Kara
2013-05-05 11:51   ` Zheng Liu
2013-05-05 11:58   ` Zheng Liu
2013-05-06 12:51     ` Jan Kara
2013-04-08 21:32 ` [PATCH 14/29] ext4: Stop messing with nr_to_write in ext4_da_writepages() Jan Kara
2013-05-05 12:40   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 15/29] ext4: Deprecate max_writeback_mb_bump sysfs attribute Jan Kara
2013-05-05 12:47   ` Zheng Liu
2013-05-06 12:55     ` Jan Kara
2013-04-08 21:32 ` [PATCH 16/29] ext4: Improve writepage credit estimate for files with indirect blocks Jan Kara
2013-05-07  5:39   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 17/29] ext4: Better estimate credits needed for ext4_da_writepages() Jan Kara
2013-05-07  6:33   ` Zheng Liu
2013-05-07 14:17     ` Jan Kara
2013-04-08 21:32 ` [PATCH 18/29] ext4: Restructure writeback path Jan Kara
2013-05-08  3:48   ` Zheng Liu
2013-05-08 11:20     ` Jan Kara
2013-04-08 21:32 ` [PATCH 19/29] ext4: Remove buffer_uninit handling Jan Kara
2013-05-08  6:56   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 20/29] ext4: Use transaction reservation for extent conversion in ext4_end_io Jan Kara
2013-05-08  6:57   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 21/29] ext4: Split extent conversion lists to reserved & unreserved parts Jan Kara
2013-05-08  7:03   ` Zheng Liu
2013-05-08 11:23     ` Jan Kara
2013-05-08 11:49       ` Zheng Liu
2013-04-08 21:32 ` [PATCH 22/29] ext4: Defer clearing of PageWriteback after extent conversion Jan Kara
2013-05-08  7:08   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 23/29] ext4: Protect extent conversion after DIO with i_dio_count Jan Kara
2013-05-08  7:08   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 24/29] ext4: Remove wait for unwritten extent conversion from ext4_ext_truncate() Jan Kara
2013-05-08  7:35   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 25/29] ext4: Use generic_file_fsync() in ext4_file_fsync() in nojournal mode Jan Kara
2013-05-08  7:37   ` Zheng Liu
2013-05-08 11:29     ` Jan Kara
2013-04-08 21:32 ` [PATCH 26/29] ext4: Remove i_mutex from ext4_file_sync() Jan Kara
2013-05-08  7:41   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 27/29] ext4: Remove wait for unwritten extents in ext4_ind_direct_IO() Jan Kara
2013-05-08  7:55   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 28/29] ext4: Don't wait for extent conversion in ext4_ext_punch_hole() Jan Kara
2013-05-08  7:56   ` Zheng Liu
2013-04-08 21:32 ` [PATCH 29/29] ext4: Remove ext4_ioend_wait() Jan Kara
2013-05-08  7:57   ` Zheng Liu
2013-05-08 11:32     ` Jan Kara
