linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] ext4: optimize ext4 direct I/O locking for reading
       [not found] <20160921052744.5223-1-tytso@mit.edu>
@ 2016-09-21 13:26 ` Christoph Hellwig
  2016-09-21 14:37   ` Theodore Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2016-09-21 13:26 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List, Jan Kara, linux-fsdevel

On Wed, Sep 21, 2016 at 01:27:44AM -0400, Theodore Ts'o wrote:
> Even if we are not using dioread_nolock, if the blocks to be read are
> allocated and initialized, and we know we have not done any block
> allocation "recently", it safe to issue the direct I/O read without
> doing any locking.
> 
> For now we use a very simple definition of "recently".  If we have
> done any block allocations at all, we set the HAS_ALLOCATED state
> flag.  This flag is only cleared after an fsync() call.
> 
> We could also clear the HAS_ALLOCATED flag after all of the dirty
> pages for the inode have been written to disk, but tracking that is a
> bit trickier, so we don't do that for now.  In pretty much all of the
> use cases where we would care about DIO scalability, the applications
> tend to be issuing fsync(2) calls very frequently in any case.

Is there any chance you could look into simplifying the locking instead
of making it even more complicated?  Since Al replaced i_mutex with
i_rwsem you can now easily take that in shared mode.  E.g. if you'd
move to a direct I/O model closer to XFS, ocfs2 and NFS where you always
take i_rwsem in shared mode you'll get the scalibility of the
dioread_nolock case while no having to do these crazy dances, and you
can also massively simplify the codebase.  Similarly you can demote it
from exclusive to shared after allocating blocks in the write path,
and you'll end up with something way easier to understand.

Sorry for the rant, but I just had to dig into this code when looking
at converting ext4 to the new DAX path, and my eyes still bleed..

> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 282a51b..3bb3734 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1566,6 +1566,10 @@ enum {
>  					   nolocking */
>  	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
>  	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
> +	EXT4_STATE_HAS_ALLOCATED,	/* there may be some unwritten,
> +					   allocated blocks */
> +	EXT4_STATE_IS_SYNCING,		/* potentially allocated blocks
> +					   are being synced  */
>  };
>  
>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 88effb1..cb8565f 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -96,6 +96,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	struct inode *inode = file->f_mapping->host;
>  	struct ext4_inode_info *ei = EXT4_I(inode);
>  	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> +	unsigned long old_state;
>  	int ret = 0, err;
>  	tid_t commit_tid;
>  	bool needs_barrier = false;
> @@ -112,6 +113,8 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  		goto out;
>  	}
>  
> +	if (ext4_test_inode_state(inode, EXT4_STATE_HAS_ALLOCATED))
> +		ext4_set_inode_state(inode, EXT4_STATE_IS_SYNCING);
>  	if (!journal) {
>  		ret = __generic_file_fsync(file, start, end, datasync);
>  		if (!ret)
> @@ -155,6 +158,26 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  			ret = err;
>  	}
>  out:
> +#if (BITS_PER_LONG < 64)
> +	old_state = READ_ONCE(ei->i_state_flags);
> +	if (old_state & (1 << EXT4_STATE_IS_SYNCING)) {
> +		unsigned long new_state;
> +
> +		new_state = old_state & ~((1 << EXT4_STATE_IS_SYNCING) |
> +					  (1 << EXT4_STATE_HAS_ALLOCATED));
> +		cmpxchg(&ei->i_state_flags, old_state, new_state);
> +	}
> +#else
> +	old_state = READ_ONCE(ei->i_flags);
> +	if (old_state & (1UL << (32 + EXT4_STATE_IS_SYNCING))) {
> +		unsigned long new_state;
> +
> +		new_state = old_state &
> +			~((1UL << (32 + EXT4_STATE_IS_SYNCING)) |
> +			  (1UL << (32 + EXT4_STATE_HAS_ALLOCATED)));
> +		cmpxchg(&ei->i_flags, old_state, new_state);
> +	}
> +#endif
>  	trace_ext4_sync_file_exit(inode, ret);
>  	return ret;
>  }
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 9b464e5..4ce0a81 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -659,6 +659,11 @@ found:
>  				goto out_sem;
>  			}
>  		}
> +		if ((map->m_flags & EXT4_MAP_NEW) &&
> +		    !(map->m_flags & EXT4_MAP_UNWRITTEN)) {
> +			ext4_clear_inode_state(inode, EXT4_STATE_IS_SYNCING);
> +			ext4_set_inode_state(inode, EXT4_STATE_HAS_ALLOCATED);
> +		}
>  
>  		/*
>  		 * If the extent has been zeroed out, we don't need to update
> @@ -3532,20 +3537,47 @@ static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
>  	struct inode *inode = iocb->ki_filp->f_mapping->host;
>  	ssize_t ret;
>  
> -	if (ext4_should_dioread_nolock(inode)) {
> +	inode_dio_begin(inode);
> +	smp_mb();
> +	if (ext4_should_dioread_nolock(inode))
> +		unlocked = 1;
> +	else if (!ext4_has_inline_data(inode)) {
> +		struct ext4_map_blocks map;
> +		loff_t offset = iocb->ki_pos;
> +		loff_t end = offset + iov_iter_count(iter) - 1;
> +		ext4_lblk_t iblock = offset >> inode->i_blkbits;
> +		int wanted = ((offset + end) >> inode->i_blkbits) - iblock + 1;
> +		int ret;
> +
>  		/*
> -		 * Nolock dioread optimization may be dynamically disabled
> -		 * via ext4_inode_block_unlocked_dio(). Check inode's state
> -		 * while holding extra i_dio_count ref.
> +		 * If the blocks we are going to read are all
> +		 * allocated and initialized, and we haven't allocated
> +		 * any blocks to this inode recently, it is safe to do
> +		 * an unlocked DIO read.  (We do this check with
> +		 * i_dio_count elevated, so we don't have to worry
> +		 * about any racing truncate or punch hole
> +		 * operations.)
>  		 */
> -		inode_dio_begin(inode);
> -		smp_mb();
> -		if (unlikely(ext4_test_inode_state(inode,
> -						    EXT4_STATE_DIOREAD_LOCK)))
> -			inode_dio_end(inode);
> -		else
> +		while (wanted) {
> +			map.m_lblk = iblock;
> +			map.m_len = wanted;
> +
> +			ret = ext4_map_blocks(NULL, inode, &map, 0);
> +			if ((ret <= 0) ||
> +			    (map.m_flags & EXT4_MAP_UNWRITTEN))
> +				break;
> +			iblock += ret;
> +			wanted -= ret;
> +		}
> +		if ((wanted == 0) &&
> +		    !ext4_test_inode_state(inode, EXT4_STATE_HAS_ALLOCATED))
>  			unlocked = 1;
>  	}
> +	if (unlocked &&
> +	    unlikely(ext4_test_inode_state(inode,
> +					   EXT4_STATE_DIOREAD_LOCK)))
> +		unlocked = 0;
> +
>  	if (IS_DAX(inode)) {
>  		ret = dax_do_io(iocb, inode, iter, ext4_dio_get_block,
>  				NULL, unlocked ? 0 : DIO_LOCKING);
> @@ -3555,8 +3587,7 @@ static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
>  					   NULL, NULL,
>  					   unlocked ? 0 : DIO_LOCKING);
>  	}
> -	if (unlocked)
> -		inode_dio_end(inode);
> +	inode_dio_end(inode);
>  	return ret;
>  }
>  
> -- 
> 2.9.0.243.g5c589a7.dirty
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
---end quoted text---

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: optimize ext4 direct I/O locking for reading
  2016-09-21 13:26 ` [PATCH] ext4: optimize ext4 direct I/O locking for reading Christoph Hellwig
@ 2016-09-21 14:37   ` Theodore Ts'o
  2016-09-22 12:31     ` Jan Kara
  0 siblings, 1 reply; 7+ messages in thread
From: Theodore Ts'o @ 2016-09-21 14:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ext4 Developers List, Jan Kara, linux-fsdevel

On Wed, Sep 21, 2016 at 06:26:09AM -0700, Christoph Hellwig wrote:
> Is there any chance you could look into simplifying the locking instead
> of making it even more complicated?  Since Al replaced i_mutex with
> i_rwsem you can now easily take that in shared mode.  E.g. if you'd
> move to a direct I/O model closer to XFS, ocfs2 and NFS where you always
> take i_rwsem in shared mode you'll get the scalibility of the
> dioread_nolock case while no having to do these crazy dances, and you
> can also massively simplify the codebase.  Similarly you can demote it
> from exclusive to shared after allocating blocks in the write path,
> and you'll end up with something way easier to understand.

Unfortunately, in order to do this we need to extend the
dioread_nolock handling for sub-page block sizes.  (This is where we
insert the allocated blocks into the extent maps marked uninitialized,
and only converting the extent from uninitialized to initialized ---
which today only works when the page size == block size.)

This is on my todo list, but half of the problem is the mess caused by
needing to iterate over the circularly linked buffer heads when there
are multiple buffer heads covering the page.  I was originally
assuming that it would be easier to fix this after doing the bh ->
iomap conversion, but it's in a while before I looked into this
particular change.  I can try to take a closer look again....

The main reason why I looked into this hack --- and I will be the
first to agree it was a hack, is because I had a request to support
the dioread_nolock scalability on a Little Endian PowerPC system which
has 64k page sizes.

> Sorry for the rant, but I just had to dig into this code when looking
> at converting ext4 to the new DAX path, and my eyes still bleed..

Yeah, I know, and I'm sorry.  There's quite a bit of technical debt
there, which I do want to clean up.

							- Ted

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: optimize ext4 direct I/O locking for reading
  2016-09-21 14:37   ` Theodore Ts'o
@ 2016-09-22 12:31     ` Jan Kara
  2016-09-22 13:18       ` Christoph Hellwig
  2016-09-22 13:30       ` Theodore Ts'o
  0 siblings, 2 replies; 7+ messages in thread
From: Jan Kara @ 2016-09-22 12:31 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christoph Hellwig, Ext4 Developers List, Jan Kara, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2478 bytes --]

On Wed 21-09-16 10:37:48, Ted Tso wrote:
> On Wed, Sep 21, 2016 at 06:26:09AM -0700, Christoph Hellwig wrote:
> > Is there any chance you could look into simplifying the locking instead
> > of making it even more complicated?  Since Al replaced i_mutex with
> > i_rwsem you can now easily take that in shared mode.  E.g. if you'd
> > move to a direct I/O model closer to XFS, ocfs2 and NFS where you always
> > take i_rwsem in shared mode you'll get the scalibility of the
> > dioread_nolock case while no having to do these crazy dances, and you
> > can also massively simplify the codebase.  Similarly you can demote it
> > from exclusive to shared after allocating blocks in the write path,
> > and you'll end up with something way easier to understand.
> 
> Unfortunately, in order to do this we need to extend the
> dioread_nolock handling for sub-page block sizes.  (This is where we
> insert the allocated blocks into the extent maps marked uninitialized,
> and only converting the extent from uninitialized to initialized ---
> which today only works when the page size == block size.)
> 
> This is on my todo list, but half of the problem is the mess caused by
> needing to iterate over the circularly linked buffer heads when there
> are multiple buffer heads covering the page.  I was originally
> assuming that it would be easier to fix this after doing the bh ->
> iomap conversion, but it's in a while before I looked into this
> particular change.  I can try to take a closer look again....
> 
> The main reason why I looked into this hack --- and I will be the
> first to agree it was a hack, is because I had a request to support
> the dioread_nolock scalability on a Little Endian PowerPC system which
> has 64k page sizes.
> 
> > Sorry for the rant, but I just had to dig into this code when looking
> > at converting ext4 to the new DAX path, and my eyes still bleed..
> 
> Yeah, I know, and I'm sorry.  There's quite a bit of technical debt
> there, which I do want to clean up.

So I think what Christoph meant in this case is something like attached
patch. That achieves more than your dirty hack in a much cleaner way.
Beware, the patch is only compile-tested.

Then there is the case of unlocked direct IO overwrites which we allow to
run without inode_lock in dioread_nolock mode as well and that is more
difficult to resolve (there lay the problems with blocksize < pagesize you
speak about).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-ext4-Allow-parallel-DIO-reads.patch --]
[-- Type: text/x-patch, Size: 2655 bytes --]

>From 7de4ca30e0c897bbdd49eae0fdec5132744c105a Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 22 Sep 2016 14:20:15 +0200
Subject: [PATCH] ext4: Allow parallel DIO reads

We can easily support parallel direct IO reads. We only have to make
sure we cannot expose uninitialized data by reading allocated block to
which data was not written yet, or which was already truncated. That is
easily achieved by holding inode_lock in shared mode - that excludes all
writes, truncates, hole punches. We also have to guard against page
writeback allocating blocks for delay-allocated pages - that race is
handled by the fact that we writeback all the pages in the affected
range and the lock protects us from new pages being created there.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 40 ++++++++++++++++++----------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c6ea25a190f8..0af52f012bfb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3526,35 +3526,31 @@ out:
 
 static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
 {
-	int unlocked = 0;
-	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	struct inode *inode = mapping->host;
 	ssize_t ret;
 
-	if (ext4_should_dioread_nolock(inode)) {
-		/*
-		 * Nolock dioread optimization may be dynamically disabled
-		 * via ext4_inode_block_unlocked_dio(). Check inode's state
-		 * while holding extra i_dio_count ref.
-		 */
-		inode_dio_begin(inode);
-		smp_mb();
-		if (unlikely(ext4_test_inode_state(inode,
-						    EXT4_STATE_DIOREAD_LOCK)))
-			inode_dio_end(inode);
-		else
-			unlocked = 1;
-	}
+	/*
+	 * Shared inode_lock is enough for us - it protects against concurrent
+	 * writes & truncates and since we take care of writing back page cache,
+	 * we are protected against page writeback as well.
+	 */
+	inode_lock_shared(inode);
 	if (IS_DAX(inode)) {
-		ret = dax_do_io(iocb, inode, iter, ext4_dio_get_block,
-				NULL, unlocked ? 0 : DIO_LOCKING);
+		ret = dax_do_io(iocb, inode, iter, ext4_dio_get_block, NULL, 0);
 	} else {
+		size_t count = iov_iter_count(iter);
+
+		ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
+						   iocb->ki_pos + count);
+		if (ret)
+			goto out_unlock;
 		ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
 					   iter, ext4_dio_get_block,
-					   NULL, NULL,
-					   unlocked ? 0 : DIO_LOCKING);
+					   NULL, NULL, 0);
 	}
-	if (unlocked)
-		inode_dio_end(inode);
+out_unlock:
+	inode_unlock_shared(inode);
 	return ret;
 }
 
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: optimize ext4 direct I/O locking for reading
  2016-09-22 12:31     ` Jan Kara
@ 2016-09-22 13:18       ` Christoph Hellwig
  2016-09-22 13:30       ` Theodore Ts'o
  1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2016-09-22 13:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Christoph Hellwig, Ext4 Developers List,
	linux-fsdevel

On Thu, Sep 22, 2016 at 02:31:43PM +0200, Jan Kara wrote:
> So I think what Christoph meant in this case is something like attached
> patch. That achieves more than your dirty hack in a much cleaner way.
> Beware, the patch is only compile-tested.

Yes, that's exactly what I had in mind.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: optimize ext4 direct I/O locking for reading
  2016-09-22 12:31     ` Jan Kara
  2016-09-22 13:18       ` Christoph Hellwig
@ 2016-09-22 13:30       ` Theodore Ts'o
  2016-09-30  5:22         ` Theodore Ts'o
  1 sibling, 1 reply; 7+ messages in thread
From: Theodore Ts'o @ 2016-09-22 13:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, Ext4 Developers List, linux-fsdevel

On Thu, Sep 22, 2016 at 02:31:43PM +0200, Jan Kara wrote:
> 
> So I think what Christoph meant in this case is something like attached
> patch. That achieves more than your dirty hack in a much cleaner way.
> Beware, the patch is only compile-tested.

Your patch also disables dioread_nolock (which is what I think
Christoph was asking about because it's the rest of the dioread nolock
support code which causes the eye-bleeding complexity on the write
path).

> Then there is the case of unlocked direct IO overwrites which we allow to
> run without inode_lock in dioread_nolock mode as well and that is more
> difficult to resolve (there lay the problems with blocksize < pagesize you
> speak about).

Right, by disabling dioread_nolock, it means we lose the feature that
dioread_nolock doesn't require blocking versus _any_ direct I/O writes
(because of the post-write uninit->init conversion) --- not just DIO
overwrites.

But we should be able to support dioread_nolock as well as by only
taking inode_lock_shared() in the non-dioread_nolock case, I think.

Thanks for the prototype patch; I agree it's a cleaner way to go.

	       	      	    		      - Ted

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: optimize ext4 direct I/O locking for reading
  2016-09-22 13:30       ` Theodore Ts'o
@ 2016-09-30  5:22         ` Theodore Ts'o
  2016-10-03  7:41           ` Jan Kara
  0 siblings, 1 reply; 7+ messages in thread
From: Theodore Ts'o @ 2016-09-30  5:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, Ext4 Developers List, linux-fsdevel

I've been looking at your patch and testing it, and it looks like
works fine as-is.

The other thing I like about this patch is it allows us to drop
ext4_inode_{block,resume}_unlocked_dio() and EXT4_STATE_DIOREAD_LOCK.

Thanks!!

					- Ted

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: optimize ext4 direct I/O locking for reading
  2016-09-30  5:22         ` Theodore Ts'o
@ 2016-10-03  7:41           ` Jan Kara
  0 siblings, 0 replies; 7+ messages in thread
From: Jan Kara @ 2016-10-03  7:41 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Christoph Hellwig, Ext4 Developers List, linux-fsdevel

On Fri 30-09-16 01:22:33, Ted Tso wrote:
> I've been looking at your patch and testing it, and it looks like
> works fine as-is.

Glad to hear that!

> The other thing I like about this patch is it allows us to drop
> ext4_inode_{block,resume}_unlocked_dio() and EXT4_STATE_DIOREAD_LOCK.

Well, not completely - we still need to deal with unlocked writes that
happen in ext4_direct_IO_write() when overwrite == 1. But that should be
doable in a similar way. We could just demote inode_lock to a shared mode
instead of dropping it in ext4_direct_IO_write() and then we could drop the
bits you mention above. I can look into that when I'll be looking into
converting ext4 into the new iomap infrastructure but that's definitely
material for the next merge window.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-10-03  7:41 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20160921052744.5223-1-tytso@mit.edu>
2016-09-21 13:26 ` [PATCH] ext4: optimize ext4 direct I/O locking for reading Christoph Hellwig
2016-09-21 14:37   ` Theodore Ts'o
2016-09-22 12:31     ` Jan Kara
2016-09-22 13:18       ` Christoph Hellwig
2016-09-22 13:30       ` Theodore Ts'o
2016-09-30  5:22         ` Theodore Ts'o
2016-10-03  7:41           ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).