linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] pNFS: Avoid read-modify-write for page-aligned full page write
@ 2019-02-07  8:12 Kazuo Ito
  2019-02-07 13:37 ` Benjamin Coddington
  0 siblings, 1 reply; 5+ messages in thread
From: Kazuo Ito @ 2019-02-07  8:12 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker
  Cc: linux-nfs, Ryusuke Konishi, watanabe.hiroyuki

As the block and SCSI layouts can only read/write fixed-length
blocks, we must perform read-modify-write when data to be written is
not aligned to a block boundary or smaller than the block size.
(612aa983a0410 pnfs: add flag to force read-modify-write in ->write_begin)

The current code tries to see if we have to do read-modify-write
on block-oriented pNFS layouts by just checking !PageUptodate(page),
but the same condition also applies for overwriting of any uncached
potions of existing files, making such operations excessively slow
even if it is block-aligned.

The change does not affect the optimization for modify-write-read
cases (38c73044f5f4d NFS: read-modify-write page updating),
because partial update of !PageUptodate() pages can only happen
in layouts that can do arbitrary length read/write and never
in block-based ones.

Testing results:

We ran fio on one of the pNFS clients running 4.20 kernel
(vanilla and patched) in this configuration to read/write/overwrite
files on the storage array, exported as pnfs share by the server.

  pNFS clients ---1G Ethernet--- pNFS server
  (HP DL360 G8)                  (HP DL360 G8)
        |                              |
        |                              |
        +------8G Fiber Channel--------+
                      |
                Storage Array
                  (HP P6350)

Throughput of overwrite (both buffered and O_SYNC) is noticeably
improved.

Ops.     |block size|   Throughput   |
          |  (KiB)   |    (MiB/s)     |
          |          |  4.20 | patched|
---------+----------+----------------+
buffered |         4|  21.3 |  233   |
overwrite|        32|  22.2 |  254   |
          |       512|  22.4 |  258   |
---------+----------+----------------+
O_SYNC   |         4|   3.84|    4.68|
overwrite|        32|  12.2 |   32.9 |
          |       512|  18.5 |  151   |
---------+----------+----------------+

Read and write (buffered and O_SYNC) by the same client remain unchanged
by the patch either negatively or positively, as they should do.

Ops.     |block size|   Throughput   |
          |  (KiB)   |    (MiB/s)     |
          |          |  4.20 | patched|
---------+----------+----------------+
read     |         4| 548   |  551   |
          |        32| 547   |  552   |
          |       512| 548   |  549   |
---------+----------+----------------+
buffered |         4| 237   |  243   |
write    |        32| 261   |  267   |
          |       512| 265   |  271   |
---------+----------+----------------+
O_SYNC   |         4|   0.46|    0.46|
write    |        32|   3.60|    3.61|
          |       512| 105   |  108   |
---------+----------+----------------+

Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>
Tested-by: Hiroyuki Watabane <watanabe.hiroyuki@lab.ntt.co.jp>

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 29553fdba8af..e80954c96ec1 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -276,6 +276,12 @@ EXPORT_SYMBOL_GPL(nfs_file_fsync);
   * then a modify/write/read cycle when writing to a page in the
   * page cache.
   *
+ * Some pNFS layout drivers can only read/write at a certain block
+ * granularity like all block devices and therefore we must perform
+ * read/modify/write whenever a page hasn't read yet and the data
+ * to be written there is not aligned to a block boundary and/or
+ * smaller than the block size.
+ *
   * The modify/write/read cycle may occur if a page is read before
   * being completely filled by the writer.  In this situation, the
   * page must be completely written to stable storage on the server
@@ -299,8 +305,10 @@ static int nfs_want_read_modify_write(struct file 
*file, struct page *page,
  	unsigned int end = offset + len;

  	if (pnfs_ld_read_whole_page(file->f_mapping->host)) {
-		if (!PageUptodate(page))
-			return 1;
+		if (!PageUptodate(page)) {
+			if (pglen && (end < pglen || offset))
+				return 1;
+		}
  		return 0;
  	}

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] pNFS: Avoid read-modify-write for page-aligned full page write
  2019-02-07  8:12 [PATCH] pNFS: Avoid read-modify-write for page-aligned full page write Kazuo Ito
@ 2019-02-07 13:37 ` Benjamin Coddington
  2019-02-08  7:54   ` 伊藤和夫
  0 siblings, 1 reply; 5+ messages in thread
From: Benjamin Coddington @ 2019-02-07 13:37 UTC (permalink / raw)
  To: Kazuo Ito
  Cc: Trond Myklebust, Anna Schumaker, linux-nfs, Ryusuke Konishi,
	watanabe.hiroyuki

On 7 Feb 2019, at 3:12, Kazuo Ito wrote:

> As the block and SCSI layouts can only read/write fixed-length
> blocks, we must perform read-modify-write when data to be written is
> not aligned to a block boundary or smaller than the block size.
> (612aa983a0410 pnfs: add flag to force read-modify-write in 
> ->write_begin)
>
> The current code tries to see if we have to do read-modify-write
> on block-oriented pNFS layouts by just checking !PageUptodate(page),
> but the same condition also applies for overwriting of any uncached
> potions of existing files, making such operations excessively slow
> even if it is block-aligned.
>
> The change does not affect the optimization for modify-write-read
> cases (38c73044f5f4d NFS: read-modify-write page updating),
> because partial update of !PageUptodate() pages can only happen
> in layouts that can do arbitrary length read/write and never
> in block-based ones.
>
> Testing results:
>
> We ran fio on one of the pNFS clients running 4.20 kernel
> (vanilla and patched) in this configuration to read/write/overwrite
> files on the storage array, exported as pnfs share by the server.
>
>  pNFS clients ---1G Ethernet--- pNFS server
>  (HP DL360 G8)                  (HP DL360 G8)
>        |                              |
>        |                              |
>        +------8G Fiber Channel--------+
>                      |
>                Storage Array
>                  (HP P6350)
>
> Throughput of overwrite (both buffered and O_SYNC) is noticeably
> improved.
>
> Ops.     |block size|   Throughput   |
>          |  (KiB)   |    (MiB/s)     |
>          |          |  4.20 | patched|
> ---------+----------+----------------+
> buffered |         4|  21.3 |  233   |
> overwrite|        32|  22.2 |  254   |
>          |       512|  22.4 |  258   |
> ---------+----------+----------------+
> O_SYNC   |         4|   3.84|    4.68|
> overwrite|        32|  12.2 |   32.9 |
>          |       512|  18.5 |  151   |
> ---------+----------+----------------+
>
> Read and write (buffered and O_SYNC) by the same client remain 
> unchanged
> by the patch either negatively or positively, as they should do.
>
> Ops.     |block size|   Throughput   |
>          |  (KiB)   |    (MiB/s)     |
>          |          |  4.20 | patched|
> ---------+----------+----------------+
> read     |         4| 548   |  551   |
>          |        32| 547   |  552   |
>          |       512| 548   |  549   |
> ---------+----------+----------------+
> buffered |         4| 237   |  243   |
> write    |        32| 261   |  267   |
>          |       512| 265   |  271   |
> ---------+----------+----------------+
> O_SYNC   |         4|   0.46|    0.46|
> write    |        32|   3.60|    3.61|
>          |       512| 105   |  108   |
> ---------+----------+----------------+
>
> Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>
> Tested-by: Hiroyuki Watabane <watanabe.hiroyuki@lab.ntt.co.jp>
>
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 29553fdba8af..e80954c96ec1 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -276,6 +276,12 @@ EXPORT_SYMBOL_GPL(nfs_file_fsync);
>   * then a modify/write/read cycle when writing to a page in the
>   * page cache.
>   *
> + * Some pNFS layout drivers can only read/write at a certain block
> + * granularity like all block devices and therefore we must perform
> + * read/modify/write whenever a page hasn't read yet and the data
> + * to be written there is not aligned to a block boundary and/or
> + * smaller than the block size.
> + *
>   * The modify/write/read cycle may occur if a page is read before
>   * being completely filled by the writer.  In this situation, the
>   * page must be completely written to stable storage on the server
> @@ -299,8 +305,10 @@ static int nfs_want_read_modify_write(struct file 
> *file, struct page *page,
>  	unsigned int end = offset + len;
>
>  	if (pnfs_ld_read_whole_page(file->f_mapping->host)) {
> -		if (!PageUptodate(page))
> -			return 1;
> +		if (!PageUptodate(page)) {
> +			if (pglen && (end < pglen || offset))
> +				return 1;
> +		}
>  		return 0;
>  	}

This looks right.  I think that a static inline bool 
nfs_write_covers_page,
or full_page_write or similar might make sense here, as we do the same 
test
just below, and would make the code easier to quickly understand.

Reviewed-by: Benjamin Coddington <bcodding@redhat.com>

Ben

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] pNFS: Avoid read-modify-write for page-aligned full page write
  2019-02-07 13:37 ` Benjamin Coddington
@ 2019-02-08  7:54   ` 伊藤和夫
  2019-02-08 14:58     ` Trond Myklebust
  0 siblings, 1 reply; 5+ messages in thread
From: 伊藤和夫 @ 2019-02-08  7:54 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: Trond Myklebust, Anna Schumaker, linux-nfs, Ryusuke Konishi,
	watanabe.hiroyuki


On 2019/02/07 22:37, Benjamin Coddington wrote:
> On 7 Feb 2019, at 3:12, Kazuo Ito wrote:
> [snipped]
>> @@ -299,8 +305,10 @@ static int nfs_want_read_modify_write(struct file 
>> *file, struct page *page,
>>      unsigned int end = offset + len;
>>
>>      if (pnfs_ld_read_whole_page(file->f_mapping->host)) {
>> -        if (!PageUptodate(page))
>> -            return 1;
>> +        if (!PageUptodate(page)) {
>> +            if (pglen && (end < pglen || offset))
>> +                return 1;
>> +        }
>>          return 0;
>>      }
> 
> This looks right.  I think that a static inline bool nfs_write_covers_page,
> or full_page_write or similar might make sense here, as we do the same test
> just below, and would make the code easier to quickly understand.
> 
> Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
 >
 > Ben

As per Ben's comment, I made the check for full page write a static
inline function and both the block-oriented and the non-block-
oriented paths call it.

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 29553fdba8af..458c77ccf274 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -276,6 +276,12 @@ EXPORT_SYMBOL_GPL(nfs_file_fsync);
   * then a modify/write/read cycle when writing to a page in the
   * page cache.
   *
+ * Some pNFS layout drivers can only read/write at a certain block
+ * granularity like all block devices and therefore we must perform
+ * read/modify/write whenever a page hasn't read yet and the data
+ * to be written there is not aligned to a block boundary and/or
+ * smaller than the block size.
+ *
   * The modify/write/read cycle may occur if a page is read before
   * being completely filled by the writer.  In this situation, the
   * page must be completely written to stable storage on the server
@@ -291,15 +297,23 @@ EXPORT_SYMBOL_GPL(nfs_file_fsync);
   * and that the new data won't completely replace the old data in
   * that range of the file.
   */
-static int nfs_want_read_modify_write(struct file *file, struct page *page,
-			loff_t pos, unsigned len)
+static bool nfs_full_page_write(struct page *page, loff_t pos, unsigned 
len)
  {
  	unsigned int pglen = nfs_page_length(page);
  	unsigned int offset = pos & (PAGE_SIZE - 1);
  	unsigned int end = offset + len;

+	if (pglen && ((end < pglen) || offset))
+	    return 0;
+	return 1;
+}
+
+static int nfs_want_read_modify_write(struct file *file, struct page *page,
+			loff_t pos, unsigned len)
+{
  	if (pnfs_ld_read_whole_page(file->f_mapping->host)) {
-		if (!PageUptodate(page))
+		if (!PageUptodate(page) &&
+		    !nfs_full_page_write(page, pos, len))
  			return 1;
  		return 0;
  	}
@@ -307,8 +321,7 @@ static int nfs_want_read_modify_write(struct file 
*file, struct page *page,
  	if ((file->f_mode & FMODE_READ) &&	/* open for read? */
  	    !PageUptodate(page) &&		/* Uptodate? */
  	    !PagePrivate(page) &&		/* i/o request already? */
-	    pglen &&				/* valid bytes of file? */
-	    (end < pglen || offset))		/* replace all valid bytes? */
+	    !nfs_full_page_write(page, pos, len))
  		return 1;
  	return 0;
  }

Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] pNFS: Avoid read-modify-write for page-aligned full page write
  2019-02-08  7:54   ` 伊藤和夫
@ 2019-02-08 14:58     ` Trond Myklebust
  2019-02-12  4:34       ` Kazuo Ito
  0 siblings, 1 reply; 5+ messages in thread
From: Trond Myklebust @ 2019-02-08 14:58 UTC (permalink / raw)
  To: 伊藤和夫, Benjamin Coddington
  Cc: Anna Schumaker, linux-nfs, Ryusuke Konishi, watanabe.hiroyuki

On Fri, 2019-02-08 at 16:54 +0900, 伊藤和夫 wrote:
> On 2019/02/07 22:37, Benjamin Coddington wrote:
> > On 7 Feb 2019, at 3:12, Kazuo Ito wrote:
> > [snipped]
> > > @@ -299,8 +305,10 @@ static int nfs_want_read_modify_write(struct
> > > file 
> > > *file, struct page *page,
> > >      unsigned int end = offset + len;
> > > 
> > >      if (pnfs_ld_read_whole_page(file->f_mapping->host)) {
> > > -        if (!PageUptodate(page))
> > > -            return 1;
> > > +        if (!PageUptodate(page)) {
> > > +            if (pglen && (end < pglen || offset))
> > > +                return 1;
> > > +        }
> > >          return 0;
> > >      }
> > 
> > This looks right.  I think that a static inline bool
> > nfs_write_covers_page,
> > or full_page_write or similar might make sense here, as we do the
> > same test
> > just below, and would make the code easier to quickly understand.
> > 
> > Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
>  >
>  > Ben
> 
> As per Ben's comment, I made the check for full page write a static
> inline function and both the block-oriented and the non-block-
> oriented paths call it.
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 29553fdba8af..458c77ccf274 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -276,6 +276,12 @@ EXPORT_SYMBOL_GPL(nfs_file_fsync);
>    * then a modify/write/read cycle when writing to a page in the
>    * page cache.
>    *
> + * Some pNFS layout drivers can only read/write at a certain block
> + * granularity like all block devices and therefore we must perform
> + * read/modify/write whenever a page hasn't read yet and the data
> + * to be written there is not aligned to a block boundary and/or
> + * smaller than the block size.
> + *
>    * The modify/write/read cycle may occur if a page is read before
>    * being completely filled by the writer.  In this situation, the
>    * page must be completely written to stable storage on the server
> @@ -291,15 +297,23 @@ EXPORT_SYMBOL_GPL(nfs_file_fsync);
>    * and that the new data won't completely replace the old data in
>    * that range of the file.
>    */
> -static int nfs_want_read_modify_write(struct file *file, struct page
> *page,
> -			loff_t pos, unsigned len)
> +static bool nfs_full_page_write(struct page *page, loff_t pos,
> unsigned 
> len)
>   {
>   	unsigned int pglen = nfs_page_length(page);
>   	unsigned int offset = pos & (PAGE_SIZE - 1);
>   	unsigned int end = offset + len;
> 
> +	if (pglen && ((end < pglen) || offset))
> +	    return 0;
> +	return 1;
> +}
> +
> +static int nfs_want_read_modify_write(struct file *file, struct page
> *page,
> +			loff_t pos, unsigned len)
> +{
>   	if (pnfs_ld_read_whole_page(file->f_mapping->host)) {
> -		if (!PageUptodate(page))
> +		if (!PageUptodate(page) &&
> +		    !nfs_full_page_write(page, pos, len))
>   			return 1;
>   		return 0;
>   	}
> @@ -307,8 +321,7 @@ static int nfs_want_read_modify_write(struct
> file 
> *file, struct page *page,
>   	if ((file->f_mode & FMODE_READ) &&	/* open for read? */
>   	    !PageUptodate(page) &&		/* Uptodate? */
>   	    !PagePrivate(page) &&		/* i/o request already? */
> -	    pglen &&				/* valid bytes of
> file? */
> -	    (end < pglen || offset))		/* replace all valid
> bytes? */
> +	    !nfs_full_page_write(page, pos, len))
>   		return 1;
>   	return 0;
>   }

How about adding a separate

if (PageUptodate(page) || nfs_full_page_write())
    return 0;

before the check for pNFS?

That means we won't have to duplicate those for the pNFS block and
ordinary case, and it improves code clarity.

BTW: Why doesn't the pNFS case check for PagePrivate(page)? That looks
like a bug which would cause the existing write to get corrupted.
If so, we should move that check too into the common code.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] pNFS: Avoid read-modify-write for page-aligned full page write
  2019-02-08 14:58     ` Trond Myklebust
@ 2019-02-12  4:34       ` Kazuo Ito
  0 siblings, 0 replies; 5+ messages in thread
From: Kazuo Ito @ 2019-02-12  4:34 UTC (permalink / raw)
  To: Trond Myklebust, Benjamin Coddington
  Cc: Anna Schumaker, linux-nfs, Ryusuke Konishi, watanabe.hiroyuki

On 2019/02/08 23:58, Trond Myklebust wrote:
> On Fri, 2019-02-08 at 16:54 +0900, 伊藤和夫 wrote:
>> On 2019/02/07 22:37, Benjamin Coddington wrote:
>>> On 7 Feb 2019, at 3:12, Kazuo Ito wrote:
>>> [snipped]
>>>> @@ -299,8 +305,10 @@ static int nfs_want_read_modify_write(struct
>>>> file
>>>> *file, struct page *page,
>>>>       unsigned int end = offset + len;
>>>>
>>>>       if (pnfs_ld_read_whole_page(file->f_mapping->host)) {
>>>> -        if (!PageUptodate(page))
>>>> -            return 1;
>>>> +        if (!PageUptodate(page)) {
>>>> +            if (pglen && (end < pglen || offset))
>>>> +                return 1;
>>>> +        }
>>>>           return 0;
>>>>       }
>>>
>>> This looks right.  I think that a static inline bool
>>> nfs_write_covers_page,
>>> or full_page_write or similar might make sense here, as we do the
>>> same test
>>> just below, and would make the code easier to quickly understand.
>>>
>>> Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
>>   >
>>   > Ben
>>
>> As per Ben's comment, I made the check for full page write a static
>> inline function and both the block-oriented and the non-block-
>> oriented paths call it.
>>
>> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> index 29553fdba8af..458c77ccf274 100644
>> --- a/fs/nfs/file.c
>> +++ b/fs/nfs/file.c
>> @@ -276,6 +276,12 @@ EXPORT_SYMBOL_GPL(nfs_file_fsync);
>>     * then a modify/write/read cycle when writing to a page in the
>>     * page cache.
>>     *
>> + * Some pNFS layout drivers can only read/write at a certain block
>> + * granularity like all block devices and therefore we must perform
>> + * read/modify/write whenever a page hasn't read yet and the data
>> + * to be written there is not aligned to a block boundary and/or
>> + * smaller than the block size.
>> + *
>>     * The modify/write/read cycle may occur if a page is read before
>>     * being completely filled by the writer.  In this situation, the
>>     * page must be completely written to stable storage on the server
>> @@ -291,15 +297,23 @@ EXPORT_SYMBOL_GPL(nfs_file_fsync);
>>     * and that the new data won't completely replace the old data in
>>     * that range of the file.
>>     */
>> -static int nfs_want_read_modify_write(struct file *file, struct page
>> *page,
>> -			loff_t pos, unsigned len)
>> +static bool nfs_full_page_write(struct page *page, loff_t pos,
>> unsigned
>> len)
>>    {
>>    	unsigned int pglen = nfs_page_length(page);
>>    	unsigned int offset = pos & (PAGE_SIZE - 1);
>>    	unsigned int end = offset + len;
>>
>> +	if (pglen && ((end < pglen) || offset))
>> +	    return 0;
>> +	return 1;
>> +}
>> +
>> +static int nfs_want_read_modify_write(struct file *file, struct page
>> *page,
>> +			loff_t pos, unsigned len)
>> +{
>>    	if (pnfs_ld_read_whole_page(file->f_mapping->host)) {
>> -		if (!PageUptodate(page))
>> +		if (!PageUptodate(page) &&
>> +		    !nfs_full_page_write(page, pos, len))
>>    			return 1;
>>    		return 0;
>>    	}
>> @@ -307,8 +321,7 @@ static int nfs_want_read_modify_write(struct
>> file
>> *file, struct page *page,
>>    	if ((file->f_mode & FMODE_READ) &&	/* open for read? */
>>    	    !PageUptodate(page) &&		/* Uptodate? */
>>    	    !PagePrivate(page) &&		/* i/o request already? */
>> -	    pglen &&				/* valid bytes of
>> file? */
>> -	    (end < pglen || offset))		/* replace all valid
>> bytes? */
>> +	    !nfs_full_page_write(page, pos, len))
>>    		return 1;
>>    	return 0;
>>    }
> 
> How about adding a separate
> 
> if (PageUptodate(page) || nfs_full_page_write())
>      return 0;
> 
> before the check for pNFS?
 >
 > That means we won't have to duplicate those for the pNFS block and
 > ordinary case, and it improves code clarity.

Yes, it is much better, and

> BTW: Why doesn't the pNFS case check for PagePrivate(page)? That looks
> like a bug which would cause the existing write to get corrupted.
> If so, we should move that check too into the common code.

It's been that way since the check for
pnfs_ld_read_whole_page(file->f_mapping->host) was added there.
As you pointed out, it shouldn't try to initiate read when there's
an outstanding write.
So, I'll update the patch with these changes, including check for
ongoing I/O, and come up with newer test results in a couple of days.
--
kazuo ito (ito_kazuo_g3@iecl.ntt.co.jp)
NTT OSS Center

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-02-12  4:34 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-07  8:12 [PATCH] pNFS: Avoid read-modify-write for page-aligned full page write Kazuo Ito
2019-02-07 13:37 ` Benjamin Coddington
2019-02-08  7:54   ` 伊藤和夫
2019-02-08 14:58     ` Trond Myklebust
2019-02-12  4:34       ` Kazuo Ito

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).