All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Benjamin Coddington" <bcodding@redhat.com>
To: "Trond Myklebust" <trondmy@primarydata.com>
Cc: "hch@infradead.org" <hch@infradead.org>,
	"List Linux NFS Mailing" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v4 24/28] Getattr doesn't require data sync semantics
Date: Tue, 26 Jul 2016 13:57:40 -0400	[thread overview]
Message-ID: <4CA4361B-2501-48AA-8B2C-715E5A3FD35C@redhat.com> (raw)
In-Reply-To: <3448E968-3564-43C1-ADBB-A16C9687B0F3@primarydata.com>

On 26 Jul 2016, at 12:35, Trond Myklebust wrote:

>> On Jul 26, 2016, at 12:32, Benjamin Coddington <bcodding@redhat.com> 
>> wrote:
>>
>> On 25 Jul 2016, at 14:41, Benjamin Coddington wrote:
>>
>>> On 25 Jul 2016, at 14:34, Trond Myklebust wrote:
>>>
>>>>> On Jul 25, 2016, at 14:26, Benjamin Coddington 
>>>>> <bcodding@redhat.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 25 Jul 2016, at 12:39, Trond Myklebust wrote:
>>>>>
>>>>>>> On Jul 25, 2016, at 12:26, Benjamin Coddington 
>>>>>>> <bcodding@redhat.com> wrote:
>>>>>>>
>>>>>>> On 21 Jul 2016, at 9:20, Trond Myklebust wrote:
>>>>>>>
>>>>>>>>> On Jul 21, 2016, at 09:05, Benjamin Coddington 
>>>>>>>>> <bcodding@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> So back to Christoph's point earlier:
>>>>>>>>>
>>>>>>>>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote:
>>>>>>>>>> This one breaks xfstests generic/207 on block/scsi layout for 
>>>>>>>>>> me.  The
>>>>>>>>>> reason for that is that we need a layoutcommit after writing 
>>>>>>>>>> out all
>>>>>>>>>> data for the file for the file size to be updated on the 
>>>>>>>>>> server.
>>>>>>>>>
>>>>>>>>> You responded:
>>>>>>>>>
>>>>>>>>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote:
>>>>>>>>>> I’m not understanding this argument. Why do we care if the 
>>>>>>>>>> file size is up
>>>>>>>>>> to date on the server if we’re not sending an actual 
>>>>>>>>>> GETATTR on the wire
>>>>>>>>>> to retrieve the file size?
>>>>>>>>>
>>>>>>>>> I guess the answer might be because we can get it back from 
>>>>>>>>> the last
>>>>>>>>> LAYOUTCOMMIT.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The patch that I followed up with should now ensure that we do 
>>>>>>>> not mark the attribute cache as up to date if there is a 
>>>>>>>> LAYOUTCOMMIT pending.
>>>>>>>> IOW: when the pNFS write is done, it is expected to do 2 
>>>>>>>> things:
>>>>>>>>
>>>>>>>> 1) mark the inode for LAYOUTCOMMIT
>>>>>>>> 2) mark the attribute cache as invalid (because we know the 
>>>>>>>> change attribute, mtime, ctime need to be updates)
>>>>>>>>
>>>>>>>> In the case of blocks pNFS write:
>>>>>>>> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() 
>>>>>>>> should take care of (1)
>>>>>>>> The call to nfs_writeback_update_inode() in 
>>>>>>>> nfs4_write_done_cb() should take care of (2).
>>>>>>>>
>>>>>>>> Provided that these 2 calls are performed in the above order, 
>>>>>>>> then any call to nfs_getattr() which has not been preceded by a 
>>>>>>>> call to nfs4_proc_layoutcommit() should trigger the call to 
>>>>>>>> __nfs_revalidate_inode().
>>>>>>>
>>>>>>> I think the problem is that a following nfs_getattr() will fail 
>>>>>>> to notice
>>>>>>> the size change in the case of a write_completion and 
>>>>>>> layoutcommit occuring
>>>>>>> after nfs_getattr() has done pnfs_sync_inode() but before it has 
>>>>>>> done
>>>>>>> nfs_update_inode().
>>>>>>>
>>>>>>> In the failing case there are two threads one is doing writes, 
>>>>>>> the other
>>>>>>> doing lstat on aio_complete via io_getevents(2).
>>>>>>>
>>>>>>> For each write completion the lstat thread tries to verify the 
>>>>>>> file size.
>>>>>>>
>>>>>>> GETATTR Thread                  LAYOUTCOMMIT Thread
>>>>>>> --------------                  --------------------
>>>>>>>                              write_completion sets LAYOUTCOMMIT 
>>>>>>> (4096@0)
>>>>>>> --> nfs_getattr
>>>>>>
>>>>>> filemap_write_and_wait()
>>>>>>
>>>>>>> __nfs_revalidate_inode
>>>>>>> pnfs_sync_inode
>>>>>>
>>>>>> NFS_PROTO(inode)->getattr()
>>>>>>
>>>>>>> getattr sees 4096
>>>>>>>                              write_completion sets LAYOUTCOMMIT 
>>>>>>> (4096@4096)
>>>>>>>                              sets LAYOUTCOMMITING
>>>>>>>                              clears LAYOUTCOMMIT
>>>>>>>                              clears LAYOUTCOMMITTING
>>>>>>> nfs_refresh_inode
>>>>>>> nfs_update_inode size is 4096
>>>>>>> <-- nfs_getattr
>>>>>>>
>>>>>>> At this point the cached attributes are seen as up to date, but
>>>>>>> aio-dio-extend-stat program expects that second write_completion 
>>>>>>> to reflect
>>>>>>> in the file size.
>>>>>>>
>>>>>>
>>>>>> Why isn’t the filemap_write_and_wait() above resolving the 
>>>>>> race? I’d
>>>>>> expect that would move your “write completion sets 
>>>>>> LAYOUTCOMMIT” up to
>>>>>> before the pnfs_sync_inode().  In fact, in the patch that 
>>>>>> Christoph sent,
>>>>>> all he was doing was moving the pnfs_sync_inode() to immediately 
>>>>>> after
>>>>>> that filemap_write_and_wait() instead of relying on it in
>>>>>> _nfs_revalidate_inode.
>>>>>
>>>>> This is O_DIRECT, I've failed to mention yet.  The second write 
>>>>> hasn't made
>>>>> it out of __nfs_pageio_add_request() at the time 
>>>>> filemap_write_and_wait() is
>>>>> called.  It is sleeping in pnfs_update_layout() waiting on a 
>>>>> LAYOUTGET and it
>>>>> doesn't resumes until after filemap_write_and_wait().
>>>>
>>>> Wait, so you have 1 thread doing an O_DIRECT write() and another 
>>>> doing a
>>>> stat() in parallel? Why would there be an expectation that the 
>>>> filesystem
>>>> should serialise those system calls?
>>>
>>> Not exactly parallel, but synchronized on aio_complete.  A comment 
>>> in
>>> generic/207's src/aio-dio-regress/aio-dio-extend-stat.c:
>>>
>>> 36 /*
>>> 37  * This was originally submitted to
>>> 38  * http://bugzilla.kernel.org/show_bug.cgi?id=6831 by
>>> 39  * Rafal Wijata <wijata@nec-labs.com>.  It caught a race in dio 
>>> aio completion
>>> 40  * that would call aio_complete() before the dio callers would 
>>> update i_size.
>>> 41  * A stat after io_getevents() would not see the new file size.
>>> 42  *
>>> 43  * The bug was fixed in the fs/direct-io.c completion reworking 
>>> that appeared
>>> 44  * in 2.6.20.  This test should fail on 2.6.19.
>>> 45  */
>>>
>>> As far as I can see, this check is the whole point of generic/207..
>>
>> This would fix it up:
>>
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index f108d58101f8..823700f827b6 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -661,6 +661,7 @@ int nfs_getattr(struct vfsmount *mnt, struct 
>> dentry
>> *dentry, struct kstat *stat)
>>
>>        trace_nfs_getattr_enter(inode);
>>        /* Flush out writes to the server in order to update c/mtime.  
>> */
>> +       nfs_start_io_read(inode);
>>        if (S_ISREG(inode->i_mode)) {
>>                err = filemap_write_and_wait(inode->i_mapping);
>>                if (err)
>> @@ -694,6 +695,7 @@ int nfs_getattr(struct vfsmount *mnt, struct 
>> dentry
>> *dentry, struct kstat *stat)
>>                        stat->blksize = NFS_SERVER(inode)->dtsize;
>>        }
>> out:
>> +       nfs_end_io_read(inode);
>>        trace_nfs_getattr_exit(inode, err);
>>        return err;
>> }
>>
>> Trond, what do you think?  I'll take any additional silence as a sign 
>> to go
>> elsewhere.  :P
>
> No. The above locking excludes all writes as well as O_DIRECT reads… 
> That’s worse than we had before.
>
> I’d like rather to understand _why_ the aio_complete() is failing to 
> work correctly here. According to the analysis of the test case that 
> you quoted yesterday, the O_DIRECT writes should have completed before 
> we even call stat().
>
> Cheers
>   Trond

The O_DIRECT writes do complete, and every completion signals the other
thread to do stat(), but that completion does not update the size on the
server.  As we know, we need a LAYOUTCOMMIT.  After this patch, we're 
only
going to do a LAYOUTCOMMIT if nfs_need_revalidate_inode(inode).

So what happens is that the first write completes, and the first
nfs_getattr() triggers the first LAYOUTCOMMIT, and then a GETATTR.
Simultaneously, the second write is waiting on a LAYOUTGET.  The GETATTR
completes and sets the size to 4k.

Now the attributes are marked up to date with a size of 4k, and when the
second write completes, nfs_getattr() is called, 
nfs_need_revalidate_inode()
is false, and we don't bother to send another LAYOUTCOMMIT or GETATTR to
correct the cached file size.

Here's a function graph of that: 
http://people.redhat.com/bcodding/borken

We could invalidate the attribute cache every time a write completes.. 
maybe
nfs_writeback_update_inode() isn't doing the job for block layouts (are 
we
not setting res.count?  I'll look at that..)

I think we could also use the LAYOUTCOMMIT results to invalidate the 
cache,
RFC-5661 18.42.3:
    If the metadata server updates the file's size as the
    result of the LAYOUTCOMMIT operation, it must return the new size
    (locr_newsize.ns_size) as part of the results."

I'm not sure you want to bother to try to reproduce -- but if so, you 
don't
need special hardware for SCSI layout.  I wrote up a quick how-to for 
SCSI
layouts on a VM in qemu:
http://people.redhat.com/bcodding/pnfs/nfs/scsi/2016/07/13/pnfs_scsi_setup_for_VMs/

Ben

  reply	other threads:[~2016-07-26 17:56 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-06 22:29 [PATCH v4 00/28] NFS writeback performance patches for v4.8 Trond Myklebust
2016-07-06 22:29 ` [PATCH v4 01/28] NFS: Don't flush caches for a getattr that races with writeback Trond Myklebust
2016-07-06 22:29   ` [PATCH v4 02/28] NFS: Cache access checks more aggressively Trond Myklebust
2016-07-06 22:29     ` [PATCH v4 03/28] NFS: Cache aggressively when file is open for writing Trond Myklebust
2016-07-06 22:29       ` [PATCH v4 04/28] NFS: Kill NFS_INO_NFS_INO_FLUSHING: it is a performance killer Trond Myklebust
2016-07-06 22:29         ` [PATCH v4 05/28] NFS: writepage of a single page should not be synchronous Trond Myklebust
2016-07-06 22:29           ` [PATCH v4 06/28] NFS: Don't hold the inode lock across fsync() Trond Myklebust
2016-07-06 22:29             ` [PATCH v4 07/28] NFS: Don't call COMMIT in ->releasepage() Trond Myklebust
2016-07-06 22:29               ` [PATCH v4 08/28] pNFS/files: Fix layoutcommit after a commit to DS Trond Myklebust
2016-07-06 22:29                 ` [PATCH v4 09/28] pNFS/flexfiles: " Trond Myklebust
2016-07-06 22:29                   ` [PATCH v4 10/28] pNFS/flexfiles: Clean up calls to pnfs_set_layoutcommit() Trond Myklebust
2016-07-06 22:29                     ` [PATCH v4 11/28] pNFS: Files and flexfiles always need to commit before layoutcommit Trond Myklebust
2016-07-06 22:29                       ` [PATCH v4 12/28] pNFS: Ensure we layoutcommit before revalidating attributes Trond Myklebust
2016-07-06 22:29                         ` [PATCH v4 13/28] pNFS: pnfs_layoutcommit_outstanding() is no longer used when !CONFIG_NFS_V4_1 Trond Myklebust
2016-07-06 22:29                           ` [PATCH v4 14/28] NFS: Fix O_DIRECT verifier problems Trond Myklebust
2016-07-06 22:29                             ` [PATCH v4 15/28] NFS: Ensure we reset the write verifier 'committed' value on resend Trond Myklebust
2016-07-06 22:29                               ` [PATCH v4 16/28] NFS: Remove racy size manipulations in O_DIRECT Trond Myklebust
2016-07-06 22:29                                 ` [PATCH v4 17/28] NFS Cleanup: move call to generic_write_checks() into fs/nfs/direct.c Trond Myklebust
2016-07-06 22:29                                   ` [PATCH v4 18/28] NFS: Move buffered I/O locking into nfs_file_write() Trond Myklebust
2016-07-06 22:29                                     ` [PATCH v4 19/28] NFS: Do not serialise O_DIRECT reads and writes Trond Myklebust
2016-07-06 22:29                                       ` [PATCH v4 20/28] NFS: Cleanup nfs_direct_complete() Trond Myklebust
2016-07-06 22:29                                         ` [PATCH v4 21/28] NFS: Remove redundant waits for O_DIRECT in fsync() and write_begin() Trond Myklebust
2016-07-06 22:29                                           ` [PATCH v4 22/28] NFS: Remove unused function nfs_revalidate_mapping_protected() Trond Myklebust
2016-07-06 22:30                                             ` [PATCH v4 23/28] NFS: Do not aggressively cache file attributes in the case of O_DIRECT Trond Myklebust
2016-07-06 22:30                                               ` [PATCH v4 24/28] NFS: Getattr doesn't require data sync semantics Trond Myklebust
2016-07-06 22:30                                                 ` [PATCH v4 25/28] NFSv4.2: Fix a race in nfs42_proc_deallocate() Trond Myklebust
2016-07-06 22:30                                                   ` [PATCH v4 26/28] NFSv4.2: Fix writeback races in nfs4_copy_file_range Trond Myklebust
2016-07-06 22:30                                                     ` [PATCH v4 27/28] NFSv4.2: llseek(SEEK_HOLE) and llseek(SEEK_DATA) don't require data sync Trond Myklebust
2016-07-06 22:30                                                       ` [PATCH v4 28/28] NFS nfs_vm_page_mkwrite: Don't freeze me, Bro Trond Myklebust
2016-07-18  3:48                                                 ` [PATCH v4 24/28] NFS: Getattr doesn't require data sync semantics Christoph Hellwig
2016-07-18  4:32                                                   ` Trond Myklebust
2016-07-18  4:59                                                     ` Trond Myklebust
2016-07-19  3:58                                                       ` hch
2016-07-19 20:00                                                         ` [PATCH v4 24/28] " Benjamin Coddington
2016-07-19 20:06                                                           ` Trond Myklebust
2016-07-20 15:03                                                             ` Benjamin Coddington
2016-07-21  8:22                                                               ` hch
2016-07-21  8:32                                                                 ` Benjamin Coddington
2016-07-21  9:10                                                                   ` Benjamin Coddington
2016-07-21  9:52                                                                     ` Benjamin Coddington
2016-07-21 12:46                                                                       ` Trond Myklebust
2016-07-21 13:05                                                                         ` Benjamin Coddington
2016-07-21 13:20                                                                           ` Trond Myklebust
2016-07-21 14:00                                                                             ` Trond Myklebust
2016-07-21 14:02                                                                             ` Benjamin Coddington
2016-07-25 16:26                                                                             ` Benjamin Coddington
2016-07-25 16:39                                                                               ` Trond Myklebust
2016-07-25 18:26                                                                                 ` Benjamin Coddington
2016-07-25 18:34                                                                                   ` Trond Myklebust
2016-07-25 18:41                                                                                     ` Benjamin Coddington
2016-07-26 16:32                                                                                       ` Benjamin Coddington
2016-07-26 16:35                                                                                         ` Trond Myklebust
2016-07-26 17:57                                                                                           ` Benjamin Coddington [this message]
2016-07-26 18:07                                                                                             ` Trond Myklebust
2016-07-27 11:55                                                                                               ` Benjamin Coddington
2016-07-27 12:15                                                                                                 ` Trond Myklebust
2016-07-27 12:31                                                                                                   ` Trond Myklebust
2016-07-27 16:14                                                                                                     ` Benjamin Coddington
2016-07-27 18:05                                                                                                       ` Trond Myklebust
2016-07-28  9:47                                                                                                         ` Benjamin Coddington
2016-07-28 12:31                                                                                                           ` Trond Myklebust
2016-07-28 14:04                                                                                                             ` Trond Myklebust
2016-07-28 15:38                                                                                                               ` Benjamin Coddington
2016-07-28 15:39                                                                                                                 ` Trond Myklebust
2016-07-28 15:33                                                                                                             ` Benjamin Coddington
2016-07-28 15:36                                                                                                               ` Trond Myklebust
2016-07-28 16:40                                                                                                                 ` Benjamin Coddington
2016-07-28 16:41                                                                                                                   ` Trond Myklebust
2016-07-19 20:09                                                           ` Benjamin Coddington

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4CA4361B-2501-48AA-8B2C-715E5A3FD35C@redhat.com \
    --to=bcodding@redhat.com \
    --cc=hch@infradead.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@primarydata.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.