linux-kernel.vger.kernel.org archive mirror
* blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
@ 2008-09-05 19:19 Aaron Straus
  2008-09-05 19:56 ` [NFS] " Chuck Lever
  0 siblings, 1 reply; 25+ messages in thread
From: Aaron Straus @ 2008-09-05 19:19 UTC (permalink / raw)
  To: neilb, nfs, trond.myklebust, linux-kernel

Hi all,

  We're hitting some bad behavior in NFS v3.  The situation is this:

   machine A - NFS server

   machine B - NFS client (writer)
   machine C - NFS client (reader)

   (all machines x86 SMP)

  machine A exports a directory on ext3 filesystem:

	/srv/home       192.168.0.0/24(rw,sync,no_subtree_check)

  machines B and C mount that directory normally

        mount A:/srv/home /mntpnt

  machine B opens a file and writes to it (think a log file)

  machine C stats that file, opens it and reads it (think tailing the
                                                    log file)


  The issue is that machine C will often see large blocks of NULLs
(zeros) in the file.  If you do the same read again just after you see
the block of NULLs, you will see the proper data.

  Attached are two simple python programs that demonstrate the problem.

  To use them (they will write to a file called test-nfs in CWD):

 (on machine B in one window)

   python writer.py

 (on machine C in another window)

   python reader.py

  
  reader.py will die when it sees NULLs in the file.  Usually for us
this happens after about 60s (two timeouts I think).   The first NULL is
usually either at index 4000 or 8000 depending on the kernel.


  Now the version of the kernel the server is running doesn't seem to
matter.  The reader also doesn't seem to matter (though I didn't test
this completely).  The writer seems to be the issue:

  Writer_Version     Outcome:
  <= 2.6.19          OK
  >= 2.6.20	     BAD

  I've tested both vanilla kernel.org kernels and Ubuntu 8.04 kernels.

  I can try to bisect between 2.6.19 <-> 2.6.20. 
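
  (For reference, the bisect boils down to something like the following; a
sketch, assuming a vanilla kernel.org git tree with the release tags:)

   git bisect start
   git bisect bad v2.6.20      # first release where the writer misbehaves
   git bisect good v2.6.19     # last known-good release
   # for each kernel git checks out: build it, boot the writer machine on
   # it, run writer.py/reader.py, then mark the result:
   git bisect good             # or: git bisect bad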

  Anyone else hitting this problem?  Any better ideas?


  					Thanks,
					=a=


   
-- 
===================
Aaron Straus
aaron@merfinllc.com

[-- Attachment #2: reader.py --]
[-- Type: text/x-python, Size: 844 bytes --]

#!/usr/bin/env python
import time
import os

def run(file_name):
    data = ''
    last_data_len = len(data)
    last_size = os.stat(file_name).st_size

    while True:
        st_size = os.stat(file_name).st_size
        if st_size != last_size:
            print 'size changed @ %s' % time.asctime()

        # re-read the whole file from scratch on each pass
        data = open(file_name).read()

        if len(data) != last_data_len:
            print 'new data arrived @ %s' % time.asctime()
            print repr(data[-50:])


        # die as soon as a NUL byte shows up, printing its neighborhood
        if '\0' in data:
            first_index = data.index('\0')
            data_fragment = data[first_index-50:first_index+50]
            print 'Detected NULL @ %d %s' % (first_index, repr(data_fragment))
            break

        # poll roughly four times a second
        time.sleep(0.250)

        last_data_len = len(data)
        last_size = st_size

if __name__ == '__main__':
    run('test-nfs')

[-- Attachment #3: writer.py --]
[-- Type: text/x-python, Size: 257 bytes --]

#!/usr/bin/env python
import time

def run(file_name):
    fp = open(file_name, 'w')

    # append 4000 bytes every 32 seconds, forever (until interrupted)
    while True:
        fp.write('meow\n' * 800)
        fp.flush()          # flushes the stdio buffer only; no fsync()
        time.sleep(32)

if __name__ == '__main__':
    run('test-nfs')


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-05 19:19 blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 Aaron Straus
@ 2008-09-05 19:56 ` Chuck Lever
  2008-09-05 20:04   ` Aaron Straus
                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Chuck Lever @ 2008-09-05 19:56 UTC (permalink / raw)
  To: Aaron Straus
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

[ replacing cc: nfs@sf.net with linux-nfs@vger.kernel.org, and neil's  
old address with his current one ]

On Sep 5, 2008, at 3:19 PM, Aaron Straus wrote:
> Hi all,
>
>  We're hitting some bad behavior in NFS v3.  The situation is this:
>
>   machine A - NFS server
>
>   machine B - NFS client (writer)
>   machine C - NFS client (reader)
>
>   (all machines x86 SMP)
>
>  machine A exports a directory on ext3 filesystem:
>
> 	/srv/home       192.168.0.0/24(rw,sync,no_subtree_check)
>
>  machines B and C mount that directory normally
>
>        mount A:/srv/home /mntpnt
>
>  machine B opens a file and writes to it (think a log file)
>
>  machine C stats that file, opens it and reads it (think tailing the
>                                                    log file)
>
>
>  The issue is that machine C will often see large blocks of NULLs
> (zeros) in the file.  If you do the same read again just after you see
> the block of NULLs, you will see the proper data.
>
>  Attached are two simple python programs that demonstrate the problem.
>
>  To use them (they will write to a file called test-nfs in CWD):
>
> (on machine B in one window)
>
>   python writer.py
>
> (on machine C in another window)
>
>   python reader.py
>
>
>  reader.py will die when it sees NULLs in the file.  Usually for us
> this happens after about 60s (two timeouts I think).   The first NULL is
> usually either at index 4000 or 8000 depending on the kernel.
>
>
>  Now the version of the kernel the server is running doesn't seem to
> matter.  The reader also doesn't seem to matter (though I didn't test
> this completely).  The writer seems to be the issue:
>
>  Writer_Version     Outcome:
>  <= 2.6.19          OK
>  >= 2.6.20	    BAD

Up to which kernel?  Recent ones may address this issue already.

>  I've tested both vanilla kernel.org kernels and Ubuntu 8.04 kernels.
>
>  I can try to bisect between 2.6.19 <-> 2.6.20.

That's a good start.

Comparing a wire trace with strace output, starting with the writing  
client, might also be illuminating.  We prefer wireshark as it uses  
good default trace settings, parses the wire bytes and displays them  
coherently, and allows you to sort the frames in various useful ways.
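
(A minimal way to collect both on the writing client, as a sketch, assuming
NFS over the standard port 2049:)

   # full-packet capture of the NFS traffic, for later viewing in wireshark
   tcpdump -s 0 -w writer.pcap port 2049 &
   # trace the writer's system calls, with timestamps, at the same time
   strace -f -tt -o writer.strace python writer.py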

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-05 19:56 ` [NFS] " Chuck Lever
@ 2008-09-05 20:04   ` Aaron Straus
  2008-09-05 20:36     ` Bernd Eckenfels
  2008-09-05 20:36     ` Chuck Lever
  2008-09-06  0:03   ` Aaron Straus
  2008-09-08 19:02   ` Aaron Straus
  2 siblings, 2 replies; 25+ messages in thread
From: Aaron Straus @ 2008-09-05 20:04 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

Hi,

On Sep 05 03:56 PM, Chuck Lever wrote:
> [ replacing cc: nfs@sf.net with linux-nfs@vger.kernel.org, and neil's  
> old address with his current one ]

Sorry I probably grabbed an old MAINTAINERS file.

> On Sep 5, 2008, at 3:19 PM, Aaron Straus wrote:
> > Writer_Version     Outcome:
> > <= 2.6.19          OK
> > >= 2.6.20	    BAD
> 
> Up to which kernel?  Recent ones may address this issue already.

BAD up to 2.6.27-rc? 

I have to check exactly which rc version was the last one I tested.

> > I can try to bisect between 2.6.19 <-> 2.6.20.
> 
> That's a good start.

OK will try to bisect.

> Comparing a wire trace with strace output, starting with the writing  
> client, might also be illuminating.  We prefer wireshark as it uses  
> good default trace settings, parses the wire bytes and displays them  
> coherently, and allows you to sort the frames in various useful ways.

OK.  Could you also try to reproduce on your side using those python
programs?  I want to make sure it's not something specific with our
mounts, etc.

Thanks!
					=a=


-- 
===================
Aaron Straus
aaron@merfinllc.com


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-05 20:04   ` Aaron Straus
@ 2008-09-05 20:36     ` Bernd Eckenfels
  2008-09-05 20:36     ` Chuck Lever
  1 sibling, 0 replies; 25+ messages in thread
From: Bernd Eckenfels @ 2008-09-05 20:36 UTC (permalink / raw)
  To: linux-kernel

In article <20080905200455.GH22796@merfinllc.com> you wrote:
> OK.  Could you also try to reproduce on your side using those python
> programs?  I want to make sure it's not something specific with our
> mounts, etc.

Sounds like the inode sync comes before the data sync.  I suspect that's a
client-side "problem".  Maybe an fsync on the dir, or something similar,
while the file is open?

Bernd


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-05 20:04   ` Aaron Straus
  2008-09-05 20:36     ` Bernd Eckenfels
@ 2008-09-05 20:36     ` Chuck Lever
  2008-09-05 22:14       ` Aaron Straus
  1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2008-09-05 20:36 UTC (permalink / raw)
  To: Aaron Straus
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

On Sep 5, 2008, at 4:04 PM, Aaron Straus wrote:
> Hi,
>
> On Sep 05 03:56 PM, Chuck Lever wrote:
>> [ replacing cc: nfs@sf.net with linux-nfs@vger.kernel.org, and neil's
>> old address with his current one ]
>
> Sorry I probably grabbed an old MAINTAINERS file.
>
>> On Sep 5, 2008, at 3:19 PM, Aaron Straus wrote:
>>> Writer_Version     Outcome:
>>> <= 2.6.19          OK
>>> >= 2.6.20          BAD
>>
>> Up to which kernel?  Recent ones may address this issue already.
>
> BAD up to 2.6.27-rc?
>
> I have to see exactly which is the last rc version I tested.
>
>>> I can try to bisect between 2.6.19 <-> 2.6.20.
>>
>> That's a good start.
>
> OK will try to bisect.
>
>> Comparing a wire trace with strace output, starting with the writing
>> client, might also be illuminating.  We prefer wireshark as it uses
>> good default trace settings, parses the wire bytes and displays them
>> coherently, and allows you to sort the frames in various useful ways.
>
> OK.  Could you also try to reproduce on your side using those python
> programs?  I want to make sure it's not something specific with our
> mounts, etc.

I have the latest Fedora 9 kernels on two clients, mounting via NFSv3  
using "actimeo=600" (for other reasons).  The server is OpenSolaris  
2008.5.

reader.py reported zeroes in the test file after about 5 minutes.

Looking at the file a little later, I don't see any problems with it.

Since your scripts are not using any kind of serialization (i.e. file
locking) between the clients, I wonder if non-deterministic behavior is
to be expected.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-05 20:36     ` Chuck Lever
@ 2008-09-05 22:14       ` Aaron Straus
  0 siblings, 0 replies; 25+ messages in thread
From: Aaron Straus @ 2008-09-05 22:14 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

Hi,

On Sep 05 04:36 PM, Chuck Lever wrote:
> I have the latest Fedora 9 kernels on two clients, mounting via NFSv3  
> using "actimeo=600" (for other reasons).  The server is OpenSolaris  
> 2008.5.
> 
> reader.py reported zeroes in the test file after about 5 minutes.

Awesome.  Thanks for testing!  Our actimeo is much shorter, which is
probably why it happens sooner for us.

> Looking at the file a little later, I don't see any problems with it.
> 
> Since your scripts are not using any kind of serialization (i.e. file
> locking) between the clients, I wonder if non-deterministic behavior is
> to be expected.

Hmm... yep.  I don't know what guarantees we want to make.   The
behavior doesn't seem to be consistent with older kernels though... so
I'm thinking it might be a bug.

We hit this particular issue because we have scripts which essentially
'tail -f' log files looking for errors.   They miss log messages (and
see corrupted ones) because of the NULLs.  That's also why there is no
serialization... we don't need it when grepping through log messages.

I'm bisecting now.  I see a block of intricate-looking NFS patches; I'll
try to narrow it down to a particular commit.

I'll also get the wireshark data at that point.

					Thanks,
					=a=


-- 
===================
Aaron Straus
aaron@merfinllc.com



* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-05 19:56 ` [NFS] " Chuck Lever
  2008-09-05 20:04   ` Aaron Straus
@ 2008-09-06  0:03   ` Aaron Straus
  2008-09-08 19:02   ` Aaron Straus
  2 siblings, 0 replies; 25+ messages in thread
From: Aaron Straus @ 2008-09-06  0:03 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

On Sep 05 03:56 PM, Chuck Lever wrote:
> > I can try to bisect between 2.6.19 <-> 2.6.20.
> 
> That's a good start.

Hi,

OK.    Bisected.

This is the commit where we start to see blocks of NULLs in NFS files.

e261f51f25b98c213e0b3d7f2109b117d714f69d is first bad commit
commit e261f51f25b98c213e0b3d7f2109b117d714f69d
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Tue Dec 5 00:35:41 2006 -0500

    NFS: Make nfs_updatepage() mark the page as dirty.
    
    This will ensure that we can call set_page_writeback() from within
    nfs_writepage(), which is always called with the page lock set.
    
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>



					Thanks,
					=a=



-- 
===================
Aaron Straus
aaron@merfinllc.com



* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-05 19:56 ` [NFS] " Chuck Lever
  2008-09-05 20:04   ` Aaron Straus
  2008-09-06  0:03   ` Aaron Straus
@ 2008-09-08 19:02   ` Aaron Straus
  2008-09-08 21:15     ` Chuck Lever
  2 siblings, 1 reply; 25+ messages in thread
From: Aaron Straus @ 2008-09-08 19:02 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

Hi,

On Sep 05 03:56 PM, Chuck Lever wrote:
> Comparing a wire trace with strace output, starting with the writing  
> client, might also be illuminating.  We prefer wireshark as it uses  
> good default trace settings, parses the wire bytes and displays them  
> coherently, and allows you to sort the frames in various useful ways.

OK in addition to the bisection I've collected trace data for the good
(commit 4d770ccf4257b23a7ca2a85de1b1c22657b581d8) and bad (commit
e261f51f25b98c213e0b3d7f2109b117d714f69d) cases.

Attached is a file called trace.tar.bz2 inside you'll find 4 files, for
the two sessions:

  bad-wireshark
  bad-strace

  good-wireshark
  good-strace

From a quick glance, the difference seems to be that the bad case does an
UNSTABLE NFS WRITE call.  I don't really know what that means or what
its semantics are... but that bad commit does seem to introduce this
regression.

Anything else I can provide?

Thanks!

					=a=


-- 
===================
Aaron Straus
aaron@merfinllc.com

[-- Attachment #2: trace.tar.bz2 --]
[-- Type: application/octet-stream, Size: 13979 bytes --]


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-08 19:02   ` Aaron Straus
@ 2008-09-08 21:15     ` Chuck Lever
  2008-09-08 22:02       ` Aaron Straus
  2008-09-09 19:46       ` Aaron Straus
  0 siblings, 2 replies; 25+ messages in thread
From: Chuck Lever @ 2008-09-08 21:15 UTC (permalink / raw)
  To: Aaron Straus
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

On Mon, Sep 8, 2008 at 3:02 PM, Aaron Straus <aaron@merfinllc.com> wrote:
> On Sep 05 03:56 PM, Chuck Lever wrote:
>> Comparing a wire trace with strace output, starting with the writing
>> client, might also be illuminating.  We prefer wireshark as it uses
>> good default trace settings, parses the wire bytes and displays them
>> coherently, and allows you to sort the frames in various useful ways.
>
> OK in addition to the bisection I've collected trace data for the good
> (commit 4d770ccf4257b23a7ca2a85de1b1c22657b581d8) and bad (commit
> e261f51f25b98c213e0b3d7f2109b117d714f69d) cases.
>
> Attached is a file called trace.tar.bz2 inside you'll find 4 files, for
> the two sessions:
>
>  bad-wireshark
>  bad-strace
>
>  good-wireshark
>  good-strace
>
> From a quick glance, the difference seems to be that the bad case does an
> UNSTABLE NFS WRITE call.  I don't really know what that means or what
> its semantics are... but that bad commit does seem to introduce this
> regression.

A little analysis reveals the bad behavior on the wire.  I filtered
the network traces for just WRITE requests.  There were six in the
good trace, and five in the bad.  This seems to correspond with the
strace logs, which show that you stopped the bad run before it
generated more writes.
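
(For the record, a display filter along these lines does that; a sketch,
assuming the captures are in pcap files and using wireshark's
nfs.procedure_v3 field, where procedure 7 is WRITE in NFSv3:)

   tshark -r good.pcap -R 'nfs.procedure_v3 == 7'
   tshark -r bad.pcap  -R 'nfs.procedure_v3 == 7'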

Good:

time:           offset:         len:    sync:           byte range:
 35.402734      0               8000    FILE_SYNC       [0,     7999]
100.405170      4096            11904   FILE_SYNC       [4096,  15999]
160.407822      12288           7712    FILE_SYNC       [12288, 19999]
195.411035      16384           11616   FILE_SYNC       [16384, 27999]
260.413574      24576           11424   FILE_SYNC       [24576, 35999]
295.704044      32768           7232    FILE_SYNC       [32768, 39999]

Bad:

time:           offset:         len:    sync:           byte range:
 36.989907      0               8000    FILE_SYNC       [0,     7999]
101.991164      4096            11904   FILE_SYNC       [4096,  15999]
161.992447      12288           7712    FILE_SYNC       [12288, 19999]
166.992203      20480           3520    UNSTABLE        [20480, 23999]
169.181642      16384           4096    FILE_SYNC       [16384, 20479]

The first thing I notice is that the client is re-writing areas of the
file.  It seems to want to generate page-aligned WRITEs, but this
means it writes over bytes that were already sent to the server.
Probably not the issue here, but that's certainly not optimal
behavior.  The strace logs show that the application generates
4000-byte write(2) calls, without any operation interleaved that might
change the seek offset.

I might go so far as to say that this is incorrect behavior, as multiple
clients writing to a file not on page boundaries would clobber each
other's WRITEs.

The second thing is that UNSTABLE WRITE.  The fact that it is UNSTABLE
is not the problem here.  The byte range is what is most interesting:
it leaves a hole in the file (bytes 20000-20479, between the third and
fourth writes), which the next write, 3 seconds later, fills in.

Normally this isn't a big deal, as NFS likes applications to serialize
access to shared files.  However, in this case we expect the .flush()
to force all dirty bytes out at once, and this is definitely not
occurring....

The third thing is that I don't see any flush or fsync system calls in
the strace logs, even though the writer.py script invokes fp.flush().
I'm not terribly familiar with Python's flush().  How do you expect
the flush() method to work?

I wonder why the client is using FILE_SYNC writes and not
UNSTABLE+COMMIT. Are the clients using the "sync" mount option?  I
don't see O_SYNC on the open.

  open("test-nfs", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3

I think I saw some recent work in Trond's development branch that
makes some changes in this area.  I will wait for him to respond to
this thread.

--
Chuck Lever


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-08 21:15     ` Chuck Lever
@ 2008-09-08 22:02       ` Aaron Straus
  2008-09-09 19:46       ` Aaron Straus
  1 sibling, 0 replies; 25+ messages in thread
From: Aaron Straus @ 2008-09-08 22:02 UTC (permalink / raw)
  To: chucklever
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

Hi Chuck,

On Sep 08 05:15 PM, Chuck Lever wrote:
> This seems to correspond with the strace logs, which show that you
> stopped the bad run before it generated more writes.

Yes.  Sorry.   I stopped the writer once the reader saw the NULL bytes.

> Normally this isn't a big deal, as NFS likes applications to serialize
> access to shared files.  However, in this case we expect the .flush()
> to force all dirty bytes out at once, and this is definitely not
> occurring....
> 
> The third thing is that I don't see any flush or fsync system calls in
> the strace logs, even though the writer.py script invokes fp.flush().
> I'm not terribly familiar with Python's flush().  How do you expect
> the flush() method to work?

I think fp.flush() in Python is equivalent to fflush(), i.e. it only
flushes the stdio buffer.  It doesn't actually call fsync or fdatasync
or anything like that.

We can pretty easily change the Python code to use the syscall
write()/etc methods.
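
(Roughly this change, as a sketch, assuming the os module is imported:)

   fp.flush()                # what writer.py does now: drain the stdio buffer
   os.fsync(fp.fileno())     # the extra call needed to force data to the server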

> I wonder why the client is using FILE_SYNC writes and not
> UNSTABLE+COMMIT. Are the clients using the "sync" mount option?  I
> don't see O_SYNC on the open.

I've exported them on the server as sync, e.g. in exports:

  /export		192.168.0.0/24(rw,sync)

However, I don't give any special options when mounting, e.g. on the
client:

  mount machine:/export  mntpnt

Anyway thanks for looking at this and let me know if I can do anything
else!


					=a=


-- 
===================
Aaron Straus
aaron@merfinllc.com



* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-08 21:15     ` Chuck Lever
  2008-09-08 22:02       ` Aaron Straus
@ 2008-09-09 19:46       ` Aaron Straus
  2008-09-11 16:55         ` Chuck Lever
  1 sibling, 1 reply; 25+ messages in thread
From: Aaron Straus @ 2008-09-09 19:46 UTC (permalink / raw)
  To: chucklever
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

Hi,

On Sep 08 05:15 PM, Chuck Lever wrote:
> I think I saw some recent work in Trond's development branch that
> makes some changes in this area.  I will wait for him to respond to
> this thread.

One other piece of information.

Of the bisected offending commit:

commit e261f51f25b98c213e0b3d7f2109b117d714f69d
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Tue Dec 5 00:35:41 2006 -0500

    NFS: Make nfs_updatepage() mark the page as dirty.
    
    This will ensure that we can call set_page_writeback() from within
    nfs_writepage(), which is always called with the page lock set.
    
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>


It seems to be this hunk which introduces the problem:


@@ -628,7 +667,6 @@ static struct nfs_page * nfs_update_request(struct nfs_open_context* ctx,
 				return ERR_PTR(error);
 			}
 			spin_unlock(&nfsi->req_lock);
-			nfs_mark_request_dirty(new);
 			return new;
 		}
 		spin_unlock(&nfsi->req_lock);



If I add that function call back in... the problem disappears.  I don't
know if this just papers over the real problem though?  


					Thanks,
					=a=





-- 
===================
Aaron Straus
aaron@merfinllc.com



* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-09 19:46       ` Aaron Straus
@ 2008-09-11 16:55         ` Chuck Lever
  2008-09-11 17:19           ` Aaron Straus
  0 siblings, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2008-09-11 16:55 UTC (permalink / raw)
  To: Aaron Straus
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

On Sep 9, 2008, at 3:46 PM, Aaron Straus wrote:
> Hi,
>
> On Sep 08 05:15 PM, Chuck Lever wrote:
>> I think I saw some recent work in Trond's development branch that
>> makes some changes in this area.  I will wait for him to respond to
>> this thread.
>
> One other piece of information.
>
> Of the bisected offending commit:
>
> commit e261f51f25b98c213e0b3d7f2109b117d714f69d
> Author: Trond Myklebust <Trond.Myklebust@netapp.com>
> Date:   Tue Dec 5 00:35:41 2006 -0500
>
>    NFS: Make nfs_updatepage() mark the page as dirty.
>
>    This will ensure that we can call set_page_writeback() from within
>    nfs_writepage(), which is always called with the page lock set.
>
>    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
>
>
> It seems to be this hunk which introduces the problem:
>
>
> @@ -628,7 +667,6 @@ static struct nfs_page *  
> nfs_update_request(struct nfs_open_context* ctx,
> 				return ERR_PTR(error);
> 			}
> 			spin_unlock(&nfsi->req_lock);
> -			nfs_mark_request_dirty(new);
> 			return new;
> 		}
> 		spin_unlock(&nfsi->req_lock);
>
>
>
> If I add that function call back in... the problem disappears.  I don't
> know if this just papers over the real problem though?

It's possible that the NFS write path as it exists after this commit  
is not marking certain pages dirty at the same time as others.  An  
asynchronous flush call from the VM would catch some of the dirty  
pages, then later, others, resulting in the pages flushing to the  
server out of order.

However, I poked around a little bit in late model kernels yesterday  
and found that the changes from Trond I referred to earlier are  
already in 2.6.27.  These rework this area significantly, so reverting  
this hunk alone will be more difficult in recent kernels.

In addition I notice that one purpose of this commit is to change the  
page locking scheme in the NFS client's write path.  I wouldn't advise  
simply adding this nfs_mark_request_dirty() call back in.  It could  
result in deadlocks for certain workloads (perhaps not your specific  
workload, though).

A more thorough review of the NFS write and flush logic that exists in  
2.6.27 is needed if we choose to recognize this issue as a real problem.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-11 16:55         ` Chuck Lever
@ 2008-09-11 17:19           ` Aaron Straus
  2008-09-11 17:48             ` Chuck Lever
  0 siblings, 1 reply; 25+ messages in thread
From: Aaron Straus @ 2008-09-11 17:19 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

Hi,

On Sep 11 12:55 PM, Chuck Lever wrote:
> A more thorough review of the NFS write and flush logic that exists in
> 2.6.27 is needed if we choose to recognize this issue as a real
> problem.

Yep.  

Sorry.  I didn't mean we should revert the hunk.  I was just trying to
help identify the cause of the new behavior.

I think this is a real problem, albeit not a "serious" one.  Network
file-systems usually try to avoid readers seeing blocks of zeros in
files, especially in this simple writer/reader case.

It wouldn't be bad if the file were written out of order occasionally,
but we see this constantly now.

We cannot write/read log files to NFS mounts reliably any more.  That
seems like a valid use case which no longer works?

Anyway, thanks for your time!

				=a=



-- 
===================
Aaron Straus
aaron@merfinllc.com


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-11 17:19           ` Aaron Straus
@ 2008-09-11 17:48             ` Chuck Lever
  2008-09-11 18:49               ` Aaron Straus
  0 siblings, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2008-09-11 17:48 UTC (permalink / raw)
  To: Aaron Straus
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

On Sep 11, 2008, at 1:19 PM, Aaron Straus wrote:
> Hi,
>
> On Sep 11 12:55 PM, Chuck Lever wrote:
>> A more thorough review of the NFS write and flush logic that exists in
>> 2.6.27 is needed if we choose to recognize this issue as a real
>> problem.
>
> Yep.
>
> Sorry.  I didn't mean we should revert the hunk.  I was just trying to
> help identify the cause of the new behavior.
>
> I think this is a real problem albeit not a "serious" one.  Network
> file-systems usually try to avoid readers seeing blocks of zeros in
> files, especially in this simple writer/reader case.
>
> It wouldn't be bad if the file were written out of order occasionally, but
> we see this constantly now.
>
> We cannot write/read log files to NFS mounts reliably any more.  That
> seems like a valid use case which no longer works?

Were you able to modify your writer to do real fsync system calls?  If  
so, did it help?  That would be a useful data point.

NFS uses close-to-open cache coherency.  Thus we expect the file to be  
consistent and up to date after it is opened by a client, but not  
necessarily if some other client writes to it after it was opened.  We  
usually recommend file locking and direct I/O to minimize these  
problems.
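
(As a sketch of the locking recommendation in the scripts' terms; the writer
would need matching fcntl.LOCK_EX calls around each write for this to help:)

   import fcntl

   fp = open('test-nfs')
   fcntl.lockf(fp.fileno(), fcntl.LOCK_SH)   # shared lock while reading
   data = fp.read()
   fcntl.lockf(fp.fileno(), fcntl.LOCK_UN)   # release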

Practically speaking this is often not enough for typical
applications, so NFS client implementations go to further
(non-standard) efforts to behave like a local file system.  This is
simply a question of whether we can address this while not creating
performance or correctness issues for other common use cases.

Anyway, I'm not the NFS client maintainer, so this decision is not up  
to me.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-11 17:48             ` Chuck Lever
@ 2008-09-11 18:49               ` Aaron Straus
  2008-09-22 16:05                 ` Hans-Peter Jansen
  0 siblings, 1 reply; 25+ messages in thread
From: Aaron Straus @ 2008-09-11 18:49 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

Hi,

On Sep 11 01:48 PM, Chuck Lever wrote:
> Were you able to modify your writer to do real fsync system calls?  If  
> so, did it help?  That would be a useful data point.

Yes/Yes.  Please see the attached tarball sync-test.tar.bz2.

Inside you'll find the modified writer:  writer_sync.py

That will call fsync on the file descriptor after each write.
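
(The attachment body isn't reproduced in the archive; presumably
writer_sync.py is just writer.py plus the fsync, along these lines:)

   #!/usr/bin/env python
   import os
   import time

   def run(file_name):
       fp = open(file_name, 'w')
       while True:
           fp.write('meow\n' * 800)
           fp.flush()                # drain the stdio buffer (plain write(2))
           os.fsync(fp.fileno())     # force the dirty pages out to the server
           time.sleep(32)

   if __name__ == '__main__':
       run('test-nfs')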

Also I added the strace and wireshark data for the sync and nosync
cases.

Note these were all tested with the latest Linus git:

  d1c6d2e547148c5aa0c0a4ff6aac82f7c6da1d8b

> Practically speaking this is often not enough for typical  
> applications, so NFS client implementations go to further (non- 
> standard) efforts to behave like a local file system.  This is simply  
> a question of whether we can address this while not creating  
> performance or correctness issues for other common use cases.

Yep, I agree.  I'm not saying what we do now is "wrong" per the RFC
(writing the file out of order).  It's just different from what we've
done in the past (and somewhat unexpected).

I'm still hoping there is a simple fix... but maybe not... :(

Thanks!

					=a=

-- 
===================
Aaron Straus
aaron@merfinllc.com

[-- Attachment #2: sync-test.tar.bz2 --]
[-- Type: application/octet-stream, Size: 15200 bytes --]


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-11 18:49               ` Aaron Straus
@ 2008-09-22 16:05                 ` Hans-Peter Jansen
  2008-09-22 16:35                   ` Trond Myklebust
  0 siblings, 1 reply; 25+ messages in thread
From: Hans-Peter Jansen @ 2008-09-22 16:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Aaron Straus, Chuck Lever, Neil Brown, Linux NFS Mailing List,
	Trond Myklebust

Hi Aaron, hi NFS hackers,

On Thursday, 11 September 2008, Aaron Straus wrote:
> Hi,
>
> On Sep 11 01:48 PM, Chuck Lever wrote:
> > Were you able to modify your writer to do real fsync system calls?  If
> > so, did it help?  That would be a useful data point.
>
> Yes/Yes.  Please see the attached tarball sync-test.tar.bz2.
>
> Inside you'll find the modified writer:  writer_sync.py
>
> That will call fsync on the file descriptor after each write.
>
> Also I added the strace and wireshark data for the sync and nosync
> cases.
>
> Note these were all tested with the latest Linus git:
>
>   d1c6d2e547148c5aa0c0a4ff6aac82f7c6da1d8b
>
> > Practically speaking this is often not enough for typical
> > applications, so NFS client implementations go to further (non-
> > standard) efforts to behave like a local file system.  This is simply
> > a question of whether we can address this while not creating
> > performance or correctness issues for other common use cases.
>
> Yep, I agree.  I'm not saying what we do now is "wrong" per the RFC
> (writing the file out of order).  It's just different from what we've
> done in the past (and somewhat unexpected).
>
> I'm still hoping there is a simple fix... but maybe not... :(

For what it's worth, this behavior is visible with bog-standard
writing/reading of files (log files in my case, via the Python logging
package).  It obviously deviates from local filesystem behavior, and from
the former state of the Linux NFS client.  Should we patch less, tail, and
every other tool for watching/analysing log files (just to pick the tip of
the iceberg) to throw away runs of zeros when reading from NFS-mounted
files?  Or should we ask their maintainers to add locking code for the NFS
"reading files which are being written at the same time" case, just to work
around __some__ of the consequences of this bug?  Imagine how ugly this is
going to look!

The whole issue is what I call a major regression, so I strongly ask for a
reply from Trond on this matter.

I even vote for sending a revert request for this hunk to the stable team,
where applicable, once Trond has sorted it out (for 2.6.27?).

Thanks, Aaron and Chuck, for the detailed analysis: it demystified a weird
behavior I observed here.  When you're trying to get real work done on a
fixed timeline, such things can drive you mad.

Cheers,
Pete


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-22 16:05                 ` Hans-Peter Jansen
@ 2008-09-22 16:35                   ` Trond Myklebust
  2008-09-22 17:04                     ` Aaron Straus
  2008-09-22 18:45                     ` Hans-Peter Jansen
  0 siblings, 2 replies; 25+ messages in thread
From: Trond Myklebust @ 2008-09-22 16:35 UTC (permalink / raw)
  To: Hans-Peter Jansen
  Cc: linux-kernel, Aaron Straus, Chuck Lever, Neil Brown,
	Linux NFS Mailing List

On Mon, 2008-09-22 at 18:05 +0200, Hans-Peter Jansen wrote:
> For what it's worth, this behavior is visible with bog-standard
> writing/reading of files (log files in my case, via the Python logging
> package).  It obviously deviates from local filesystem behavior, and from
> the former state of the Linux NFS client.  Should we patch less, tail, and
> every other tool for watching/analysing log files (just to pick the tip of
> the iceberg) to throw away runs of zeros when reading from NFS-mounted
> files?  Or should we ask their maintainers to add locking code for the NFS
> "reading files which are being written at the same time" case, just to
> work around __some__ of the consequences of this bug?  Imagine how ugly
> this is going to look!
> 
> The whole issue is what I call a major regression, so I strongly ask for a
> reply from Trond on this matter.
> 
> I even vote for sending a revert request for this hunk to the stable team,
> where applicable, once Trond has sorted it out (for 2.6.27?).
> 
> Thanks, Aaron and Chuck, for the detailed analysis: it demystified a weird
> behavior I observed here.  When you're trying to get real work done on a
> fixed timeline, such things can drive you mad.

Revert _what_ exactly?

Please assume that I've been travelling for the past 5 weeks, and have
only a sketchy idea of what has been going on.
My understanding was that this is a consequence of unordered writes
causing the file to be extended while some other task is reading.
AFAICS, this sort of behaviour has _always_ been possible. I can't see
how reverting anything will fix it.

  Trond



* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-22 16:35                   ` Trond Myklebust
@ 2008-09-22 17:04                     ` Aaron Straus
  2008-09-22 17:26                       ` Chuck Lever
  2008-09-22 17:29                       ` Trond Myklebust
  2008-09-22 18:45                     ` Hans-Peter Jansen
  1 sibling, 2 replies; 25+ messages in thread
From: Aaron Straus @ 2008-09-22 17:04 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Hans-Peter Jansen, linux-kernel, Chuck Lever, Neil Brown,
	Linux NFS Mailing List

Hi,

On Sep 22 12:35 PM, Trond Myklebust wrote:
> Revert _what_ exactly?

Yep.  I narrowed the problem down to an offending hunk in a particular
patch.  Removing that hunk did eliminate the problem.   However,
reverting that hunk is likely wrong and the code has changed _a lot_
since that commit.

> My understanding was that this is a consequence of unordered writes
> causing the file to be extended while some other task is reading.

Yes.  I added some debugging statements to look at the writeout path.  

I think the following happens:

  - page 0 gets partially written to by app
  - VM writes out partial dirty page 0
  - page 0 gets fully written by app
  - page 1 gets partially written by app
  - VM writes out dirty page 1

  At this point there is a hole in the file.  The tail end of page 0 is
still not written to server.

  - VM writes out dirty page 0
  ...

> AFAICS, this sort of behaviour has _always_ been possible. I can't see
> how reverting anything will fix it.

Here is the crux.  It was possible previously, but unlikely; e.g. our app
never saw this behavior.  The new writeout semantics cause visible
holes in files often.

Anyway, I agree the new writeout semantics are allowed and possibly
saner than the previous writeout path.  The problem is that it is
__annoying__ for this use case (log files).

I'm not sure if there is an easy solution.  We want the VM to writeout
the address space in order.   Maybe we can start the scan for dirty
pages at the last page we wrote out i.e. page 0 in the example above?

					=a=



-- 
===================
Aaron Straus
aaron@merfinllc.com


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-22 17:04                     ` Aaron Straus
@ 2008-09-22 17:26                       ` Chuck Lever
  2008-09-22 17:37                         ` Aaron Straus
  2008-09-22 17:29                       ` Trond Myklebust
  1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2008-09-22 17:26 UTC (permalink / raw)
  To: Aaron Straus
  Cc: Trond Myklebust, Hans-Peter Jansen, linux-kernel, Neil Brown,
	Linux NFS Mailing List

On Mon, Sep 22, 2008 at 1:04 PM, Aaron Straus <aaron@merfinllc.com> wrote:
> Hi,
>
> On Sep 22 12:35 PM, Trond Myklebust wrote:
>> Revert _what_ exactly?
>
> Yep.  I narrowed the problem down to an offending hunk in a particular
> patch.  Removing that hunk did eliminate the problem.   However,
> reverting that hunk is likely wrong and the code has changed _a lot_
> since that commit.
>
>> My understanding was that this is a consequence of unordered writes
>> causing the file to be extended while some other task is reading.
>
> Yes.  I added some debugging statements to look at the writeout path.
>
> I think the following happens:
>
>  - page 0 gets partially written to by app
>  - VM writes out partial dirty page 0
>  - page 0 gets fully written by app
>  - page 1 gets partially written by app
>  - VM writes out dirty page 1
>
>  At this point there is a hole in the file.  The tail end of page 0 is
> still not written to server.
>
>  - VM writes out dirty page 0
>  ...
>
>> AFAICS, this sort of behaviour has _always_ been possible. I can't see
>> how reverting anything will fix it.
>
> Here is the crux.  It was possible previously, but unlikely; e.g. our app
> never saw this behavior.  The new writeout semantics cause visible
> holes in files often.
>
> Anyway, I agree the new writeout semantics are allowed and possibly
> saner than the previous writeout path.  The problem is that it is
> __annoying__ for this use case (log files).
>
> I'm not sure if there is an easy solution.  We want the VM to writeout
> the address space in order.   Maybe we can start the scan for dirty
> pages at the last page we wrote out i.e. page 0 in the example above?

Why can't you use O_SYNC | O_APPEND?
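
(In the writer's terms that would be a sketch like the following, using the
low-level os calls, since Python's built-in open() doesn't expose these
flags:)

   import os

   fd = os.open('test-nfs',
                os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC, 0666)
   os.write(fd, 'meow\n' * 800)   # a synchronous append: on the server
                                  # by the time this call returns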

--
Chuck Lever


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-22 17:04                     ` Aaron Straus
  2008-09-22 17:26                       ` Chuck Lever
@ 2008-09-22 17:29                       ` Trond Myklebust
  2008-09-22 17:45                         ` Aaron Straus
  1 sibling, 1 reply; 25+ messages in thread
From: Trond Myklebust @ 2008-09-22 17:29 UTC (permalink / raw)
  To: Aaron Straus
  Cc: Hans-Peter Jansen, linux-kernel, Chuck Lever, Neil Brown,
	Linux NFS Mailing List

On Mon, 2008-09-22 at 10:04 -0700, Aaron Straus wrote:
> Here is the crux.  It was possible previously, but unlikely; e.g. our app
> never saw this behavior.  The new writeout semantics cause visible
> holes in files often.
> 
> Anyway, I agree the new writeout semantics are allowed and possibly
> saner than the previous writeout path.  The problem is that it is
> __annoying__ for this use case (log files).

There is always the option of using syslog.

> I'm not sure if there is an easy solution.  We want the VM to writeout
> the address space in order.   Maybe we can start the scan for dirty
> pages at the last page we wrote out i.e. page 0 in the example above?

You can never guarantee that in a multi-threaded environment.

Two threads may, for instance, force 2 competing fsync() calls: that
again may cause out-of-order writes.
...and even if the client doesn't reorder the writes, the _server_ may
do it, since multiple nfsd threads may race when processing writes to
the same file.

Anyway, the patch to force a single-threaded NFS client to write out the
data in order is trivial.  See attachment...

Trond

[-- Attachment #2: linux-2.6.27-ordered_writes.dif --]
[-- Type: message/rfc822, Size: 749 bytes --]

From: Trond Myklebust <Trond.Myklebust@netapp.com>
Subject: No Subject
Date: 
Message-ID: <1222104509.7615.22.camel@localhost>

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---

 fs/nfs/write.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 3229e21..eb6b211 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1428,7 +1428,8 @@ static int nfs_write_mapping(struct address_space *mapping, int how)
 		.sync_mode = WB_SYNC_NONE,
 		.nr_to_write = LONG_MAX,
 		.for_writepages = 1,
-		.range_cyclic = 1,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
 	};
 	int ret;
 


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-22 17:26                       ` Chuck Lever
@ 2008-09-22 17:37                         ` Aaron Straus
  0 siblings, 0 replies; 25+ messages in thread
From: Aaron Straus @ 2008-09-22 17:37 UTC (permalink / raw)
  To: chucklever
  Cc: Trond Myklebust, Hans-Peter Jansen, linux-kernel, Neil Brown,
	Linux NFS Mailing List

Hi,

On Sep 22 01:26 PM, Chuck Lever wrote:
> Why can't you use O_SYNC | O_APPEND?

In our case, some of the writers are not in our control.  Another case
where we see the issue is when you spawn a job:

   job 1> out 2> err &
   tail -f out err

--

Also, we actually do like caching the writes (because log files do not need
to be checked immediately after being written).   We just wish the cache
could be written out in order.

If there's no way to make that happen... we can reorganize our
filesystem, exports, mounts, etc. so that the log file directory is
mounted with the sync option.

However, the problem *might* be annoying and widespread enough to
warrant a change.

Anyway, thanks!

				Regards,
				=a=





-- 
===================
Aaron Straus
aaron@merfinllc.com


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-22 17:29                       ` Trond Myklebust
@ 2008-09-22 17:45                         ` Aaron Straus
  2008-09-22 18:43                           ` Aaron Straus
  2008-09-22 18:45                           ` Hans-Peter Jansen
  0 siblings, 2 replies; 25+ messages in thread
From: Aaron Straus @ 2008-09-22 17:45 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Hans-Peter Jansen, linux-kernel, Chuck Lever, Neil Brown,
	Linux NFS Mailing List

Hi,

On Sep 22 01:29 PM, Trond Myklebust wrote:
> > Anyway, I agree the new writeout semantics are allowed and possibly
> > saner than the previous writeout path.  The problem is that it is
> > __annoying__ for this use case (log files).
> 
> There is always the option of using syslog.

Definitely.  Everything in our control we can work around.... there are
a few applications we cannot easily change... see the follow-up in
another e-mail.

> > I'm not sure if there is an easy solution.  We want the VM to writeout
> > the address space in order.   Maybe we can start the scan for dirty
> > pages at the last page we wrote out i.e. page 0 in the example above?
> 
> You can never guarantee that in a multi-threaded environment.

Definitely.  This is a single writer, single reader case though.

> Two threads may, for instance, force 2 competing fsync() calls: that
> again may cause out-of-order writes.

Yup.

> ...and even if the client doesn't reorder the writes, the _server_ may
> do it, since multiple nfsd threads may race when processing writes to
> the same file.

Yup.  We're definitely not asking for anything like that.

> Anyway, the patch to force a single threaded nfs client to write out the
> data in order is trivial. See attachment...
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 3229e21..eb6b211 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -1428,7 +1428,8 @@ static int nfs_write_mapping(struct address_space *mapping, int how)
>  		.sync_mode = WB_SYNC_NONE,
>  		.nr_to_write = LONG_MAX,
>  		.for_writepages = 1,
> -		.range_cyclic = 1,
> +		.range_start = 0,
> +		.range_end = LLONG_MAX,
>  	};
>  	int ret;
>  

Yeah, I was looking at that while debugging.  Would that change have a
chance to make it into mainline?  I assume it makes the normal writeout
path more expensive, by forcing a scan of the entire address space.

Also, I should test this, but I thought the VM was calling
nfs_writepages directly, i.e. not going through nfs_write_mapping.  Let
me test with this patch.

					Thanks,
					=a=



-- 
===================
Aaron Straus
aaron@merfinllc.com


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-22 17:45                         ` Aaron Straus
@ 2008-09-22 18:43                           ` Aaron Straus
  2008-09-22 18:45                           ` Hans-Peter Jansen
  1 sibling, 0 replies; 25+ messages in thread
From: Aaron Straus @ 2008-09-22 18:43 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Hans-Peter Jansen, linux-kernel, Chuck Lever, Neil Brown,
	Linux NFS Mailing List

Hi,

On Sep 22 10:45 AM, Aaron Straus wrote:
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index 3229e21..eb6b211 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -1428,7 +1428,8 @@ static int nfs_write_mapping(struct address_space *mapping, int how)
> >  		.sync_mode = WB_SYNC_NONE,
> >  		.nr_to_write = LONG_MAX,
> >  		.for_writepages = 1,
> > -		.range_cyclic = 1,
> > +		.range_start = 0,
> > +		.range_end = LLONG_MAX,
> >  	};
> >  	int ret;
> >  
> 
> Also, I should test this, but I thought the VM was calling
> nfs_writepages directly i.e. not going through nfs_write_mapping.  Let
> me test with this patch.

Yes.  This patch doesn't seem to help.  The VM is calling nfs_writepages
directly.   I have debug statements in nfs_write_mapping(), and it
hasn't been called when the hole appears.

It's the VM dirty-page writeout that is happening out of order, *not*
the nfs_wb_* stuff.   That's why we can see it so easily.  The dirty pages
are only scanned periodically; in between scans is when we see the
hole.

I think :).  I'm obviously not an expert here.

					Thanks,
					=a=


-- 
===================
Aaron Straus
aaron@merfinllc.com


* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-22 16:35                   ` Trond Myklebust
  2008-09-22 17:04                     ` Aaron Straus
@ 2008-09-22 18:45                     ` Hans-Peter Jansen
  1 sibling, 0 replies; 25+ messages in thread
From: Hans-Peter Jansen @ 2008-09-22 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Trond Myklebust, Aaron Straus, Chuck Lever, Neil Brown,
	Linux NFS Mailing List

On Monday, 22 September 2008, Trond Myklebust wrote:
> On Mon, 2008-09-22 at 18:05 +0200, Hans-Peter Jansen wrote:
> > For what it's worth, this behavior is visible with bog-standard
> > writing/reading of files (log files in my case, via the Python logging
> > package).  It obviously deviates from local filesystem behavior, and
> > from the former state of the Linux NFS client.  Should we patch less,
> > tail, and every other tool for watching/analysing log files (just to
> > pick the tip of the iceberg) to throw away runs of zeros when reading
> > from NFS-mounted files?  Or should we ask their maintainers to add
> > locking code for the NFS "reading files which are being written at the
> > same time" case, just to work around __some__ of the consequences of
> > this bug?  Imagine how ugly this is going to look!
> >
> > The whole issue is what I call a major regression, so I strongly ask
> > for a reply from Trond on this matter.
> >
> > I even vote for sending a revert request for this hunk to the stable
> > team, where applicable, once Trond has sorted it out (for 2.6.27?).
> >
> > Thanks, Aaron and Chuck, for the detailed analysis: it demystified a
> > weird behavior I observed here.  When you're trying to get real work
> > done on a fixed timeline, such things can drive you mad.
>
> Revert _what_ exactly?
For your convenience, the important parts are inlined here:

From Aaron's message of Tue, 9 Sep 2008 12:46:44 -0700 in this thread: << EOM

Of the bisected offending commit:

commit e261f51f25b98c213e0b3d7f2109b117d714f69d
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Tue Dec 5 00:35:41 2006 -0500

    NFS: Make nfs_updatepage() mark the page as dirty.
    
    This will ensure that we can call set_page_writeback() from within
    nfs_writepage(), which is always called with the page lock set.
    
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>


It seems to be this hunk which introduces the problem:


@@ -628,7 +667,6 @@ static struct nfs_page * nfs_update_request(struct nfs_open_context* ctx,
                                return ERR_PTR(error);
                        }
                        spin_unlock(&nfsi->req_lock);
-                       nfs_mark_request_dirty(new);
                        return new;
                }
                spin_unlock(&nfsi->req_lock);


If I add that function call back in... the problem disappears.  I don't
know if this just papers over the real problem though?  

EOM

This commit happened between 2.6.19 and 2.6.20, btw.

> Please assume that I've been travelling for the past 5 weeks, and have
> only a sketchy idea of what has been going on.

Ah, I see; that explains why you didn't respond earlier.

> My understanding was that this is a consequence of unordered writes
> causing the file to be extended while some other task is reading.
> AFAICS, this sort of behaviour has _always_ been possible. I can't see
> how reverting anything will fix it.

Hopefully this helps you remember the purpose of that change.

Cheers,
Pete



* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-22 17:45                         ` Aaron Straus
  2008-09-22 18:43                           ` Aaron Straus
@ 2008-09-22 18:45                           ` Hans-Peter Jansen
  1 sibling, 0 replies; 25+ messages in thread
From: Hans-Peter Jansen @ 2008-09-22 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Aaron Straus, Trond Myklebust, Chuck Lever, Neil Brown,
	Linux NFS Mailing List

On Monday, 22 September 2008, Aaron Straus wrote:
> Hi,
>
> On Sep 22 01:29 PM, Trond Myklebust wrote:
> > > Anyway, I agree the new writeout semantics are allowed and possibly
> > > saner than the previous writeout path.  The problem is that it is
> > > __annoying__ for this use case (log files).
> >
> > There is always the option of using syslog.
>
> Definitely.  Everything in our control we can work around.... there are
> a few applications we cannot easily change... see the follow-up in
> another e-mail.
>
> > > I'm not sure if there is an easy solution.  We want the VM to
> > > writeout the address space in order.   Maybe we can start the scan
> > > for dirty pages at the last page we wrote out i.e. page 0 in the
> > > example above?
> >
> > You can never guarantee that in a multi-threaded environment.
>
> Definitely.  This is a single writer, single reader case though.

...where it happens that the reader gets chunks of zeros when reading a
file that is being written by another (single-threaded) process.

Note that going through syslog isn't an option in many cases, unless we want
to rewrite the "world" to work around this phenomenon.  So it's not simply
annoying; as Aaron points out, the "in order" approach is inevitable.

> > Two threads may, for instance, force 2 competing fsync() calls: that
> > again may cause out-of-order writes.
>
> Yup.
>
> > ...and even if the client doesn't reorder the writes, the _server_ may
> > do it, since multiple nfsd threads may race when processing writes to
> > the same file.
>
> Yup.  We're definitely not asking for anything like that.
>
> > Anyway, the patch to force a single threaded nfs client to write out
> > the data in order is trivial. See attachment...
> >
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index 3229e21..eb6b211 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -1428,7 +1428,8 @@ static int nfs_write_mapping(struct address_space *mapping, int how)
> >  		.sync_mode = WB_SYNC_NONE,
> >  		.nr_to_write = LONG_MAX,
> >  		.for_writepages = 1,
> > -		.range_cyclic = 1,
> > +		.range_start = 0,
> > +		.range_end = LLONG_MAX,
> >  	};
> >  	int ret;
>
> Yeah I was looking at that while debugging.  Would that change have
> chance to make it into mainline?  I assume it makes the normal writeout
> path more expensive, by forcing a scan of the entire address space.

If this patch solves this issue, it needs to be applied as soon as possible,
as outlined above.

> Also, I should test this, but I thought the VM was calling
> nfs_writepages directly i.e. not going through nfs_write_mapping.  Let
> me test with this patch.

Let us know about the outcome. 

Thanks,
Pete



Thread overview: 25+ messages
2008-09-05 19:19 blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 Aaron Straus
2008-09-05 19:56 ` [NFS] " Chuck Lever
2008-09-05 20:04   ` Aaron Straus
2008-09-05 20:36     ` Bernd Eckenfels
2008-09-05 20:36     ` Chuck Lever
2008-09-05 22:14       ` Aaron Straus
2008-09-06  0:03   ` Aaron Straus
2008-09-08 19:02   ` Aaron Straus
2008-09-08 21:15     ` Chuck Lever
2008-09-08 22:02       ` Aaron Straus
2008-09-09 19:46       ` Aaron Straus
2008-09-11 16:55         ` Chuck Lever
2008-09-11 17:19           ` Aaron Straus
2008-09-11 17:48             ` Chuck Lever
2008-09-11 18:49               ` Aaron Straus
2008-09-22 16:05                 ` Hans-Peter Jansen
2008-09-22 16:35                   ` Trond Myklebust
2008-09-22 17:04                     ` Aaron Straus
2008-09-22 17:26                       ` Chuck Lever
2008-09-22 17:37                         ` Aaron Straus
2008-09-22 17:29                       ` Trond Myklebust
2008-09-22 17:45                         ` Aaron Straus
2008-09-22 18:43                           ` Aaron Straus
2008-09-22 18:45                           ` Hans-Peter Jansen
2008-09-22 18:45                     ` Hans-Peter Jansen
