* blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
@ 2008-09-05 19:19 Aaron Straus
  2008-09-05 19:56 ` [NFS] " Chuck Lever
  0 siblings, 1 reply; 25+ messages in thread
From: Aaron Straus @ 2008-09-05 19:19 UTC (permalink / raw)
To: neilb, nfs, trond.myklebust, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1779 bytes --]

Hi all,

We're hitting some bad behavior in NFS v3. The situation is this:

  machine A - NFS server
  machine B - NFS client (writer)
  machine C - NFS client (reader)

(all machines x86 SMP)

Machine A exports a directory on an ext3 filesystem:

  /srv/home 192.168.0.0/24(rw,sync,no_subtree_check)

Machines B and C mount that directory normally:

  mount A:/srv/home /mntpnt

Machine B opens a file and writes to it (think a log file). Machine C
stats that file, opens it, and reads it (think tailing the log file).

The issue is that machine C will often see large blocks of NULLs
(zeros) in the file. If you do the same read again just after you see
the block of NULLs, you will see the proper data.

Attached are two simple Python programs that demonstrate the problem.
To use them (they will write to a file called test-nfs in the CWD):

  (on machine B in one window)     python writer.py
  (on machine C in another window) python reader.py

reader.py will die when it sees NULLs in the file. Usually for us this
happens after about 60s (two timeouts, I think). The first NULL is
usually either at index 4000 or 8000, depending on the kernel.

The version of the kernel the server is running doesn't seem to matter.
The reader also doesn't seem to matter (though I didn't test this
completely). The writer seems to be the issue:

  Writer_Version   Outcome
  <= 2.6.19        OK
  >= 2.6.20        BAD

I've tested both vanilla kernel.org kernels and Ubuntu 8.04 kernels. I
can try to bisect between 2.6.19 <-> 2.6.20.

Anyone else hitting this problem? Any better ideas?
Thanks,
  =a=
--
===================
Aaron Straus
aaron@merfinllc.com

[-- Attachment #2: reader.py --]
[-- Type: text/x-python, Size: 844 bytes --]

#!/usr/bin/env python
import time
import os

def run(file_name):
    data = ''
    last_data_len = len(data)
    last_size = os.stat(file_name).st_size
    while True:
        st_size = os.stat(file_name).st_size
        if st_size != last_size:
            print 'size changed @ %s' % time.asctime()
            data = open(file_name).read()
        if len(data) != last_data_len:
            print 'new data arrived @ %s' % time.asctime()
            print repr(data[-50:])
        if '\0' in data:
            first_index = data.index('\0')
            data_fragment = data[first_index-50:first_index+50]
            print 'Detected NULL @ %d %s' % (first_index, repr(data_fragment))
            break
        time.sleep(0.250)
        last_data_len = len(data)
        last_size = st_size

if __name__ == '__main__':
    run('test-nfs')

[-- Attachment #3: writer.py --]
[-- Type: text/x-python, Size: 257 bytes --]

#!/usr/bin/env python
import time

def run(file_name):
    fp = open(file_name, 'w')
    count = 1
    while count:
        fp.write('meow\n' * 800)
        fp.flush()
        time.sleep(32)
    fp.close()

if __name__ == '__main__':
    run('test-nfs')

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-05 19:19 blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 Aaron Straus @ 2008-09-05 19:56 ` Chuck Lever 2008-09-05 20:04 ` Aaron Straus ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Chuck Lever @ 2008-09-05 19:56 UTC (permalink / raw) To: Aaron Straus Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel [ replacing cc: nfs@sf.net with linux-nfs@vger.kernel.org, and neil's old address with his current one ] On Sep 5, 2008, at Sep 5, 2008, 3:19 PM, Aaron Straus wrote: > Hi all, > > We're hitting some bad behavior in NFS v3. The situation is this: > > machine A - NFS server > > machine B - NFS client (writer) > machine C - NFS client (reader) > > (all machines x86 SMP) > > machine A exports a directory on ext3 filesystem: > > /srv/home 192.168.0.0/24(rw,sync,no_subtree_check) > > machines B and C mount that directory normally > > mount A:/srv/home /mntpnt > > machine B opens a file and writes to it (think a log file) > > machine C stats that file, opens it and reads it (think tailing the > log file) > > > The issue is that machine C will often see large blocks of NULLs > (zeros) in the file. If you do the same read again just after you see > the block of NULLs you will see proper the data. > > Attached are two simple python programs that demonstrate the problem. > > To use them (they will write to a file called test-nfs in CWD): > > (on machine B in one window) > > python writer.py > > (on machine C in another window) > > python reader.py > > > reader.py will die when it sees NULLs in the file. Usually for us > this happens after about 60s (two timeouts I think). The first > NULL is > usually either at index 4000 or 8000 depending on the kernel. > > > Now the version of the kernel the server is running doesn't seem to > matter. The reader also doesn't seem to matter (though I didn't test > this completely). 
The writer seems to be the issue: > > Writer_Version Outcome: > <= 2.6.19 OK > >= 2.6.20 BAD Up to which kernel? Recent ones may address this issue already. > I've tested both vanilla kernel.org kernels and Ubuntu 8.04 kernels. > > I can try to bisect between 2.6.19 <-> 2.6.20. That's a good start. Comparing a wire trace with strace output, starting with the writing client, might also be illuminating. We prefer wireshark as it uses good default trace settings, parses the wire bytes and displays them coherently, and allows you to sort the frames in various useful ways. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-05 19:56 ` [NFS] " Chuck Lever @ 2008-09-05 20:04 ` Aaron Straus 2008-09-05 20:36 ` Bernd Eckenfels 2008-09-05 20:36 ` Chuck Lever 2008-09-06 0:03 ` Aaron Straus 2008-09-08 19:02 ` Aaron Straus 2 siblings, 2 replies; 25+ messages in thread From: Aaron Straus @ 2008-09-05 20:04 UTC (permalink / raw) To: Chuck Lever Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel Hi, On Sep 05 03:56 PM, Chuck Lever wrote: > [ replacing cc: nfs@sf.net with linux-nfs@vger.kernel.org, and neil's > old address with his current one ] Sorry I probably grabbed an old MAINTAINERS file. > On Sep 5, 2008, at Sep 5, 2008, 3:19 PM, Aaron Straus wrote: > > Writer_Version Outcome: > > <= 2.6.19 OK > > >= 2.6.20 BAD > > Up to which kernel? Recent ones may address this issue already. BAD up to 2.6.27-rc? I have to see exactly which is the last rc version I tested. > > I can try to bisect between 2.6.19 <-> 2.6.20. > > That's a good start. OK will try to bisect. > Comparing a wire trace with strace output, starting with the writing > client, might also be illuminating. We prefer wireshark as it uses > good default trace settings, parses the wire bytes and displays them > coherently, and allows you to sort the frames in various useful ways. OK. Could you also try to reproduce on your side using those python programs? I want to make sure it's not something specific with our mounts, etc. Thanks! =a= -- =================== Aaron Straus aaron@merfinllc.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-05 20:04 ` Aaron Straus
@ 2008-09-05 20:36 ` Bernd Eckenfels
  2008-09-05 20:36 ` Chuck Lever
  1 sibling, 0 replies; 25+ messages in thread
From: Bernd Eckenfels @ 2008-09-05 20:36 UTC (permalink / raw)
To: linux-kernel

In article <20080905200455.GH22796@merfinllc.com> you wrote:
> OK. Could you also try to reproduce on your side using those python
> programs? I want to make sure it's not something specific with our
> mounts, etc.

Sounds like the inode sync comes before the data sync. I suspect that's
a client-side "problem". Maybe an fsync on the dir or something while
the file is open?

Bernd
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-05 20:04 ` Aaron Straus
  2008-09-05 20:36 ` Bernd Eckenfels
@ 2008-09-05 20:36 ` Chuck Lever
  2008-09-05 22:14 ` Aaron Straus
  1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2008-09-05 20:36 UTC (permalink / raw)
To: Aaron Straus
Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

On Sep 5, 2008, at 4:04 PM, Aaron Straus wrote:
> Hi,
>
> On Sep 05 03:56 PM, Chuck Lever wrote:
>> [ replacing cc: nfs@sf.net with linux-nfs@vger.kernel.org, and neil's
>> old address with his current one ]
>
> Sorry I probably grabbed an old MAINTAINERS file.
>
>> On Sep 5, 2008, at 3:19 PM, Aaron Straus wrote:
>>> Writer_Version Outcome:
>>> <= 2.6.19 OK
>>> >= 2.6.20 BAD
>>
>> Up to which kernel? Recent ones may address this issue already.
>
> BAD up to 2.6.27-rc?
>
> I have to see exactly which is the last rc version I tested.
>
>>> I can try to bisect between 2.6.19 <-> 2.6.20.
>>
>> That's a good start.
>
> OK will try to bisect.
>
>> Comparing a wire trace with strace output, starting with the writing
>> client, might also be illuminating. We prefer wireshark as it uses
>> good default trace settings, parses the wire bytes and displays them
>> coherently, and allows you to sort the frames in various useful ways.
>
> OK. Could you also try to reproduce on your side using those python
> programs? I want to make sure it's not something specific with our
> mounts, etc.

I have the latest Fedora 9 kernels on two clients, mounting via NFSv3
using "actimeo=600" (for other reasons). The server is OpenSolaris
2008.05.

reader.py reported zeroes in the test file after about 5 minutes.
Looking at the file a little later, I don't see any problems with it.

Since your scripts are not using any kind of serialization (i.e. file
locking) between the clients, I wonder if non-deterministic behavior is
to be expected.
-- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-05 20:36 ` Chuck Lever @ 2008-09-05 22:14 ` Aaron Straus 0 siblings, 0 replies; 25+ messages in thread From: Aaron Straus @ 2008-09-05 22:14 UTC (permalink / raw) To: Chuck Lever Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel [-- Attachment #1: Type: text/plain, Size: 1360 bytes --] Hi, On Sep 05 04:36 PM, Chuck Lever wrote: > I have the latest Fedora 9 kernels on two clients, mounting via NFSv3 > using "actimeo=600" (for other reasons). The server is OpenSolaris > 2008.5. > > reader.py reported zeroes in the test file after about 5 minutes. Awesome. Thanks for testing! Our actime is much shorter which is probably why it happens sooner for us. > Looking at the file a little later, I don't see any problems with it. > > Since your scripts are not using any kind of serialization (ie file > locking) between the clients, I wonder if non-determinant behavior is > to be expected. Hmm... yep. I don't know what guarantees we want to make. The behavior doesn't seem to be consistent with older kernels though... so I'm thinking it might be a bug. We hit this particular issue because we have scripts which essentially 'tail -f' log files looking for errors. They miss log messages (and see corrupted ones) b/c of the NULLs. That's also why there is no serialization.... we don't need it when grep'ing through log messages. I'm bisecting now. I see a block of intricate-looking NFS patches, I'll try to narrow it down to a particular commit. I'll also get the wireshark data at that point. Thanks, =a= -- =================== Aaron Straus aaron@merfinllc.com [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 191 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-05 19:56 ` [NFS] " Chuck Lever 2008-09-05 20:04 ` Aaron Straus @ 2008-09-06 0:03 ` Aaron Straus 2008-09-08 19:02 ` Aaron Straus 2 siblings, 0 replies; 25+ messages in thread From: Aaron Straus @ 2008-09-06 0:03 UTC (permalink / raw) To: Chuck Lever Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel [-- Attachment #1: Type: text/plain, Size: 800 bytes --] On Sep 05 03:56 PM, Chuck Lever wrote: > > I can try to bisect between 2.6.19 <-> 2.6.20. > > That's a good start. Hi, OK. Bisected. This is the commit where we start to see blocks of NULLs in NFS files. e261f51f25b98c213e0b3d7f2109b117d714f69d is first bad commit commit e261f51f25b98c213e0b3d7f2109b117d714f69d Author: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Tue Dec 5 00:35:41 2006 -0500 NFS: Make nfs_updatepage() mark the page as dirty. This will ensure that we can call set_page_writeback() from within nfs_writepage(), which is always called with the page lock set. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Thanks, =a= -- =================== Aaron Straus aaron@merfinllc.com [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 191 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-05 19:56 ` [NFS] " Chuck Lever
  2008-09-05 20:04 ` Aaron Straus
  2008-09-06  0:03 ` Aaron Straus
@ 2008-09-08 19:02 ` Aaron Straus
  2008-09-08 21:15 ` Chuck Lever
  2 siblings, 1 reply; 25+ messages in thread
From: Aaron Straus @ 2008-09-08 19:02 UTC (permalink / raw)
To: Chuck Lever
Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

[-- Attachment #1: Type: text/plain, Size: 1003 bytes --]

Hi,

On Sep 05 03:56 PM, Chuck Lever wrote:
> Comparing a wire trace with strace output, starting with the writing
> client, might also be illuminating. We prefer wireshark as it uses
> good default trace settings, parses the wire bytes and displays them
> coherently, and allows you to sort the frames in various useful ways.

OK, in addition to the bisection I've collected trace data for the good
(commit 4d770ccf4257b23a7ca2a85de1b1c22657b581d8) and bad (commit
e261f51f25b98c213e0b3d7f2109b117d714f69d) cases.

Attached is a file called trace.tar.bz2; inside you'll find 4 files,
for the two sessions:

  bad-wireshark
  bad-strace
  good-wireshark
  good-strace

From a quick glance, the difference seems to be that the bad case does
an UNSTABLE NFS WRITE call. I don't really know what that means or what
its semantics are... but that bad commit does seem to introduce this
regression.

Anything else I can provide?

Thanks!
  =a=
--
===================
Aaron Straus
aaron@merfinllc.com

[-- Attachment #2: trace.tar.bz2 --]
[-- Type: application/octet-stream, Size: 13979 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-08 19:02 ` Aaron Straus @ 2008-09-08 21:15 ` Chuck Lever 2008-09-08 22:02 ` Aaron Straus 2008-09-09 19:46 ` Aaron Straus 0 siblings, 2 replies; 25+ messages in thread From: Chuck Lever @ 2008-09-08 21:15 UTC (permalink / raw) To: Aaron Straus Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel On Mon, Sep 8, 2008 at 3:02 PM, Aaron Straus <aaron@merfinllc.com> wrote: > On Sep 05 03:56 PM, Chuck Lever wrote: >> Comparing a wire trace with strace output, starting with the writing >> client, might also be illuminating. We prefer wireshark as it uses >> good default trace settings, parses the wire bytes and displays them >> coherently, and allows you to sort the frames in various useful ways. > > OK in addition to the bisection I've collected trace data for the good > (commit 4d770ccf4257b23a7ca2a85de1b1c22657b581d8) and bad (commit > e261f51f25b98c213e0b3d7f2109b117d714f69d) cases. > > Attached is a file called trace.tar.bz2 inside you'll find 4 files, for > the two sessions: > > bad-wireshark > bad-strace > > good-wireshark > good-strace > > From a quick glance the difference seems to be the bad case does an > UNSTABLE NFS WRITE call. I don't really know what that means or what > its semantics are... but that bad commit does seem to introduce this > regression. A little analysis reveals the bad behavior on the wire. I filtered the network traces for just WRITE requests. There were six in the good trace, and five in the bad. This seems to correspond with the strace logs, which show that you stopped the bad run before it generated more writes. 
Good:

  time:        offset:  len:    sync:       byte range:
  35.402734    0        8000    FILE_SYNC   [0, 7999]
  100.405170   4096     11904   FILE_SYNC   [4096, 15999]
  160.407822   12288    7712    FILE_SYNC   [12288, 19999]
  195.411035   16384    11616   FILE_SYNC   [16384, 27999]
  260.413574   24576    11424   FILE_SYNC   [24576, 35999]
  295.704044   32768    7232    FILE_SYNC   [32768, 39999]

Bad:

  time:        offset:  len:    sync:       byte range:
  36.989907    0        8000    FILE_SYNC   [0, 7999]
  101.991164   4096     11904   FILE_SYNC   [4096, 15999]
  161.992447   12288    7712    FILE_SYNC   [12288, 19999]
  166.992203   20480    3520    UNSTABLE    [20480, 23999]
  169.181642   16384    4096    FILE_SYNC   [16384, 20479]

The first thing I notice is that the client is re-writing areas of the
file. It seems to want to generate page-aligned WRITEs, but this means
it writes over bytes that were already sent to the server. Probably not
the issue here, but that's certainly not optimal behavior. The strace
logs show that the application generates 4000-byte write(2) calls,
without any operation interleaved that might change the seek offset. I
might go so far as to say that this is incorrect behavior, as multiple
clients writing to a file not on page boundaries would clobber each
other's WRITEs.

The second thing is that UNSTABLE WRITE. The fact that it is UNSTABLE
is not the problem here. The byte range is what is most interesting: it
leaves a hole in the file, which is fixed by the next write (3 seconds
later), which fills in that hole. Normally this isn't a big deal, as
NFS likes applications to serialize access to shared files. However, in
this case we expect the .flush() to force all dirty bytes out at once,
and this is definitely not occurring....

The third thing is that I don't see any flush or fsync system calls in
the strace logs, even though the writer.py script invokes fp.flush().
I'm not terribly familiar with Python's flush(). How do you expect the
flush() method to work?

I wonder why the client is using FILE_SYNC writes and not
UNSTABLE+COMMIT. Are the clients using the "sync" mount option?
I don't see O_SYNC on the open. open("test-nfs", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3 I think I saw some recent work in Trond's development branch that makes some changes in this area. I will wait for him to respond to this thread. -- Chuck Lever ^ permalink raw reply [flat|nested] 25+ messages in thread
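[The hole Chuck describes can be illustrated locally with a sparse file. This is an editor's sketch, not part of the original thread; the file name is arbitrary. Data is contiguous through offset 19999, then an out-of-order write lands at [20480, 23999] before the page tail [20000, 20479] is flushed; POSIX defines such holes to read back as zero bytes, which is exactly what the reader sees.]

```python
import os

def sparse_demo(file_name):
    fp = open(file_name, 'wb')
    # Contiguous data through offset 19999, as in the FILE_SYNC writes above.
    fp.write(b'x' * 20000)
    # The out-of-order write lands at [20480, 23999], leaving a 480-byte
    # hole at [20000, 20479] that has never been written.
    fp.seek(20480)
    fp.write(b'y' * 3520)
    fp.close()
    data = open(file_name, 'rb').read()
    os.remove(file_name)
    return data[20000:20480]  # the hole reads back as NUL bytes

# sparse_demo('hole-demo') == b'\x00' * 480
```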
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-08 21:15 ` Chuck Lever
@ 2008-09-08 22:02 ` Aaron Straus
  0 siblings, 0 replies; 25+ messages in thread
From: Aaron Straus @ 2008-09-08 22:02 UTC (permalink / raw)
To: chucklever
Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

[-- Attachment #1: Type: text/plain, Size: 1535 bytes --]

Hi Chuck,

On Sep 08 05:15 PM, Chuck Lever wrote:
> This seems to correspond with the strace logs, which show that you
> stopped the bad run before it generated more writes.

Yes. Sorry. I stopped the writer once the reader saw the NULL bytes.

> Normally this isn't a big deal, as NFS likes applications to serialize
> access to shared files. However, in this case we expect the .flush()
> to force all dirty bytes out at once, and this is definitely not
> occurring....
>
> The third thing is that I don't see any flush or fsync system calls in
> the strace logs, even though the writer.py script invokes fp.flush().
> I'm not terribly familiar with Python's flush(). How do you expect
> the flush() method to work?

I think fp.flush() in Python is equivalent to fflush(), i.e. it only
flushes the stdio buffer. It doesn't actually call fsync or fdatasync
or anything like that. We can pretty easily change the Python code to
use the syscall write()/etc. methods.

> I wonder why the client is using FILE_SYNC writes and not
> UNSTABLE+COMMIT. Are the clients using the "sync" mount option? I
> don't see O_SYNC on the open.

I've exported them on the server as sync, e.g. in exports:

  /export 192.168.0.0/24(rw,sync)

However, I don't give any special options when mounting, e.g. on the
client:

  mount machine:/export mntpnt

Anyway, thanks for looking at this and let me know if I can do anything
else!
=a= -- =================== Aaron Straus aaron@merfinllc.com [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 191 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
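[For reference, a minimal sketch of a writer that issues a real fsync(2) after each stdio-level flush. This is an editor's illustration of the change discussed above, not the writer_sync.py script Aaron attached later in the thread; the iteration count and delay are arbitrary.]

```python
import os
import time

def run(file_name, iterations=5, delay=0.0):
    fp = open(file_name, 'w')
    for _ in range(iterations):
        fp.write('meow\n' * 800)
        fp.flush()             # drain Python's userspace buffer to the kernel
        os.fsync(fp.fileno())  # ask the kernel to commit the dirty pages
        time.sleep(delay)
    fp.close()

if __name__ == '__main__':
    run('test-nfs')
```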
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-08 21:15 ` Chuck Lever
  2008-09-08 22:02 ` Aaron Straus
@ 2008-09-09 19:46 ` Aaron Straus
  2008-09-11 16:55 ` Chuck Lever
  1 sibling, 1 reply; 25+ messages in thread
From: Aaron Straus @ 2008-09-09 19:46 UTC (permalink / raw)
To: chucklever
Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel

[-- Attachment #1: Type: text/plain, Size: 1252 bytes --]

Hi,

On Sep 08 05:15 PM, Chuck Lever wrote:
> I think I saw some recent work in Trond's development branch that
> makes some changes in this area. I will wait for him to respond to
> this thread.

One other piece of information.

Of the bisected offending commit:

commit e261f51f25b98c213e0b3d7f2109b117d714f69d
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Tue Dec 5 00:35:41 2006 -0500

    NFS: Make nfs_updatepage() mark the page as dirty.

    This will ensure that we can call set_page_writeback() from within
    nfs_writepage(), which is always called with the page lock set.

    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

It seems to be this hunk which introduces the problem:

@@ -628,7 +667,6 @@ static struct nfs_page * nfs_update_request(struct nfs_open_context* ctx,
 			return ERR_PTR(error);
 		}
 		spin_unlock(&nfsi->req_lock);
-		nfs_mark_request_dirty(new);
 		return new;
 	}
 	spin_unlock(&nfsi->req_lock);

If I add that function call back in... the problem disappears. I don't
know if this just papers over the real problem though?

Thanks,
  =a=
--
===================
Aaron Straus
aaron@merfinllc.com

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 191 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-09 19:46 ` Aaron Straus @ 2008-09-11 16:55 ` Chuck Lever 2008-09-11 17:19 ` Aaron Straus 0 siblings, 1 reply; 25+ messages in thread From: Chuck Lever @ 2008-09-11 16:55 UTC (permalink / raw) To: Aaron Straus Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel On Sep 9, 2008, at Sep 9, 2008, 3:46 PM, Aaron Straus wrote: > Hi, > > On Sep 08 05:15 PM, Chuck Lever wrote: >> I think I saw some recent work in Trond's development branch that >> makes some changes in this area. I will wait for him to respond to >> this thread. > > One other piece of information. > > Of the bisected offending commit: > > commit e261f51f25b98c213e0b3d7f2109b117d714f69d > Author: Trond Myklebust <Trond.Myklebust@netapp.com> > Date: Tue Dec 5 00:35:41 2006 -0500 > > NFS: Make nfs_updatepage() mark the page as dirty. > > This will ensure that we can call set_page_writeback() from within > nfs_writepage(), which is always called with the page lock set. > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> > > > It seems to be this hunk which introduces the problem: > > > @@ -628,7 +667,6 @@ static struct nfs_page * > nfs_update_request(struct nfs_open_context* ctx, > return ERR_PTR(error); > } > spin_unlock(&nfsi->req_lock); > - nfs_mark_request_dirty(new); > return new; > } > spin_unlock(&nfsi->req_lock); > > > > If I add that function call back in... the problem disappears. I > don't > know if this just papers over the real problem though? It's possible that the NFS write path as it exists after this commit is not marking certain pages dirty at the same time as others. An asynchronous flush call from the VM would catch some of the dirty pages, then later, others, resulting in the pages flushing to the server out of order. However, I poked around a little bit in late model kernels yesterday and found that the changes from Trond I referred to earlier are already in 2.6.27. 
These rework this area significantly, so reverting this hunk alone will be more difficult in recent kernels. In addition I notice that one purpose of this commit is to change the page locking scheme in the NFS client's write path. I wouldn't advise simply adding this nfs_mark_request_dirty() call back in. It could result in deadlocks for certain workloads (perhaps not your specific workload, though). A more thorough review of the NFS write and flush logic that exists in 2.6.27 is needed if we choose to recognize this issue as a real problem. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-11 16:55 ` Chuck Lever @ 2008-09-11 17:19 ` Aaron Straus 2008-09-11 17:48 ` Chuck Lever 0 siblings, 1 reply; 25+ messages in thread From: Aaron Straus @ 2008-09-11 17:19 UTC (permalink / raw) To: Chuck Lever Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel Hi, On Sep 11 12:55 PM, Chuck Lever wrote: > A more thorough review of the NFS write and flush logic that exists in > 2.6.27 is needed if we choose to recognize this issue as a real > problem. Yep. Sorry. I didn't mean we should revert the hunk. I was just trying to help identify the cause of the new behavior. I think this is a real problem albeit not a "serious" one. Network file-systems usually try to avoid readers seeing blocks of zeros in files, especially in this simple writer/reader case. It wouldn't be bad if the file is written out of order occasionally, but we see this constantly now. We cannot write/read log files to NFS mounts reliably any more. That seems like a valid use case which no longer works? Anyway, thanks for your time! =a= -- =================== Aaron Straus aaron@merfinllc.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-11 17:19 ` Aaron Straus @ 2008-09-11 17:48 ` Chuck Lever 2008-09-11 18:49 ` Aaron Straus 0 siblings, 1 reply; 25+ messages in thread From: Chuck Lever @ 2008-09-11 17:48 UTC (permalink / raw) To: Aaron Straus Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel On Sep 11, 2008, at Sep 11, 2008, 1:19 PM, Aaron Straus wrote: > Hi, > > On Sep 11 12:55 PM, Chuck Lever wrote: >> A more thorough review of the NFS write and flush logic that exists >> in >> 2.6.27 is needed if we choose to recognize this issue as a real >> problem. > > Yep. > > Sorry. I didn't mean we should revert the hunk. I was just trying to > help identify the cause of the new behavior. > > I think this is a real problem albeit not a "serious" one. Network > file-systems usually try to avoid readers seeing blocks of zeros in > files, especially in this simple writer/reader case. > > It wouldn't be bad if the file is written out of order occasionally, > but > we see this constantly now. > > We cannot write/read log files to NFS mounts reliably any more. That > seems like a valid use case which no longer works? Were you able to modify your writer to do real fsync system calls? If so, did it help? That would be a useful data point. NFS uses close-to-open cache coherency. Thus we expect the file to be consistent and up to date after it is opened by a client, but not necessarily if some other client writes to it after it was opened. We usually recommend file locking and direct I/O to minimize these problems. Practically speaking this is often not enough for typical applications, so NFS client implementations go to further (non- standard) efforts to behave like a local file system. This is simply a question of whether we can address this while not creating performance or correctness issues for other common use cases. Anyway, I'm not the NFS client maintainer, so this decision is not up to me. 
-- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 25+ messages in thread
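[As an editor's sketch of the kind of serialization Chuck recommends: POSIX advisory locks via Python's fcntl module, which the Linux NFS client forwards to the server as NLM lock requests. Function names are hypothetical, and whether this would fully mask the symptom in this thread is untested.]

```python
import fcntl

def append_locked(file_name, data):
    fp = open(file_name, 'a')
    fcntl.lockf(fp, fcntl.LOCK_EX)   # take an exclusive advisory lock
    try:
        fp.write(data)
        fp.flush()                   # data reaches the kernel before we unlock
    finally:
        fcntl.lockf(fp, fcntl.LOCK_UN)
        fp.close()

def read_locked(file_name):
    fp = open(file_name)
    fcntl.lockf(fp, fcntl.LOCK_SH)   # shared lock: wait out in-flight writers
    try:
        return fp.read()
    finally:
        fcntl.lockf(fp, fcntl.LOCK_UN)
        fp.close()
```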
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-11 17:48 ` Chuck Lever @ 2008-09-11 18:49 ` Aaron Straus 2008-09-22 16:05 ` Hans-Peter Jansen 0 siblings, 1 reply; 25+ messages in thread From: Aaron Straus @ 2008-09-11 18:49 UTC (permalink / raw) To: Chuck Lever Cc: Neil Brown, Linux NFS Mailing List, Trond Myklebust, LKML Kernel [-- Attachment #1: Type: text/plain, Size: 1183 bytes --] Hi, On Sep 11 01:48 PM, Chuck Lever wrote: > Were you able to modify your writer to do real fsync system calls? If > so, did it help? That would be a useful data point. Yes/Yes. Please see the attached tarball sync-test.tar.bz2. Inside you'll find the modified writer: writer_sync.py That will call fsync on the file descriptor after each write. Also I added the strace and wireshark data for the sync and nosync cases. Note these are all tested with latest linus git: d1c6d2e547148c5aa0c0a4ff6aac82f7c6da1d8b > Practically speaking this is often not enough for typical > applications, so NFS client implementations go to further (non- > standard) efforts to behave like a local file system. This is simply > a question of whether we can address this while not creating > performance or correctness issues for other common use cases. Yep, I agree. I'm not saying what we do now is "wrong" per the RFC (writing the file out of order). It's just different from what we've done in the past (and somewhat unexpected). I'm still hoping there is a simple fix... but maybe not... :( Thanks! =a= -- =================== Aaron Straus aaron@merfinllc.com [-- Attachment #2: sync-test.tar.bz2 --] [-- Type: application/octet-stream, Size: 15200 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20
  2008-09-11 18:49 ` Aaron Straus
@ 2008-09-22 16:05 ` Hans-Peter Jansen
  2008-09-22 16:35 ` Trond Myklebust
  0 siblings, 1 reply; 25+ messages in thread
From: Hans-Peter Jansen @ 2008-09-22 16:05 UTC (permalink / raw)
To: linux-kernel
Cc: Aaron Straus, Chuck Lever, Neil Brown, Linux NFS Mailing List, Trond Myklebust

Hi Aaron, hi NFS hackers,

On Thursday, 11 September 2008, Aaron Straus wrote:
> Hi,
>
> On Sep 11 01:48 PM, Chuck Lever wrote:
> > Were you able to modify your writer to do real fsync system calls? If
> > so, did it help? That would be a useful data point.
>
> Yes/Yes. Please see the attached tarball sync-test.tar.bz2.
>
> Inside you'll find the modified writer: writer_sync.py
>
> That will call fsync on the file descriptor after each write.
>
> Also I added the strace and wireshark data for the sync and nosync
> cases.
>
> Note these are all tested with latest linus git:
>
> d1c6d2e547148c5aa0c0a4ff6aac82f7c6da1d8b
>
> > Practically speaking this is often not enough for typical
> > applications, so NFS client implementations go to further (non-
> > standard) efforts to behave like a local file system. This is simply
> > a question of whether we can address this while not creating
> > performance or correctness issues for other common use cases.
>
> Yep, I agree. I'm not saying what we do now is "wrong" per the RFC
> (writing the file out of order). It's just different from what we've
> done in the past (and somewhat unexpected).
>
> I'm still hoping there is a simple fix... but maybe not... :(

For what it's worth, this behavior is visible in bog-standard
writing/reading of files (log files in my case, via the Python logging
package). It obviously deviates from local filesystem behavior, and
from the former state of the Linux NFS client.
Should we add patches to less, tail, and all the other tools used for
watching/analysing log files (just to pick the tip of the iceberg) in
order to throw away runs of zeros when reading from NFS-mounted files?
Or should we ask their maintainers to add locking code for the NFS
"read files which are being written at the same time" case, just to
work around __some__ of the consequences of this bug? Imagine how ugly
this is going to look!

The whole issue is what I call a major regression, thus I strongly ask
for a reply from Trond on this matter.

I even vote for sending a revert request for this hunk to the stable
team, where it is applicable, after Trond sorts it out (for 2.6.27?).

Thanks, Aaron and Chuck, for the detailed analysis - it demystified a
weird behavior I observed here. When you're in the process of getting
real work done on a fixed timeline, such things can make you mad...

Cheers,
Pete
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-22 16:05 ` Hans-Peter Jansen @ 2008-09-22 16:35 ` Trond Myklebust 2008-09-22 17:04 ` Aaron Straus 2008-09-22 18:45 ` Hans-Peter Jansen 0 siblings, 2 replies; 25+ messages in thread From: Trond Myklebust @ 2008-09-22 16:35 UTC (permalink / raw) To: Hans-Peter Jansen Cc: linux-kernel, Aaron Straus, Chuck Lever, Neil Brown, Linux NFS Mailing List On Mon, 2008-09-22 at 18:05 +0200, Hans-Peter Jansen wrote: > For what is worth, this behavior is visible in bog standard writing/reading > files, (log files in my case, via the python logging package). It obviously > deviates from local filesystem behavior, and former state of the linux > nfs-client. Should we add patches to less, tail, and all other instruments > for watching/analysing log files (just to pick the tip of the ice rock) in > order to throw away runs of zeros, when reading from nfs mounted files? Or > should we ask their maintainers to add locking code for the nfs "read > files, which are written at the same time" case, just to work around > __some__ of the consequences of this bug? Imagine, how ugly this is going > to look! > > The whole issue is what I call a major regression, thus I strongly ask for a > reply from Trond on this matter. > > I even vote for sending a revert request for this hunk to the stable team, > where it is applicable, after Trond sorted it out (for 2.6.27?). > > Thanks, Aaron and Chuck for the detailed analysis - it demystified a wired > behavior, I observed here. When you're in a process to get real work done > in a fixed timeline, such things could make you mad.. Revert _what_ exactly? Please assume that I've been travelling for the past 5 weeks, and have only a sketchy idea of what has been going on. My understanding was that this is a consequence of unordered writes causing the file to be extended while some other task is reading. AFAICS, this sort of behaviour has _always_ been possible. 
I can't see how reverting anything will fix it. Trond ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-22 16:35 ` Trond Myklebust @ 2008-09-22 17:04 ` Aaron Straus 2008-09-22 17:26 ` Chuck Lever 2008-09-22 17:29 ` Trond Myklebust 2008-09-22 18:45 ` Hans-Peter Jansen 0 siblings, 2 replies; 25+ messages in thread From: Aaron Straus @ 2008-09-22 17:04 UTC (permalink / raw) To: Trond Myklebust Cc: Hans-Peter Jansen, linux-kernel, Chuck Lever, Neil Brown, Linux NFS Mailing List Hi, On Sep 22 12:35 PM, Trond Myklebust wrote: > Revert _what_ exactly? Yep. I narrowed the problem down to an offending hunk in a particular patch. Removing that hunk did eliminate the problem. However, reverting that hunk is likely wrong and the code has changed _a lot_ since that commit. > My understanding was that this is a consequence of unordered writes > causing the file to be extended while some other task is reading. Yes. I added some debugging statements to look at the writeout path. I think the following happens: - page 0 gets partially written to by app - VM writes out partial dirty page 0 - page 0 gets fully written by app - page 1 gets partially written by app - VM writes out dirty page 1 At this point there is a hole in the file. The tail end of page 0 has still not been written to the server. - VM writes out dirty page 0 ... > AFAICS, this sort of behaviour has _always_ been possible. I can't see > how reverting anything will fix it. Here is the crux. It was possible previously but unlikely; e.g., our app never saw this behavior. The new writeout semantics often cause visible holes in files. Anyway, I agree the new writeout semantics are allowed and possibly saner than the previous writeout path. The problem is that it is __annoying__ for this use case (log files). I'm not sure if there is an easy solution. We want the VM to write out the address space in order. Maybe we can start the scan for dirty pages at the last page we wrote out, i.e. page 0 in the example above? 
=a= -- =================== Aaron Straus aaron@merfinllc.com ^ permalink raw reply [flat|nested] 25+ messages in thread
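The sequence Aaron describes can be modeled outside the kernel. The following is a toy Python simulation (mine, not from the thread; the 4096-byte page size and payload values are illustrative): when page 1 reaches the server before the tail of page 0, extending the file leaves a zero-filled hole, which is exactly what a racing reader sees.

```python
PAGE = 4096

def flush(server, page_no, data, offset=0):
    """Simulate a WRITE hitting the server file at page_no*PAGE + offset.
    Writing past the current EOF zero-fills the gap, as extending a
    sparse file does."""
    start = page_no * PAGE + offset
    if start > len(server):
        server.extend(b'\0' * (start - len(server)))  # the hole becomes zeros
    server[start:start + len(data)] = data

server = bytearray()
flush(server, 0, b'a' * 1000)                 # 1. VM flushes a partial page 0
flush(server, 1, b'c' * 500)                  # 2. page 1 flushed BEFORE page 0's tail
hole = bytes(server[1000:PAGE])               # a reader now sees a run of zeros
assert hole == b'\0' * (PAGE - 1000)
flush(server, 0, b'b' * (PAGE - 1000), 1000)  # 3. the tail of page 0 finally lands
assert b'\0' not in server                    # re-reading now shows proper data
```

This also shows why the reader sees correct data on a second read: once the delayed flush of page 0's tail arrives, the zeros are overwritten.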
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-22 17:04 ` Aaron Straus @ 2008-09-22 17:26 ` Chuck Lever 2008-09-22 17:37 ` Aaron Straus 2008-09-22 17:29 ` Trond Myklebust 1 sibling, 1 reply; 25+ messages in thread From: Chuck Lever @ 2008-09-22 17:26 UTC (permalink / raw) To: Aaron Straus Cc: Trond Myklebust, Hans-Peter Jansen, linux-kernel, Neil Brown, Linux NFS Mailing List On Mon, Sep 22, 2008 at 1:04 PM, Aaron Straus <aaron@merfinllc.com> wrote: > Hi, > > On Sep 22 12:35 PM, Trond Myklebust wrote: >> Revert _what_ exactly? > > Yep. I narrowed the problem down to an offending hunk in a particular > patch. Removing that hunk did eliminate the problem. However, > reverting that hunk is likely wrong and the code has changed _a lot_ > since that commit. > >> My understanding was that this is a consequence of unordered writes >> causing the file to be extended while some other task is reading. > > Yes. I added some debugging statements to look at the writeout path. > > I think the following happens: > > - page 0 gets partially written to by app > - VM writes out partial dirty page 0 > - page 0 gets fully written by app > - page 1 gets partially written by app > - VM writes out dirty page 1 > > At this point there is a hole in the file. The tail end of page 0 is > still not written to server. > > - VM writes out dirty page 0 > ... > >> AFAICS, this sort of behaviour has _always_ been possible. I can't see >> how reverting anything will fix it. > > Here is the crux. It was possible previously but unlikely e.g. our app > never saw this behavior. The new writeout semantics causes visible > holes in files often. > > Anyway, I agree the new writeout semantics are allowed and possibly > saner than the previous writeout path. The problem is that it is > __annoying__ for this use case (log files). > > I'm not sure if there is an easy solution. We want the VM to writeout > the address space in order. 
Maybe we can start the scan for dirty > pages at the last page we wrote out i.e. page 0 in the example above? Why can't you use O_SYNC | O_APPEND ? -- Chuck Lever ^ permalink raw reply [flat|nested] 25+ messages in thread
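A minimal sketch of what Chuck is suggesting, for writers that can be changed (the file name and payload are illustrative): O_SYNC makes each write() block until the data has been committed, which forces the NFS client to send the writes synchronously and in program order, and O_APPEND makes every write land at the current end of file.

```python
import os

# O_SYNC: write() does not return until the data is on stable storage,
# so no out-of-order delayed writeback can expose a hole to readers.
# O_APPEND: each write is positioned at the current EOF.
# O_TRUNC matches writer.py's open(file_name, 'w') behavior.
flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_APPEND | os.O_SYNC
fd = os.open('test-nfs', flags, 0o644)
try:
    os.write(fd, b'meow\n' * 800)  # durable before the call returns
finally:
    os.close(fd)
```

The trade-off is that every write now waits for a server round trip, i.e. write caching is lost.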
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-22 17:26 ` Chuck Lever @ 2008-09-22 17:37 ` Aaron Straus 0 siblings, 0 replies; 25+ messages in thread From: Aaron Straus @ 2008-09-22 17:37 UTC (permalink / raw) To: chucklever Cc: Trond Myklebust, Hans-Peter Jansen, linux-kernel, Neil Brown, Linux NFS Mailing List Hi, On Sep 22 01:26 PM, Chuck Lever wrote: > Why can't you use O_SYNC | O_APPEND ? In our case, some of the writers are not in our control. Another case where we see the issue is when you spawn a job: job 1> out 2> err & tail -f out err -- Also, we actually do like caching the writes (b/c log files do not need to be checked immediately after being written). We just wish the cache could be written out in order. If there's no way to make that happen... we can reorganize our filesystem, exports, mounts, etc. so that the log file directory is mounted with the sync option. However, the problem *might* be annoying and widespread enough to warrant a change. Anyway, thanks! Regards, =a= -- =================== Aaron Straus aaron@merfinllc.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-22 17:04 ` Aaron Straus 2008-09-22 17:26 ` Chuck Lever @ 2008-09-22 17:29 ` Trond Myklebust 2008-09-22 17:45 ` Aaron Straus 1 sibling, 1 reply; 25+ messages in thread From: Trond Myklebust @ 2008-09-22 17:29 UTC (permalink / raw) To: Aaron Straus Cc: Hans-Peter Jansen, linux-kernel, Chuck Lever, Neil Brown, Linux NFS Mailing List [-- Attachment #1: Type: text/plain, Size: 1122 bytes --] On Mon, 2008-09-22 at 10:04 -0700, Aaron Straus wrote: > Here is the crux. It was possible previously but unlikely e.g. our app > never saw this behavior. The new writeout semantics causes visible > holes in files often. > > Anyway, I agree the new writeout semantics are allowed and possibly > saner than the previous writeout path. The problem is that it is > __annoying__ for this use case (log files). There is always the option of using syslog. > I'm not sure if there is an easy solution. We want the VM to writeout > the address space in order. Maybe we can start the scan for dirty > pages at the last page we wrote out i.e. page 0 in the example above? You can never guarantee that in a multi-threaded environment. Two threads may, for instance, force 2 competing fsync() calls: that again may cause out-of-order writes. ...and even if the client doesn't reorder the writes, the _server_ may do it, since multiple nfsd threads may race when processing writes to the same file. Anyway, the patch to force a single threaded nfs client to write out the data in order is trivial. See attachment... 
Trond [-- Attachment #2: linux-2.6.27-ordered_writes.dif --] [-- Type: message/rfc822, Size: 749 bytes --] From: Trond Myklebust <Trond.Myklebust@netapp.com> Subject: No Subject Date: Message-ID: <1222104509.7615.22.camel@localhost> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/write.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 3229e21..eb6b211 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -1428,7 +1428,8 @@ static int nfs_write_mapping(struct address_space *mapping, int how) .sync_mode = WB_SYNC_NONE, .nr_to_write = LONG_MAX, .for_writepages = 1, - .range_cyclic = 1, + .range_start = 0, + .range_end = LLONG_MAX, }; int ret; ^ permalink raw reply related [flat|nested] 25+ messages in thread
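The patch's effect hinges on the difference between the two scan modes: with range_cyclic the writeback scan resumes at the page cursor left by the previous scan, so a page redirtied behind the cursor is only revisited after the scan wraps around, whereas an explicit range_start of 0 makes every scan begin at the start of the file. A toy model of the visit order (a simplification of mine, not the kernel's actual write_cache_pages() logic):

```python
def writeback_scan(dirty, cursor, range_cyclic):
    """Return (visit order, new cursor) for one writeback scan.
    dirty: sorted list of dirty page indices; cursor: resume position."""
    if range_cyclic:
        # resume after the previous scan's position, wrapping around
        order = [p for p in dirty if p >= cursor] + [p for p in dirty if p < cursor]
    else:
        # range_start = 0, range_end = LLONG_MAX: always start at page 0
        order = list(dirty)
    if range_cyclic and order:
        cursor = order[-1] + 1
    return order, cursor

# Aaron's scenario: a first scan flushed the partial page 0 and left the
# cursor at 1; the app then redirtied page 0's tail and dirtied page 1.
assert writeback_scan([0, 1], 1, range_cyclic=True)  == ([1, 0], 1)  # page 1 first: hole
assert writeback_scan([0, 1], 1, range_cyclic=False) == ([0, 1], 1)  # in file order
```

In the cyclic case page 1 is written before the redirtied page 0, which is the reordering that opens the hole; with the fixed range the pages go out in file order, at the cost of rescanning from the start each time.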
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-22 17:29 ` Trond Myklebust @ 2008-09-22 17:45 ` Aaron Straus 2008-09-22 18:43 ` Aaron Straus 2008-09-22 18:45 ` Hans-Peter Jansen 0 siblings, 2 replies; 25+ messages in thread From: Aaron Straus @ 2008-09-22 17:45 UTC (permalink / raw) To: Trond Myklebust Cc: Hans-Peter Jansen, linux-kernel, Chuck Lever, Neil Brown, Linux NFS Mailing List Hi, On Sep 22 01:29 PM, Trond Myklebust wrote: > > Anyway, I agree the new writeout semantics are allowed and possibly > > saner than the previous writeout path. The problem is that it is > > __annoying__ for this use case (log files). > > There is always the option of using syslog. Definitely. Everything in our control we can work around.... there are a few applications we cannot easily change... see the follow-up in another e-mail. > > I'm not sure if there is an easy solution. We want the VM to writeout > > the address space in order. Maybe we can start the scan for dirty > > pages at the last page we wrote out i.e. page 0 in the example above? > > You can never guarantee that in a multi-threaded environment. Definitely. This is a single writer, single reader case though. > Two threads may, for instance, force 2 competing fsync() calls: that > again may cause out-of-order writes. Yup. > ...and even if the client doesn't reorder the writes, the _server_ may > do it, since multiple nfsd threads may race when processing writes to > the same file. Yup. We're definitely not asking for anything like that. > Anyway, the patch to force a single threaded nfs client to write out the > data in order is trivial. See attachment... 
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > index 3229e21..eb6b211 100644 > --- a/fs/nfs/write.c > +++ b/fs/nfs/write.c > @@ -1428,7 +1428,8 @@ static int nfs_write_mapping(struct address_space *mapping, int how) > .sync_mode = WB_SYNC_NONE, > .nr_to_write = LONG_MAX, > .for_writepages = 1, > - .range_cyclic = 1, > + .range_start = 0, > + .range_end = LLONG_MAX, > }; > int ret; > Yeah, I was looking at that while debugging. Would that change have a chance of making it into mainline? I assume it makes the normal writeout path more expensive, by forcing a scan of the entire address space. Also, I should test this, but I thought the VM was calling nfs_writepages directly, i.e. not going through nfs_write_mapping. Let me test with this patch. Thanks, =a= -- =================== Aaron Straus aaron@merfinllc.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-22 17:45 ` Aaron Straus @ 2008-09-22 18:43 ` Aaron Straus 2008-09-22 18:45 ` Hans-Peter Jansen 1 sibling, 0 replies; 25+ messages in thread From: Aaron Straus @ 2008-09-22 18:43 UTC (permalink / raw) To: Trond Myklebust Cc: Hans-Peter Jansen, linux-kernel, Chuck Lever, Neil Brown, Linux NFS Mailing List Hi, On Sep 22 10:45 AM, Aaron Straus wrote: > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > > index 3229e21..eb6b211 100644 > > --- a/fs/nfs/write.c > > +++ b/fs/nfs/write.c > > @@ -1428,7 +1428,8 @@ static int nfs_write_mapping(struct address_space *mapping, int how) > > .sync_mode = WB_SYNC_NONE, > > .nr_to_write = LONG_MAX, > > .for_writepages = 1, > > - .range_cyclic = 1, > > + .range_start = 0, > > + .range_end = LLONG_MAX, > > }; > > int ret; > > > > Also, I should test this, but I thought the VM was calling > nfs_writepages directly i.e. not going through nfs_write_mapping. Let > me test with this patch. Yes. This patch doesn't seem to help. The VM is calling nfs_writepages directly. I have debug statements in the nfs_write_mapping() call and it hasn't been called when the hole appears. It's the VM dirty page writeout that is happening out of order, *not* nfs_wb_* stuff. That's why we can see it so easily. The dirty pages are only scanned periodically. In-between scans is when we see the hole. I think :). I'm obviously not an expert here. Thanks, =a= -- =================== Aaron Straus aaron@merfinllc.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-22 17:45 ` Aaron Straus 2008-09-22 18:43 ` Aaron Straus @ 2008-09-22 18:45 ` Hans-Peter Jansen 1 sibling, 0 replies; 25+ messages in thread From: Hans-Peter Jansen @ 2008-09-22 18:45 UTC (permalink / raw) To: linux-kernel Cc: Aaron Straus, Trond Myklebust, Chuck Lever, Neil Brown, Linux NFS Mailing List On Monday, 22 September 2008, Aaron Straus wrote: > Hi, > > On Sep 22 01:29 PM, Trond Myklebust wrote: > > > Anyway, I agree the new writeout semantics are allowed and possibly > > > saner than the previous writeout path. The problem is that it is > > > __annoying__ for this use case (log files). > > > > There is always the option of using syslog. > > Definitely. Everything in our control we can work around.... there are > a few applications we cannot easily change... see the follow-up in > another e-mail. > > > > I'm not sure if there is an easy solution. We want the VM to > > > writeout the address space in order. Maybe we can start the scan > > > for dirty pages at the last page we wrote out i.e. page 0 in the > > > example above? > > > > You can never guarantee that in a multi-threaded environment. > > Definitely. This is a single writer, single reader case though. ...where it happens that the reader gets chunks of zeros when reading a file that is being written by another (single-threaded) process. Note that going through syslog isn't an option in many cases unless we want to rewrite the "world" to work around this phenomenon; thus it's not simply annoying, as Aaron puts it: the "in order" approach is inevitable. > > Two threads may, for instance, force 2 competing fsync() calls: that > > again may cause out-of-order writes. > > Yup. > > > ...and even if the client doesn't reorder the writes, the _server_ may > > do it, since multiple nfsd threads may race when processing writes to > > the same file. > > Yup. We're definitely not asking for anything like that. 
> > > Anyway, the patch to force a single threaded nfs client to write out > > the data in order is trivial. See attachment... > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > > index 3229e21..eb6b211 100644 > > --- a/fs/nfs/write.c > > +++ b/fs/nfs/write.c > > @@ -1428,7 +1428,8 @@ static int nfs_write_mapping(struct address_space > > *mapping, int how) .sync_mode = WB_SYNC_NONE, > > .nr_to_write = LONG_MAX, > > .for_writepages = 1, > > - .range_cyclic = 1, > > + .range_start = 0, > > + .range_end = LLONG_MAX, > > }; > > int ret; > > Yeah I was looking at that while debugging. Would that change have > chance to make it into mainline? I assume it makes the normal writeout > path more expensive, by forcing a scan of the entire address space. If this patch solves the issue, it needs to be applied as soon as possible, as outlined above. > Also, I should test this, but I thought the VM was calling > nfs_writepages directly i.e. not going through nfs_write_mapping. Let > me test with this patch. Let us know about the outcome. Thanks, Pete ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 2008-09-22 16:35 ` Trond Myklebust 2008-09-22 17:04 ` Aaron Straus @ 2008-09-22 18:45 ` Hans-Peter Jansen 1 sibling, 0 replies; 25+ messages in thread From: Hans-Peter Jansen @ 2008-09-22 18:45 UTC (permalink / raw) To: linux-kernel Cc: Trond Myklebust, Aaron Straus, Chuck Lever, Neil Brown, Linux NFS Mailing List On Monday, 22 September 2008, Trond Myklebust wrote: > On Mon, 2008-09-22 at 18:05 +0200, Hans-Peter Jansen wrote: > > For what is worth, this behavior is visible in bog standard > > writing/reading files, (log files in my case, via the python logging > > package). It obviously deviates from local filesystem behavior, and > > former state of the linux nfs-client. Should we add patches to less, > > tail, and all other instruments for watching/analysing log files (just > > to pick the tip of the ice rock) in order to throw away runs of zeros, > > when reading from nfs mounted files? Or should we ask their maintainers > > to add locking code for the nfs "read files, which are written at the > > same time" case, just to work around __some__ of the consequences of > > this bug? Imagine, how ugly this is going to look! > > > > The whole issue is what I call a major regression, thus I strongly ask > > for a reply from Trond on this matter. > > > > I even vote for sending a revert request for this hunk to the stable > > team, where it is applicable, after Trond sorted it out (for 2.6.27?). > > > > Thanks, Aaron and Chuck for the detailed analysis - it demystified a > > wired behavior, I observed here. When you're in a process to get real > > work done in a fixed timeline, such things could make you mad.. > > Revert _what_ exactly? For your convenience, the important parts are inlined here: From Aaron's message of Tue, 9 Sep 2008 12:46:44 -0700 in this thread. 
<< EOM Of the bisected offending commit: commit e261f51f25b98c213e0b3d7f2109b117d714f69d Author: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Tue Dec 5 00:35:41 2006 -0500 NFS: Make nfs_updatepage() mark the page as dirty. This will ensure that we can call set_page_writeback() from within nfs_writepage(), which is always called with the page lock set. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> It seems to be this hunk which introduces the problem: @@ -628,7 +667,6 @@ static struct nfs_page * nfs_update_request(struct nfs_open_context* ctx, return ERR_PTR(error); } spin_unlock(&nfsi->req_lock); - nfs_mark_request_dirty(new); return new; } spin_unlock(&nfsi->req_lock); If I add that function call back in... the problem disappears. I don't know if this just papers over the real problem though? EOM This commit happened between 2.6.19 and 2.6.20, btw. > Please assume that I've been travelling for the past 5 weeks, and have > only a sketchy idea of what has been going on. Ah, I see; that explains why you didn't respond earlier. > My understanding was that this is a consequence of unordered writes > causing the file to be extended while some other task is reading. > AFAICS, this sort of behaviour has _always_ been possible. I can't see > how reverting anything will fix it. Hopefully this helps you remember the purpose of that change. Cheers, Pete ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2008-09-22 18:46 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2008-09-05 19:19 blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20 Aaron Straus 2008-09-05 19:56 ` [NFS] " Chuck Lever 2008-09-05 20:04 ` Aaron Straus 2008-09-05 20:36 ` Bernd Eckenfels 2008-09-05 20:36 ` Chuck Lever 2008-09-05 22:14 ` Aaron Straus 2008-09-06 0:03 ` Aaron Straus 2008-09-08 19:02 ` Aaron Straus 2008-09-08 21:15 ` Chuck Lever 2008-09-08 22:02 ` Aaron Straus 2008-09-09 19:46 ` Aaron Straus 2008-09-11 16:55 ` Chuck Lever 2008-09-11 17:19 ` Aaron Straus 2008-09-11 17:48 ` Chuck Lever 2008-09-11 18:49 ` Aaron Straus 2008-09-22 16:05 ` Hans-Peter Jansen 2008-09-22 16:35 ` Trond Myklebust 2008-09-22 17:04 ` Aaron Straus 2008-09-22 17:26 ` Chuck Lever 2008-09-22 17:37 ` Aaron Straus 2008-09-22 17:29 ` Trond Myklebust 2008-09-22 17:45 ` Aaron Straus 2008-09-22 18:43 ` Aaron Straus 2008-09-22 18:45 ` Hans-Peter Jansen 2008-09-22 18:45 ` Hans-Peter Jansen