From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f177.google.com ([209.85.217.177]:35232 "EHLO mail-lb0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759987AbcCDUzi (ORCPT ); Fri, 4 Mar 2016 15:55:38 -0500 Received: by mail-lb0-f177.google.com with SMTP id bc4so74622154lbc.2 for ; Fri, 04 Mar 2016 12:55:37 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <20160122200442.GF17997@ZenIV.linux.org.uk> <20160123001202.GJ17997@ZenIV.linux.org.uk> <20160123012808.GK17997@ZenIV.linux.org.uk> <20160123191055.GN17997@ZenIV.linux.org.uk> <20160123214006.GO17997@ZenIV.linux.org.uk> <20160123224632.GQ17997@ZenIV.linux.org.uk> Date: Fri, 4 Mar 2016 15:55:36 -0500 Message-ID: Subject: Re: write() semantics (Re: Orangefs ABI documentation) From: Mike Marshall To: Linus Torvalds Cc: Al Viro , linux-fsdevel , Mike Marshall Content-Type: text/plain; charset=UTF-8 Sender: linux-fsdevel-owner@vger.kernel.org List-ID: I added a signal handler to my test program today, and now I see the giant difference between wait_for_completion_interruptible_timeout, which is what you get when you use the -intr mount option, and wait_for_completion_killable_timeout, which is what you get when you don't, so I retract what I said about the -intr mount option not being relevant ... It also seems like it would be easy (and correct) to modify the patch a little so that -EINTR would still be returned on the off-chance that the interrupt was caught before any data was written... -Mike On Thu, Mar 3, 2016 at 5:25 PM, Mike Marshall wrote: > Here is what I have come up with to try and make our return to > interrupted writes more acceptable... in my tests it seems to > work. My test involved running the client-core with a tiny IO > buffer size (4k) and a C program with a large write buffer(32M). > That way there was plenty of time for me to fire off the C program > and hit Ctrl-C while the write was chugging along. > > The return value from do_readv_writev always matches the size > of the file written by the aborted C program in my tests. > > I changed the C program around so that sometimes I ran it with > (O_CREAT | O_RDWR) on open and sometimes with (O_CREAT | O_RDWR | O_APPEND) > and it seemed to do the right thing. I didn't try to set up a > signal handler so that the signal wasn't fatal to the process, > I guess that would be a test to actually see and verify the correct > short return code to write... > > Do you all think this looks like it should work in principle? > > BTW: in the distant past someone else attempted to solve this problem > the "nfs intr" way - we have an intr mount option, and that's why there's > all that sweating over whether or not stuff is "interruptible" in > waitqueue.c... I'm not sure if our intr mount option is relevant anymore > given the way the op handling code has evolved... > > diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c > index 6f2e0f7..4349c9d 100644 > --- a/fs/orangefs/file.c > +++ b/fs/orangefs/file.c > @@ -180,21 +180,54 @@ populate_shared_memory: > } > > if (ret < 0) { > - /* > - * don't write an error to syslog on signaled operation > - * termination unless we've got debugging turned on, as > - * this can happen regularly (i.e. ctrl-c) > - */ > - if (ret == -EINTR) > + if (ret == -EINTR) { > + /* > + * We can't return EINTR if any data was written, > + * it's not POSIX. It is minimally acceptable > + * to give a partial write, the way NFS does. > + * > + * It would be optimal to return all or nothing, > + * but if a userspace write is bigger than > + * an IO buffer, and the interrupt occurs > + * between buffer writes, that would not be > + * possible. > + */ > + switch (new_op->op_state - OP_VFS_STATE_GIVEN_UP) { > + /* > + * If the op was waiting when the interrupt > + * occurred, then the client-core did not > + * trigger the write. > + */ > + case OP_VFS_STATE_WAITING: > + ret = 0; > + break; > + /* > + * If the op was in progress when the interrupt > + * occurred, then the client-core was able to > + * trigger the write. > + */ > + case OP_VFS_STATE_INPROGR: > + ret = total_size; > + break; > + default: > + gossip_err("%s: unexpected op state :%d:.\n", > + __func__, > + new_op->op_state); > + ret = 0; > + break; > + } > gossip_debug(GOSSIP_FILE_DEBUG, > - "%s: returning error %ld\n", __func__, > - (long)ret); > - else > + "%s: got EINTR, state:%d: %p\n", > + __func__, > + new_op->op_state, > + new_op); > + } else { > gossip_err("%s: error in %s handle %pU, returning %zd\n", > __func__, > type == ORANGEFS_IO_READ ? > "read from" : "write to", > handle, ret); > + } > if (orangefs_cancel_op_in_progress(new_op)) > return ret; > > > On Sat, Jan 23, 2016 at 6:35 PM, Linus Torvalds > wrote: >> On Sat, Jan 23, 2016 at 2:46 PM, Al Viro wrote: >>> >>> What should we get? -EINTR, despite having written some data? >> >> No, that's not acceptable. >> >> Either all or nothing (which is POSIX) or the NFS 'intr' mount >> behavior (partial write return, -EINTR only when nothing was written >> at all). And, like NFS, a mount option might be a good thing. >> >> And of course, for the usual reasons, fatal signals are special in >> that for them we generally say "screw posix, nobody sees the return >> value anyway", but even there the filesystem might as well still >> return the partial return value (just to not introduce yet another >> special case). >> >> In fact, I think that with our "fatal signals interrupt" behavior, >> nobody should likely use the "intr" mount option on NFS. Even if the >> semantics may be "better", there are likely simply just too many >> programs that don't check the return value of "write()" at all, much >> less handle partial writes correctly. >> >> (And yes, our "screw posix" behavior wrt fatal signals is strictly >> wrong even _despite_ the fact that nobody sees the return value - >> other processes can still obviously see that the whole write wasn't >> done. But blocking on a fatal signal is _so_ annoying that it's one of >> those things where we just say "posix was wrong on this one, and if we >> squint a bit we look _almost_ like we're compliant"). >> >> Linus