From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f43.google.com ([209.85.215.43]:35780 "EHLO mail-lf0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752940AbcAVLJD (ORCPT ); Fri, 22 Jan 2016 06:09:03 -0500 Received: by mail-lf0-f43.google.com with SMTP id c192so44690394lfe.2 for ; Fri, 22 Jan 2016 03:09:02 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20160122071147.GY17997@ZenIV.linux.org.uk> References: <20160122071147.GY17997@ZenIV.linux.org.uk> Date: Fri, 22 Jan 2016 06:09:01 -0500 Message-ID: Subject: Re: Orangefs ABI documentation From: Mike Marshall To: Al Viro Cc: Linus Torvalds , linux-fsdevel Content-Type: text/plain; charset=UTF-8 Sender: linux-fsdevel-owner@vger.kernel.org List-ID: Hi Al... Thanks for the review... I'll be trying to make some progress on your concerns about wait_for_cancellation_downcall today, but in case it looks like I dropped off the face of the earth for a few days - maybe I did - a several days long ice storm is dropping on us right now, everyone is working from home, power will likely go out for some... I'm cloning your repo from kernel.org across my slow dsl link right now... -Mike "I wrote all the debugfs code " On Fri, Jan 22, 2016 at 2:11 AM, Al Viro wrote: > On Fri, Jan 15, 2016 at 04:46:09PM -0500, Mike Marshall wrote: >> Al... >> >> We have been working on Orangefs ABI (protocol between >> userspace and kernel module) documentation, and improving >> the code along the way. >> >> We wish you would look at the documentation progress, and >> look at the way that .write_iter is implemented now... plus >> the other improvements. We know that every problem you >> pointed out hasn't been worked on yet, but your feedback >> would let us know if we're headed in the right direction. >> >> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git >> for-next >> >> The improved (but not yet officially released) userspace >> function that does the writev to the Orangefs pseudo device is in: >> >> http://www.orangefs.org/svn/orangefs/branches/trunk.kernel.update/ >> src/io/dev/pint-dev.c >> >> You may decide we don't deserve an ACK yet, but since we are in >> the merge window I hope you can look at it... > > write_iter got much better, but there are serious problems with locking. > For example, just what will happen if orangefs_devreq_read() picks an op > for copying to daemon, moves it to hash, unlocks everything and starts > copy_to_user() at the time when submitter of that op (sleeping > in wait_for_matching_downcall()) gets a signal and buggers off? It calls > orangefs_clean_up_interrupted_operation(), removes the sucker from hash > and proceed to free the damn thing. Right under ongoing copy_to_user(). > Better yet, should that copy_to_user() fail, we proceed to move an already > freed op back to request list. And no, get_op()/put_op() pair won't be > enough - submitters just go ahead an call op_release(), freeing the op, > refcount be damned. They'd need to switch to put_op() for that to work. > Note that the same applies for existing use of get_op/put_op mechanism - > if that signal comes just as the orangefs_devreq_write_iter() has picked > the op from hash, we'll end up freeing the sucker right under copy_from_iter() > despite get_op(). > > What's more, a failure of that copy_to_user() leaves us in a nasty situation - > how do we tell whether to put the request back to the list or just quietly > drop it? Check the refcount? But what if the submitter (believing, and > for a good reason, that it had removed the sucker from hash) hadn't gotten > around to dropping its reference? > > Another piece of fun assumes malicious daemon; sure, it's already a FUBAR > situation, but userland shouldn't be able to corrupt the kernel memory. > Look: daemon spawns a couple of threads. One of those issues read() on > your character device, with just 16 bytes of destination mmapped and filled > with zeroes. Another spins until it sees ->tag becoming non-zero and > immediately does writev() now that it has the tag value. What we get is > submitter seeing op serviced and proceeding to free it while ->read() hits > the copy_to_user() failure and reinserts the already freed op into request > list. > > I'm not sure we have any business inserting the op to hash until we are > done with copy_to_user() and know we won't fail. Would simplify the > failure recovery as well... > > I think we ought to flag the op when submitter decides to give up on it > and have the return-to-request-list logics check that (under op->lock), > returning it to the list only in case it's not flagged *and* decrementing > refcount before dropping op->lock in that case - we are guaranteed that > this is not the last reference. In case it's flagged, we drop ->op_lock, > do put_op() and _not_ refile the sucker to request list. > > I'm not sure if this > /* NOTE: for I/O operations we handle releasing the op > * object except in the case of timeout. the reason we > * can't free the op in timeout cases is that the op > * service logic in the vfs retries operations using > * the same op ptr, thus it can't be freed. > */ > if (!timed_out) > op_release(op); > is safe, while we are at it... I suspect we'd be better off if we did > put_op() there. Unconditionally. After the entire > if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO) > thing, so that the total change of refcount is zero in all cases. At least > that makes for regular lifetime rules (with s/op_release/put_op/ in submitters) > > The thing I really don't understand is the situation with > wait_for_cancellation_downcall(). The comments in there flat-out > contradict the code - if called with signal already pending (as the > comment would seem to indicate), we might easily leave and free > the damn op without it ever being seen by daemon. What's intended there? > And what happens if we reach schedule_timeout() in there (i.e. if we get > there without a signal pending) and operation completes successfully? > AFAICS, you'll get -ETIMEDOUT, not that the caller even looked at that > return value... What's going on there? > > Another unpleasantness is that your locking hierarchy leads to > to races in orangefs_clean_up_interrupted_operation(); you take op->lock, > look at the state, decide what needs to be locked to remove the sucker from, > drop op->lock, lock the relevant data structure and proceed to search op in > it, removing it in case of success. Fun, but what happens if it gets refiled > as soon as you drop op->lock? > > I suspect that we ought to make refiling conditional on the same > "submitter has given up on that one" flag *and* move hash lock outside of > op->lock. IOW, once that flag had been set, only submitter can do anything > with op->list. Then orangefs_clean_up_interrupted_operation() could simply > make sure that flag had been set, drop op->lock, grab the relevant spinlock > and do list_del(). And to hell with "search and remove if it's there" > complications. > > One more problem is with your open_access_count; it should be > tristate, not boolean. As it is, another open() of your character > device is possible as soon as ->release() has decremented open_access_count. > IOW, purge_inprogress_ops() and purge_waiting_ops() can happen when we > already have reopened the sucker. It should have three states - "not opened", > "opened" and "shutting down", the last one both preventing open() and > being treated as "not in service". > > Input validation got a lot saner (and documentation is quite welcome). > One place where I see a problem is listxattr - > if (total + new_op->downcall.resp.listxattr.lengths[i] > size) > goto done; > is *not* enough; you need to check that it's not going backwards (i.e. isn't > less than total). total needs to be size_t, BTW - ssize_t is wrong here. > > Another issue is that you are doing register_chrdev() too early - > it should be just before register_filesystem(); definitely without any > failure exits in between. > > Debugfs stuff is FUBAR - it's sufficiently isolated, so I hadn't > spent much time on it, but for example debugfs_create_file() in > orangefs_debugfs_init() fails, we'll call orangefs_debugfs_cleanup() > right there. On module exit it will be called again, with > debugfs_remove_recursive(debug_dir) called *again*. At the very least it > should clear debug_dir in orangefs_debugfs_cleanup(). Worse, > orangefs_kernel_debug_init() will try to create an object in freed > directory... > > I've dumped some of the massage into vfs.git#orangefs-untested, but > the locking and lifetime changes are not there yet; I'll add more of that > tomorrow. I would really appreciate details on wait_for_cancellation_downcall; > that's the worst gap I have in understanding that code.