From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f170.google.com ([209.85.217.170]:32914 "EHLO mail-lb0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754482AbcAVQ7m (ORCPT ); Fri, 22 Jan 2016 11:59:42 -0500 Received: by mail-lb0-f170.google.com with SMTP id x4so44988528lbm.0 for ; Fri, 22 Jan 2016 08:59:41 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <20160122071147.GY17997@ZenIV.linux.org.uk> Date: Fri, 22 Jan 2016 11:59:40 -0500 Message-ID: Subject: Re: Orangefs ABI documentation From: Mike Marshall To: Al Viro Cc: Linus Torvalds , linux-fsdevel Content-Type: text/plain; charset=UTF-8 Sender: linux-fsdevel-owner@vger.kernel.org List-ID: Hi Al... I moved a tiny bit of your work around so it would compile, but I booted a kernel from it, and ran some tests, it seems to work OK... doing this work from home makes me remember writing cobol programs on a silent 700... my co-workers are helping me look at wait_for_cancellation_downcall... we recently made some improvements there based on some problems we were having in production with our out-of-tree Frankenstein module... I'm glad you are also looking there. # git diff diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h index b926fe77..88e606a 100644 --- a/fs/orangefs/orangefs-kernel.h +++ b/fs/orangefs/orangefs-kernel.h @@ -103,24 +103,6 @@ enum orangefs_vfs_op_states { OP_VFS_STATE_PURGED = 8, }; -#define set_op_state_waiting(op) ((op)->op_state = OP_VFS_STATE_WAITING) -#define set_op_state_inprogress(op) ((op)->op_state = OP_VFS_STATE_INPROGR) -static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op) -{ - op->op_state = OP_VFS_STATE_SERVICED; - wake_up_interruptible(&op->waitq); -} -static inline void set_op_state_purged(struct orangefs_kernel_op_s *op) -{ - op->op_state |= OP_VFS_STATE_PURGED; - wake_up_interruptible(&op->waitq); -} - -#define op_state_waiting(op) ((op)->op_state & OP_VFS_STATE_WAITING) -#define op_state_in_progress(op) ((op)->op_state & OP_VFS_STATE_INPROGR) -#define op_state_serviced(op) ((op)->op_state & OP_VFS_STATE_SERVICED) -#define op_state_purged(op) ((op)->op_state & OP_VFS_STATE_PURGED) - #define get_op(op) \ do { \ atomic_inc(&(op)->ref_count); \ @@ -259,6 +241,25 @@ struct orangefs_kernel_op_s { struct list_head list; }; +#define set_op_state_waiting(op) ((op)->op_state = OP_VFS_STATE_WAITING) +#define set_op_state_inprogress(op) ((op)->op_state = OP_VFS_STATE_INPROGR) +static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op) +{ + op->op_state = OP_VFS_STATE_SERVICED; + wake_up_interruptible(&op->waitq); +} +static inline void set_op_state_purged(struct orangefs_kernel_op_s *op) +{ + op->op_state |= OP_VFS_STATE_PURGED; + wake_up_interruptible(&op->waitq); +} + +#define op_state_waiting(op) ((op)->op_state & OP_VFS_STATE_WAITING) +#define op_state_in_progress(op) ((op)->op_state & OP_VFS_STATE_INPROGR) +#define op_state_serviced(op) ((op)->op_state & OP_VFS_STATE_SERVICED) +#define op_state_purged(op) ((op)->op_state & OP_VFS_STATE_PURGED) + + /* per inode private orangefs info */ struct orangefs_inode_s { struct orangefs_object_kref refn; diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c index 641de05..a257891 100644 --- a/fs/orangefs/waitqueue.c +++ b/fs/orangefs/waitqueue.c @@ -16,6 +16,9 @@ #include "orangefs-kernel.h" #include "orangefs-bufmap.h" +static int wait_for_cancellation_downcall(struct orangefs_kernel_op_s *); +static int wait_for_matching_downcall(struct orangefs_kernel_op_s *); + /* * What we do in this function is to walk the list of operations that are * present in the request queue and mark them as purged. -Mike On Fri, Jan 22, 2016 at 6:09 AM, Mike Marshall wrote: > Hi Al... > > Thanks for the review... > > I'll be trying to make some progress on your concerns about > wait_for_cancellation_downcall today, but in case it looks > like I dropped off the face of the earth for a few days - maybe > I did - a several days long ice storm is dropping on us right > now, everyone is working from home, power will likely go > out for some... > > I'm cloning your repo from kernel.org across my slow > dsl link right now... > > -Mike "I wrote all the debugfs code " > > On Fri, Jan 22, 2016 at 2:11 AM, Al Viro wrote: >> On Fri, Jan 15, 2016 at 04:46:09PM -0500, Mike Marshall wrote: >>> Al... >>> >>> We have been working on Orangefs ABI (protocol between >>> userspace and kernel module) documentation, and improving >>> the code along the way. >>> >>> We wish you would look at the documentation progress, and >>> look at the way that .write_iter is implemented now... plus >>> the other improvements. We know that every problem you >>> pointed out hasn't been worked on yet, but your feedback >>> would let us know if we're headed in the right direction. >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git >>> for-next >>> >>> The improved (but not yet officially released) userspace >>> function that does the writev to the Orangefs pseudo device is in: >>> >>> http://www.orangefs.org/svn/orangefs/branches/trunk.kernel.update/ >>> src/io/dev/pint-dev.c >>> >>> You may decide we don't deserve an ACK yet, but since we are in >>> the merge window I hope you can look at it... >> >> write_iter got much better, but there are serious problems with locking. >> For example, just what will happen if orangefs_devreq_read() picks an op >> for copying to daemon, moves it to hash, unlocks everything and starts >> copy_to_user() at the time when submitter of that op (sleeping >> in wait_for_matching_downcall()) gets a signal and buggers off? It calls >> orangefs_clean_up_interrupted_operation(), removes the sucker from hash >> and proceed to free the damn thing. Right under ongoing copy_to_user(). >> Better yet, should that copy_to_user() fail, we proceed to move an already >> freed op back to request list. And no, get_op()/put_op() pair won't be >> enough - submitters just go ahead an call op_release(), freeing the op, >> refcount be damned. They'd need to switch to put_op() for that to work. >> Note that the same applies for existing use of get_op/put_op mechanism - >> if that signal comes just as the orangefs_devreq_write_iter() has picked >> the op from hash, we'll end up freeing the sucker right under copy_from_iter() >> despite get_op(). >> >> What's more, a failure of that copy_to_user() leaves us in a nasty situation - >> how do we tell whether to put the request back to the list or just quietly >> drop it? Check the refcount? But what if the submitter (believing, and >> for a good reason, that it had removed the sucker from hash) hadn't gotten >> around to dropping its reference? >> >> Another piece of fun assumes malicious daemon; sure, it's already a FUBAR >> situation, but userland shouldn't be able to corrupt the kernel memory. >> Look: daemon spawns a couple of threads. One of those issues read() on >> your character device, with just 16 bytes of destination mmapped and filled >> with zeroes. Another spins until it sees ->tag becoming non-zero and >> immediately does writev() now that it has the tag value. What we get is >> submitter seeing op serviced and proceeding to free it while ->read() hits >> the copy_to_user() failure and reinserts the already freed op into request >> list. >> >> I'm not sure we have any business inserting the op to hash until we are >> done with copy_to_user() and know we won't fail. Would simplify the >> failure recovery as well... >> >> I think we ought to flag the op when submitter decides to give up on it >> and have the return-to-request-list logics check that (under op->lock), >> returning it to the list only in case it's not flagged *and* decrementing >> refcount before dropping op->lock in that case - we are guaranteed that >> this is not the last reference. In case it's flagged, we drop ->op_lock, >> do put_op() and _not_ refile the sucker to request list. >> >> I'm not sure if this >> /* NOTE: for I/O operations we handle releasing the op >> * object except in the case of timeout. the reason we >> * can't free the op in timeout cases is that the op >> * service logic in the vfs retries operations using >> * the same op ptr, thus it can't be freed. >> */ >> if (!timed_out) >> op_release(op); >> is safe, while we are at it... I suspect we'd be better off if we did >> put_op() there. Unconditionally. After the entire >> if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO) >> thing, so that the total change of refcount is zero in all cases. At least >> that makes for regular lifetime rules (with s/op_release/put_op/ in submitters) >> >> The thing I really don't understand is the situation with >> wait_for_cancellation_downcall(). The comments in there flat-out >> contradict the code - if called with signal already pending (as the >> comment would seem to indicate), we might easily leave and free >> the damn op without it ever being seen by daemon. What's intended there? >> And what happens if we reach schedule_timeout() in there (i.e. if we get >> there without a signal pending) and operation completes successfully? >> AFAICS, you'll get -ETIMEDOUT, not that the caller even looked at that >> return value... What's going on there? >> >> Another unpleasantness is that your locking hierarchy leads to >> to races in orangefs_clean_up_interrupted_operation(); you take op->lock, >> look at the state, decide what needs to be locked to remove the sucker from, >> drop op->lock, lock the relevant data structure and proceed to search op in >> it, removing it in case of success. Fun, but what happens if it gets refiled >> as soon as you drop op->lock? >> >> I suspect that we ought to make refiling conditional on the same >> "submitter has given up on that one" flag *and* move hash lock outside of >> op->lock. IOW, once that flag had been set, only submitter can do anything >> with op->list. Then orangefs_clean_up_interrupted_operation() could simply >> make sure that flag had been set, drop op->lock, grab the relevant spinlock >> and do list_del(). And to hell with "search and remove if it's there" >> complications. >> >> One more problem is with your open_access_count; it should be >> tristate, not boolean. As it is, another open() of your character >> device is possible as soon as ->release() has decremented open_access_count. >> IOW, purge_inprogress_ops() and purge_waiting_ops() can happen when we >> already have reopened the sucker. It should have three states - "not opened", >> "opened" and "shutting down", the last one both preventing open() and >> being treated as "not in service". >> >> Input validation got a lot saner (and documentation is quite welcome). >> One place where I see a problem is listxattr - >> if (total + new_op->downcall.resp.listxattr.lengths[i] > size) >> goto done; >> is *not* enough; you need to check that it's not going backwards (i.e. isn't >> less than total). total needs to be size_t, BTW - ssize_t is wrong here. >> >> Another issue is that you are doing register_chrdev() too early - >> it should be just before register_filesystem(); definitely without any >> failure exits in between. >> >> Debugfs stuff is FUBAR - it's sufficiently isolated, so I hadn't >> spent much time on it, but for example debugfs_create_file() in >> orangefs_debugfs_init() fails, we'll call orangefs_debugfs_cleanup() >> right there. On module exit it will be called again, with >> debugfs_remove_recursive(debug_dir) called *again*. At the very least it >> should clear debug_dir in orangefs_debugfs_cleanup(). Worse, >> orangefs_kernel_debug_init() will try to create an object in freed >> directory... >> >> I've dumped some of the massage into vfs.git#orangefs-untested, but >> the locking and lifetime changes are not there yet; I'll add more of that >> tomorrow. I would really appreciate details on wait_for_cancellation_downcall; >> that's the worst gap I have in understanding that code.