All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Marshall <hubcap@omnibond.com>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: Orangefs ABI documentation
Date: Fri, 22 Jan 2016 11:59:40 -0500	[thread overview]
Message-ID: <CAOg9mSQf4m-MM_kKd=UcGsCx_LZg=+nFESd8jhzDEbcM9DOGjg@mail.gmail.com> (raw)
In-Reply-To: <CAOg9mST6tDsY-cwvJsfYbcLt41Uufyy8Y-nXiCFsXwRJ0xanjQ@mail.gmail.com>

Hi Al...

I moved a tiny bit of your work around so it would compile, but I booted a
kernel from it, and ran some tests, it seems to work OK... doing this
work from home makes me remember writing cobol programs on a
silent 700... my co-workers are helping me look at
wait_for_cancellation_downcall... we recently made some
improvements there based on some problems we were
having in production with our out-of-tree Frankenstein
module... I'm glad you are also looking there.


# git diff
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index b926fe77..88e606a 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -103,24 +103,6 @@ enum orangefs_vfs_op_states {
        OP_VFS_STATE_PURGED = 8,
 };

-#define set_op_state_waiting(op)     ((op)->op_state = OP_VFS_STATE_WAITING)
-#define set_op_state_inprogress(op)  ((op)->op_state = OP_VFS_STATE_INPROGR)
-static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op)
-{
-       op->op_state = OP_VFS_STATE_SERVICED;
-       wake_up_interruptible(&op->waitq);
-}
-static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
-{
-       op->op_state |= OP_VFS_STATE_PURGED;
-       wake_up_interruptible(&op->waitq);
-}
-
-#define op_state_waiting(op)     ((op)->op_state & OP_VFS_STATE_WAITING)
-#define op_state_in_progress(op) ((op)->op_state & OP_VFS_STATE_INPROGR)
-#define op_state_serviced(op)    ((op)->op_state & OP_VFS_STATE_SERVICED)
-#define op_state_purged(op)      ((op)->op_state & OP_VFS_STATE_PURGED)
-
 #define get_op(op)                                     \
        do {                                            \
                atomic_inc(&(op)->ref_count);   \
@@ -259,6 +241,25 @@ struct orangefs_kernel_op_s {
        struct list_head list;
 };

+#define set_op_state_waiting(op)     ((op)->op_state = OP_VFS_STATE_WAITING)
+#define set_op_state_inprogress(op)  ((op)->op_state = OP_VFS_STATE_INPROGR)
+static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op)
+{
+       op->op_state = OP_VFS_STATE_SERVICED;
+       wake_up_interruptible(&op->waitq);
+}
+static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
+{
+       op->op_state |= OP_VFS_STATE_PURGED;
+       wake_up_interruptible(&op->waitq);
+}
+
+#define op_state_waiting(op)     ((op)->op_state & OP_VFS_STATE_WAITING)
+#define op_state_in_progress(op) ((op)->op_state & OP_VFS_STATE_INPROGR)
+#define op_state_serviced(op)    ((op)->op_state & OP_VFS_STATE_SERVICED)
+#define op_state_purged(op)      ((op)->op_state & OP_VFS_STATE_PURGED)
+
+
 /* per inode private orangefs info */
 struct orangefs_inode_s {
        struct orangefs_object_kref refn;
diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c
index 641de05..a257891 100644
--- a/fs/orangefs/waitqueue.c
+++ b/fs/orangefs/waitqueue.c
@@ -16,6 +16,9 @@
 #include "orangefs-kernel.h"
 #include "orangefs-bufmap.h"

+static int wait_for_cancellation_downcall(struct orangefs_kernel_op_s *);
+static int wait_for_matching_downcall(struct orangefs_kernel_op_s *);
+
 /*
  * What we do in this function is to walk the list of operations that are
  * present in the request queue and mark them as purged.

-Mike

On Fri, Jan 22, 2016 at 6:09 AM, Mike Marshall <hubcap@omnibond.com> wrote:
> Hi Al...
>
> Thanks for the review...
>
> I'll be trying to make some progress on your concerns about
> wait_for_cancellation_downcall today, but in case it looks
> like I dropped off the face of the earth for a few days - maybe
> I did - a several days long ice storm is dropping on us right
> now, everyone is working from home, power will likely go
> out for some...
>
> I'm cloning your repo from kernel.org across my slow
> dsl link right now...
>
> -Mike "I wrote all the debugfs code <g> "
>
> On Fri, Jan 22, 2016 at 2:11 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Jan 15, 2016 at 04:46:09PM -0500, Mike Marshall wrote:
>>> Al...
>>>
>>> We have been working on Orangefs ABI (protocol between
>>> userspace and kernel module) documentation, and improving
>>> the code along the way.
>>>
>>> We wish you would look at the documentation progress, and
>>> look at the way that .write_iter is implemented now... plus
>>> the other improvements. We know that every problem you
>>> pointed out hasn't been worked on yet, but your feedback
>>> would let us know if we're headed in the right direction.
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
>>> for-next
>>>
>>> The improved (but not yet officially released) userspace
>>> function that does the writev to the Orangefs pseudo device is in:
>>>
>>> http://www.orangefs.org/svn/orangefs/branches/trunk.kernel.update/
>>> src/io/dev/pint-dev.c
>>>
>>> You may decide we don't deserve an ACK yet, but since we are in
>>> the merge window I hope you can look at it...
>>
>> write_iter got much better, but there are serious problems with locking.
>> For example, just what will happen if orangefs_devreq_read() picks an op
>> for copying to daemon, moves it to hash, unlocks everything and starts
>> copy_to_user() at the time when submitter of that op (sleeping
>> in wait_for_matching_downcall()) gets a signal and buggers off?  It calls
>> orangefs_clean_up_interrupted_operation(), removes the sucker from hash
>> and proceed to free the damn thing.  Right under ongoing copy_to_user().
>> Better yet, should that copy_to_user() fail, we proceed to move an already
>> freed op back to request list.  And no, get_op()/put_op() pair won't be
>> enough - submitters just go ahead an call op_release(), freeing the op,
>> refcount be damned.  They'd need to switch to put_op() for that to work.
>> Note that the same applies for existing use of get_op/put_op mechanism -
>> if that signal comes just as the orangefs_devreq_write_iter() has picked
>> the op from hash, we'll end up freeing the sucker right under copy_from_iter()
>> despite get_op().
>>
>> What's more, a failure of that copy_to_user() leaves us in a nasty situation -
>> how do we tell whether to put the request back to the list or just quietly
>> drop it?  Check the refcount?  But what if the submitter (believing, and
>> for a good reason, that it had removed the sucker from hash) hadn't gotten
>> around to dropping its reference?
>>
>> Another piece of fun assumes malicious daemon; sure, it's already a FUBAR
>> situation, but userland shouldn't be able to corrupt the kernel memory.
>> Look: daemon spawns a couple of threads.  One of those issues read() on
>> your character device, with just 16 bytes of destination mmapped and filled
>> with zeroes.  Another spins until it sees ->tag becoming non-zero and
>> immediately does writev() now that it has the tag value.  What we get is
>> submitter seeing op serviced and proceeding to free it while ->read() hits
>> the copy_to_user() failure and reinserts the already freed op into request
>> list.
>>
>> I'm not sure we have any business inserting the op to hash until we are
>> done with copy_to_user() and know we won't fail.  Would simplify the
>> failure recovery as well...
>>
>> I think we ought to flag the op when submitter decides to give up on it
>> and have the return-to-request-list logics check that (under op->lock),
>> returning it to the list only in case it's not flagged *and* decrementing
>> refcount before dropping op->lock in that case - we are guaranteed that
>> this is not the last reference.  In case it's flagged, we drop ->op_lock,
>> do put_op() and _not_ refile the sucker to request list.
>>
>> I'm not sure if this
>>                 /* NOTE: for I/O operations we handle releasing the op
>>                  * object except in the case of timeout.  the reason we
>>                  * can't free the op in timeout cases is that the op
>>                  * service logic in the vfs retries operations using
>>                  * the same op ptr, thus it can't be freed.
>>                  */
>>                 if (!timed_out)
>>                         op_release(op);
>> is safe, while we are at it...  I suspect we'd be better off if we did
>> put_op() there.  Unconditionally.  After the entire
>>         if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO)
>> thing, so that the total change of refcount is zero in all cases.  At least
>> that makes for regular lifetime rules (with s/op_release/put_op/ in submitters)
>>
>>         The thing I really don't understand is the situation with
>> wait_for_cancellation_downcall().  The comments in there flat-out
>> contradict the code - if called with signal already pending (as the
>> comment would seem to indicate), we might easily leave and free
>> the damn op without it ever being seen by daemon.  What's intended there?
>> And what happens if we reach schedule_timeout() in there (i.e. if we get
>> there without a signal pending) and operation completes successfully?
>> AFAICS, you'll get -ETIMEDOUT, not that the caller even looked at that
>> return value...  What's going on there?
>>
>>         Another unpleasantness is that your locking hierarchy leads to
>> to races in orangefs_clean_up_interrupted_operation(); you take op->lock,
>> look at the state, decide what needs to be locked to remove the sucker from,
>> drop op->lock, lock the relevant data structure and proceed to search op in
>> it, removing it in case of success.  Fun, but what happens if it gets refiled
>> as soon as you drop op->lock?
>>
>>         I suspect that we ought to make refiling conditional on the same
>> "submitter has given up on that one" flag *and* move hash lock outside of
>> op->lock.  IOW, once that flag had been set, only submitter can do anything
>> with op->list.  Then orangefs_clean_up_interrupted_operation() could simply
>> make sure that flag had been set, drop op->lock, grab the relevant spinlock
>> and do list_del().  And to hell with "search and remove if it's there"
>> complications.
>>
>>         One more problem is with your open_access_count; it should be
>> tristate, not boolean.  As it is, another open() of your character
>> device is possible as soon as ->release() has decremented open_access_count.
>> IOW, purge_inprogress_ops() and purge_waiting_ops() can happen when we
>> already have reopened the sucker.  It should have three states - "not opened",
>> "opened" and "shutting down", the last one both preventing open() and
>> being treated as "not in service".
>>
>>         Input validation got a lot saner (and documentation is quite welcome).
>> One place where I see a problem is listxattr -
>>                 if (total + new_op->downcall.resp.listxattr.lengths[i] > size)
>>                         goto done;
>> is *not* enough; you need to check that it's not going backwards (i.e. isn't
>> less than total).  total needs to be size_t, BTW - ssize_t is wrong here.
>>
>>         Another issue is that you are doing register_chrdev() too early -
>> it should be just before register_filesystem(); definitely without any
>> failure exits in between.
>>
>>         Debugfs stuff is FUBAR - it's sufficiently isolated, so I hadn't
>> spent much time on it, but for example debugfs_create_file() in
>> orangefs_debugfs_init() fails, we'll call orangefs_debugfs_cleanup()
>> right there.  On module exit it will be called again, with
>> debugfs_remove_recursive(debug_dir) called *again*.  At the very least it
>> should clear debug_dir in orangefs_debugfs_cleanup().  Worse,
>> orangefs_kernel_debug_init() will try to create an object in freed
>> directory...
>>
>>         I've dumped some of the massage into vfs.git#orangefs-untested, but
>> the locking and lifetime changes are not there yet; I'll add more of that
>> tomorrow.  I would really appreciate details on wait_for_cancellation_downcall;
>> that's the worst gap I have in understanding that code.

  reply	other threads:[~2016-01-22 16:59 UTC|newest]

Thread overview: 111+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-15 21:46 Orangefs ABI documentation Mike Marshall
2016-01-22  7:11 ` Al Viro
2016-01-22 11:09   ` Mike Marshall
2016-01-22 16:59     ` Mike Marshall [this message]
2016-01-22 17:08       ` Al Viro
2016-01-22 17:40         ` Mike Marshall
2016-01-22 17:43         ` Al Viro
2016-01-22 18:17           ` Mike Marshall
2016-01-22 18:37             ` Al Viro
2016-01-22 19:07               ` Mike Marshall
2016-01-22 19:21                 ` Mike Marshall
2016-01-22 20:04                   ` Al Viro
2016-01-22 20:30                     ` Mike Marshall
2016-01-23  0:12                       ` Al Viro
2016-01-23  1:28                         ` Al Viro
2016-01-23  2:54                           ` Mike Marshall
2016-01-23 19:10                             ` Al Viro
2016-01-23 19:24                               ` Mike Marshall
2016-01-23 21:35                                 ` Mike Marshall
2016-01-23 22:05                                   ` Al Viro
2016-01-23 21:40                                 ` Al Viro
2016-01-23 22:36                                   ` Mike Marshall
2016-01-24  0:16                                     ` Al Viro
2016-01-24  4:05                                       ` Al Viro
2016-01-24 22:12                                         ` Mike Marshall
2016-01-30 17:22                                           ` Al Viro
2016-01-26 19:52                                         ` Martin Brandenburg
2016-01-30 17:34                                           ` Al Viro
2016-01-30 18:27                                             ` Al Viro
2016-02-04 23:30                                               ` Mike Marshall
2016-02-06 19:42                                                 ` Al Viro
2016-02-07  1:38                                                   ` Al Viro
2016-02-07  3:53                                                     ` Al Viro
2016-02-07 20:01                                                       ` [RFC] bufmap-related wait logics (Re: Orangefs ABI documentation) Al Viro
2016-02-08 22:26                                                       ` Orangefs ABI documentation Mike Marshall
2016-02-08 23:35                                                         ` Al Viro
2016-02-09  3:32                                                           ` Al Viro
2016-02-09 14:34                                                             ` Mike Marshall
2016-02-09 17:40                                                               ` Al Viro
2016-02-09 21:06                                                                 ` Al Viro
2016-02-09 22:25                                                                   ` Mike Marshall
2016-02-11 23:36                                                                   ` Mike Marshall
2016-02-09 22:02                                                                 ` Mike Marshall
2016-02-09 22:16                                                                   ` Al Viro
2016-02-09 22:40                                                                     ` Al Viro
2016-02-09 23:13                                                                       ` Al Viro
2016-02-10 16:44                                                                         ` Al Viro
2016-02-10 21:26                                                                           ` Al Viro
2016-02-11 23:54                                                                           ` Mike Marshall
2016-02-12  0:55                                                                             ` Al Viro
2016-02-12 12:13                                                                               ` Mike Marshall
2016-02-11  0:44                                                                         ` Al Viro
2016-02-11  3:22                                                                           ` Mike Marshall
2016-02-12  4:27                                                                             ` Al Viro
2016-02-12 12:26                                                                               ` Mike Marshall
2016-02-12 18:00                                                                                 ` Martin Brandenburg
2016-02-13 17:18                                                                                   ` Mike Marshall
2016-02-13 17:47                                                                                     ` Al Viro
2016-02-14  2:56                                                                                       ` Al Viro
2016-02-14  3:46                                                                                         ` [RFC] slot allocator - waitqueue use review needed (Re: Orangefs ABI documentation) Al Viro
2016-02-14  4:06                                                                                           ` Al Viro
2016-02-16  2:12                                                                                           ` Al Viro
2016-02-16 19:28                                                                                             ` Al Viro
2016-02-14 22:31                                                                                         ` Orangefs ABI documentation Mike Marshall
2016-02-14 23:43                                                                                           ` Al Viro
2016-02-15 17:46                                                                                             ` Mike Marshall
2016-02-15 18:45                                                                                               ` Al Viro
2016-02-15 22:32                                                                                                 ` Martin Brandenburg
2016-02-15 23:04                                                                                                   ` Al Viro
2016-02-16 23:15                                                                                                     ` Mike Marshall
2016-02-16 23:36                                                                                                       ` Al Viro
2016-02-16 23:54                                                                                                         ` Al Viro
2016-02-17 19:24                                                                                                           ` Mike Marshall
2016-02-17 20:11                                                                                                             ` Al Viro
2016-02-17 21:17                                                                                                               ` Al Viro
2016-02-17 22:24                                                                                                                 ` Mike Marshall
2016-02-17 22:40                                                                                                             ` Martin Brandenburg
2016-02-17 23:09                                                                                                               ` Al Viro
2016-02-17 23:15                                                                                                                 ` Al Viro
2016-02-18  0:04                                                                                                                   ` Al Viro
2016-02-18 11:11                                                                                                                     ` Al Viro
2016-02-18 18:58                                                                                                                       ` Mike Marshall
2016-02-18 19:20                                                                                                                         ` Al Viro
2016-02-18 19:49                                                                                                                         ` Martin Brandenburg
2016-02-18 20:08                                                                                                                           ` Mike Marshall
2016-02-18 20:22                                                                                                                             ` Mike Marshall
2016-02-18 20:38                                                                                                                               ` Mike Marshall
2016-02-18 20:52                                                                                                                                 ` Al Viro
2016-02-18 21:50                                                                                                                                   ` Mike Marshall
2016-02-19  0:25                                                                                                                                     ` Al Viro
2016-02-19 22:11                                                                                                                                       ` Mike Marshall
2016-02-19 22:22                                                                                                                                         ` Al Viro
2016-02-20 12:14                                                                                                                                           ` Mike Marshall
2016-02-20 13:36                                                                                                                                             ` Al Viro
2016-02-22 16:20                                                                                                                                               ` Mike Marshall
2016-02-22 21:22                                                                                                                                                 ` Mike Marshall
2016-02-23 21:58                                                                                                                                                   ` Mike Marshall
2016-02-26 20:21                                                                                                                                                     ` Mike Marshall
2016-02-19 22:32                                                                                                                                         ` Al Viro
2016-02-19 22:45                                                                                                                                           ` Martin Brandenburg
2016-02-19 22:50                                                                                                                                           ` Martin Brandenburg
2016-02-18 20:49                                                                                                                               ` Al Viro
2016-02-15 22:47                                                                                                 ` Mike Marshall
2016-01-23 22:46                                   ` write() semantics (Re: Orangefs ABI documentation) Al Viro
2016-01-23 23:35                                     ` Linus Torvalds
2016-03-03 22:25                                       ` Mike Marshall
2016-03-04 20:55                                         ` Mike Marshall
2016-01-22 20:51                     ` Orangefs ABI documentation Mike Marshall
2016-01-22 23:53                       ` Mike Marshall
2016-01-22 19:54                 ` Al Viro
2016-01-22 19:50             ` Al Viro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAOg9mSQf4m-MM_kKd=UcGsCx_LZg=+nFESd8jhzDEbcM9DOGjg@mail.gmail.com' \
    --to=hubcap@omnibond.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.