* Orangefs ABI documentation
@ 2016-01-15 21:46 Mike Marshall
  2016-01-22  7:11 ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-01-15 21:46 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, linux-fsdevel, Mike Marshall

Al...

We have been working on Orangefs ABI (protocol between
userspace and kernel module) documentation, and improving
the code along the way.

We wish you would look at the documentation progress, and
look at the way that .write_iter is implemented now... plus
the other improvements. We know that not every problem you
pointed out has been addressed yet, but your feedback
would let us know whether we're headed in the right direction.

git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
for-next

The improved (but not yet officially released) userspace
function that does the writev to the Orangefs pseudo device is in:

http://www.orangefs.org/svn/orangefs/branches/trunk.kernel.update/
src/io/dev/pint-dev.c

You may decide we don't deserve an ACK yet, but since we are in
the merge window I hope you can look at it...

-Mike

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-15 21:46 Orangefs ABI documentation Mike Marshall
@ 2016-01-22  7:11 ` Al Viro
  2016-01-22 11:09   ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-01-22  7:11 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Jan 15, 2016 at 04:46:09PM -0500, Mike Marshall wrote:
> Al...
> 
> We have been working on Orangefs ABI (protocol between
> userspace and kernel module) documentation, and improving
> the code along the way.
> 
> We wish you would look at the documentation progress, and
> look at the way that .write_iter is implemented now... plus
> the other improvements. We know that not every problem you
> pointed out has been addressed yet, but your feedback
> would let us know whether we're headed in the right direction.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
> for-next
> 
> The improved (but not yet officially released) userspace
> function that does the writev to the Orangefs pseudo device is in:
> 
> http://www.orangefs.org/svn/orangefs/branches/trunk.kernel.update/
> src/io/dev/pint-dev.c
> 
> You may decide we don't deserve an ACK yet, but since we are in
> the merge window I hope you can look at it...

write_iter got much better, but there are serious problems with locking.
For example, just what will happen if orangefs_devreq_read() picks an op
for copying to daemon, moves it to hash, unlocks everything and starts
copy_to_user() at the time when submitter of that op (sleeping
in wait_for_matching_downcall()) gets a signal and buggers off?  It calls
orangefs_clean_up_interrupted_operation(), removes the sucker from the hash
and proceeds to free the damn thing.  Right under the ongoing copy_to_user().
Better yet, should that copy_to_user() fail, we proceed to move an already
freed op back to the request list.  And no, a get_op()/put_op() pair won't be
enough - submitters just go ahead and call op_release(), freeing the op,
refcount be damned.  They'd need to switch to put_op() for that to work.
Note that the same applies for existing use of get_op/put_op mechanism -
if that signal comes just as the orangefs_devreq_write_iter() has picked
the op from hash, we'll end up freeing the sucker right under copy_from_iter()
despite get_op().

What's more, a failure of that copy_to_user() leaves us in a nasty situation -
how do we tell whether to put the request back to the list or just quietly
drop it?  Check the refcount?  But what if the submitter (believing, and
for a good reason, that it had removed the sucker from hash) hadn't gotten
around to dropping its reference?

Another piece of fun assumes a malicious daemon; sure, it's already a FUBAR
situation, but userland shouldn't be able to corrupt the kernel memory.
Look: the daemon spawns a couple of threads.  One of those issues read() on
your character device, with just 16 bytes of destination mmapped and filled
with zeroes.  Another spins until it sees ->tag becoming non-zero and
immediately does writev() now that it has the tag value.  What we get is
the submitter seeing the op serviced and proceeding to free it while ->read()
hits the copy_to_user() failure and reinserts the already freed op into the
request list.

I'm not sure we have any business inserting the op into the hash until we are
done with copy_to_user() and know we won't fail.  Would simplify the
failure recovery as well...

I think we ought to flag the op when the submitter decides to give up on it
and have the return-to-request-list logic check that (under op->lock),
returning it to the list only in case it's not flagged *and* decrementing the
refcount before dropping op->lock in that case - we are guaranteed that
this is not the last reference.  In case it's flagged, we drop op->lock,
do put_op() and do _not_ refile the sucker to the request list.
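
The flag-and-refcount protocol above can be sketched in userspace C.  This
is an editor's illustration, not the driver's actual code: all names are
made up, and the locks and list plumbing are elided.

```c
#include <stdlib.h>

/*
 * Illustrative model of the proposed protocol: the submitter sets a
 * "given up" flag under op->lock before bailing out, and the device-side
 * failure path checks that flag before refiling the op.
 */
struct op {
	int ref_count;		/* atomic_t in real kernel code */
	int given_up;		/* set by the submitter when it bails out */
	int on_request_list;	/* stands in for list_add(&op->list, ...) */
};

static void put_op(struct op *op)
{
	if (--op->ref_count == 0)
		free(op);	/* op_release() in the real code */
}

/* device-side path on copy_to_user() failure, op->lock notionally held */
static void refile_or_drop(struct op *op)
{
	if (!op->given_up) {
		/*
		 * Submitter still waits: requeue the op and drop our
		 * reference.  The submitter's own reference guarantees
		 * this decrement is not the last one.
		 */
		op->on_request_list = 1;
		op->ref_count--;
	} else {
		/* submitter is gone: never touch the request list */
		put_op(op);
	}
}
```

Either way the device side gives up exactly one reference, which is what
makes the lifetime rules regular.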

I'm not sure if this
                /* NOTE: for I/O operations we handle releasing the op
                 * object except in the case of timeout.  the reason we
                 * can't free the op in timeout cases is that the op
                 * service logic in the vfs retries operations using
                 * the same op ptr, thus it can't be freed.
                 */
                if (!timed_out)
                        op_release(op);
is safe, while we are at it...  I suspect we'd be better off if we did
put_op() there.  Unconditionally.  After the entire
	if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO)
thing, so that the total change of refcount is zero in all cases.  At least
that makes for regular lifetime rules (with s/op_release/put_op/ in submitters).

	The thing I really don't understand is the situation with
wait_for_cancellation_downcall().  The comments in there flat-out
contradict the code - if called with a signal already pending (as the
comment would seem to indicate), we might easily leave and free
the damn op without it ever being seen by the daemon.  What's intended there?
And what happens if we reach schedule_timeout() in there (i.e. if we get
there without a signal pending) and the operation completes successfully?
AFAICS, you'll get -ETIMEDOUT, not that the caller even looked at that
return value...  What's going on there?

	Another unpleasantness is that your locking hierarchy leads to
races in orangefs_clean_up_interrupted_operation(); you take op->lock,
look at the state, decide which structure the sucker needs to be removed from,
drop op->lock, lock the relevant data structure and proceed to search for the
op in it, removing it in case of success.  Fun, but what happens if it gets
refiled as soon as you drop op->lock?

	I suspect that we ought to make refiling conditional on the same
"submitter has given up on that one" flag *and* move hash lock outside of
op->lock.  IOW, once that flag had been set, only submitter can do anything
with op->list.  Then orangefs_clean_up_interrupted_operation() could simply
make sure that flag had been set, drop op->lock, grab the relevant spinlock
and do list_del().  And to hell with "search and remove if it's there"
complications. 
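
That simplified cleanup can be sketched as follows.  Again an editor's
illustration with made-up names and stubbed locks (the stubs stand in for
spinlocks and do not actually exclude anything):

```c
/*
 * With the hash lock ranked outside op->lock, and "given_up" set under
 * op->lock, nobody else may refile the op afterwards -- so cleanup can
 * unlink it with a plain list_del(), no "search and remove if it's there".
 */
struct list_head { struct list_head *prev, *next; };

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

struct op {
	int lock;		/* stands in for spinlock_t op->lock */
	int given_up;
	struct list_head list;
};

static int hash_lock;		/* outer lock in the proposed hierarchy */

static void lock(int *l)   { *l = 1; }	/* spin_lock() stand-in */
static void unlock(int *l) { *l = 0; }	/* spin_unlock() stand-in */

static void clean_up_interrupted_operation(struct op *op)
{
	lock(&op->lock);
	op->given_up = 1;	/* from here on, only we touch op->list */
	unlock(&op->lock);

	lock(&hash_lock);	/* taken outside op->lock, never inside */
	list_del(&op->list);	/* unconditional: no search needed */
	unlock(&hash_lock);
}
```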

	One more problem is with your open_access_count; it should be a
tristate, not a boolean.  As it is, another open() of your character
device is possible as soon as ->release() has decremented open_access_count.
IOW, purge_inprogress_ops() and purge_waiting_ops() can happen when we
already have reopened the sucker.  It should have three states - "not opened",
"opened" and "shutting down", the last one both preventing open() and
being treated as "not in service".
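
The three states could look like the sketch below.  Editor's illustration
only: the names are invented, and a real driver would guard the
transitions with a lock.

```c
/* "not opened" / "opened" / "shutting down" states for the device */
enum dev_state {
	DEV_NOT_OPENED,		/* open() may succeed */
	DEV_OPENED,		/* daemon attached; further open() fails */
	DEV_SHUTTING_DOWN	/* release in progress: blocks reopen and
				   counts as "not in service" */
};

static enum dev_state dev_state = DEV_NOT_OPENED;

static int dev_open(void)
{
	if (dev_state != DEV_NOT_OPENED)
		return -1;	/* -EBUSY in kernel terms */
	dev_state = DEV_OPENED;
	return 0;
}

static int dev_in_service(void)
{
	return dev_state == DEV_OPENED;
}

static void dev_release(void)
{
	dev_state = DEV_SHUTTING_DOWN;
	/* purge_waiting_ops() / purge_inprogress_ops() would run here,
	 * with reopen still excluded and the device "not in service" */
	dev_state = DEV_NOT_OPENED;
}
```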

	Input validation got a lot saner (and documentation is quite welcome).
One place where I see a problem is listxattr -
                if (total + new_op->downcall.resp.listxattr.lengths[i] > size)
                        goto done;
is *not* enough; you need to check that the sum isn't going backwards (i.e. isn't
less than total).  total needs to be size_t, BTW - ssize_t is wrong here.

	Another issue is that you are doing register_chrdev() too early -
it should be just before register_filesystem(); definitely without any
failure exits in between.

	Debugfs stuff is FUBAR - it's sufficiently isolated, so I hadn't
spent much time on it, but if, for example, debugfs_create_file() in
orangefs_debugfs_init() fails, we'll call orangefs_debugfs_cleanup()
right there.  On module exit it will be called again, with
debugfs_remove_recursive(debug_dir) called *again*.  At the very least it
should clear debug_dir in orangefs_debugfs_cleanup().  Worse,
orangefs_kernel_debug_init() will try to create an object in a freed
directory...

	I've dumped some of the massage into vfs.git#orangefs-untested, but
the locking and lifetime changes are not there yet; I'll add more of that
tomorrow.  I would really appreciate details on wait_for_cancellation_downcall;
that's the worst gap I have in understanding that code.


* Re: Orangefs ABI documentation
  2016-01-22  7:11 ` Al Viro
@ 2016-01-22 11:09   ` Mike Marshall
  2016-01-22 16:59     ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-01-22 11:09 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

Hi Al...

Thanks for the review...

I'll be trying to make some progress on your concerns about
wait_for_cancellation_downcall today, but in case it looks
like I dropped off the face of the earth for a few days - maybe
I did - a several-days-long ice storm is dropping on us right
now, everyone is working from home, power will likely go
out for some...

I'm cloning your repo from kernel.org across my slow
dsl link right now...

-Mike "I wrote all the debugfs code <g> "



* Re: Orangefs ABI documentation
  2016-01-22 11:09   ` Mike Marshall
@ 2016-01-22 16:59     ` Mike Marshall
  2016-01-22 17:08       ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-01-22 16:59 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

Hi Al...

I moved a tiny bit of your work around so it would compile, then booted a
kernel from it and ran some tests; it seems to work OK... doing this
work from home makes me remember writing cobol programs on a
silent 700... my co-workers are helping me look at
wait_for_cancellation_downcall... we recently made some
improvements there based on some problems we were
having in production with our out-of-tree Frankenstein
module... I'm glad you are also looking there.


# git diff
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index b926fe77..88e606a 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -103,24 +103,6 @@ enum orangefs_vfs_op_states {
        OP_VFS_STATE_PURGED = 8,
 };

-#define set_op_state_waiting(op)     ((op)->op_state = OP_VFS_STATE_WAITING)
-#define set_op_state_inprogress(op)  ((op)->op_state = OP_VFS_STATE_INPROGR)
-static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op)
-{
-       op->op_state = OP_VFS_STATE_SERVICED;
-       wake_up_interruptible(&op->waitq);
-}
-static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
-{
-       op->op_state |= OP_VFS_STATE_PURGED;
-       wake_up_interruptible(&op->waitq);
-}
-
-#define op_state_waiting(op)     ((op)->op_state & OP_VFS_STATE_WAITING)
-#define op_state_in_progress(op) ((op)->op_state & OP_VFS_STATE_INPROGR)
-#define op_state_serviced(op)    ((op)->op_state & OP_VFS_STATE_SERVICED)
-#define op_state_purged(op)      ((op)->op_state & OP_VFS_STATE_PURGED)
-
 #define get_op(op)                                     \
        do {                                            \
                atomic_inc(&(op)->ref_count);   \
@@ -259,6 +241,25 @@ struct orangefs_kernel_op_s {
        struct list_head list;
 };

+#define set_op_state_waiting(op)     ((op)->op_state = OP_VFS_STATE_WAITING)
+#define set_op_state_inprogress(op)  ((op)->op_state = OP_VFS_STATE_INPROGR)
+static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op)
+{
+       op->op_state = OP_VFS_STATE_SERVICED;
+       wake_up_interruptible(&op->waitq);
+}
+static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
+{
+       op->op_state |= OP_VFS_STATE_PURGED;
+       wake_up_interruptible(&op->waitq);
+}
+
+#define op_state_waiting(op)     ((op)->op_state & OP_VFS_STATE_WAITING)
+#define op_state_in_progress(op) ((op)->op_state & OP_VFS_STATE_INPROGR)
+#define op_state_serviced(op)    ((op)->op_state & OP_VFS_STATE_SERVICED)
+#define op_state_purged(op)      ((op)->op_state & OP_VFS_STATE_PURGED)
+
+
 /* per inode private orangefs info */
 struct orangefs_inode_s {
        struct orangefs_object_kref refn;
diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c
index 641de05..a257891 100644
--- a/fs/orangefs/waitqueue.c
+++ b/fs/orangefs/waitqueue.c
@@ -16,6 +16,9 @@
 #include "orangefs-kernel.h"
 #include "orangefs-bufmap.h"

+static int wait_for_cancellation_downcall(struct orangefs_kernel_op_s *);
+static int wait_for_matching_downcall(struct orangefs_kernel_op_s *);
+
 /*
  * What we do in this function is to walk the list of operations that are
  * present in the request queue and mark them as purged.

-Mike



* Re: Orangefs ABI documentation
  2016-01-22 16:59     ` Mike Marshall
@ 2016-01-22 17:08       ` Al Viro
  2016-01-22 17:40         ` Mike Marshall
  2016-01-22 17:43         ` Al Viro
  0 siblings, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-01-22 17:08 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Jan 22, 2016 at 11:59:40AM -0500, Mike Marshall wrote:
> Hi Al...
> 
> I moved a tiny bit of your work around so it would compile, then booted a
> kernel from it and ran some tests; it seems to work OK... doing this
> work from home makes me remember writing cobol programs on a
> silent 700... my co-workers are helping me look at
> wait_for_cancellation_downcall... we recently made some
> improvements there based on some problems we were
> having in production with our out-of-tree Frankenstein
> module... I'm glad you are also looking there.

BTW, what should happen to requests that got a buggered response in
orangefs_devreq_write_iter()?  As it is, you just free them and to hell
with whatever might've been waiting on them; that's easy to fix, but
what about the submitter?  Should we let it time out/simulate an error
properly returned by daemon/something else?


* Re: Orangefs ABI documentation
  2016-01-22 17:08       ` Al Viro
@ 2016-01-22 17:40         ` Mike Marshall
  2016-01-22 17:43         ` Al Viro
  1 sibling, 0 replies; 111+ messages in thread
From: Mike Marshall @ 2016-01-22 17:40 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

I think I see what you are saying... it seems like we have always
just freed them and to heck with whatever was waiting on them.

Maybe in all the places where I detect something bad, set the return code,
and goto out, I should also jam the bad return code
into op->downcall.status and goto wakeup instead of out?

Our infant fuzzer isn't really up to making test errors show up
there; I'd probably have to put some weird test stuff in
the userspace part that does the writev to make some test
failures...

-Mike

On Fri, Jan 22, 2016 at 12:08 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 11:59:40AM -0500, Mike Marshall wrote:
>> Hi Al...
>>
>> I moved a tiny bit of your work around so it would compile, but I booted a
>> kernel from it, and ran some tests, it seems to work OK... doing this
>> work from home makes me remember writing cobol programs on a
>> silent 700... my co-workers are helping me look at
>> wait_for_cancellation_downcall... we recently made some
>> improvements there based on some problems we were
>> having in production with our out-of-tree Frankenstein
>> module... I'm glad you are also looking there.
>
> BTW, what should happen to requests that got a buggered response in
> orangefs_devreq_write_iter()?  As it is, you just free them and to hell
> with whatever might've been waiting on them; that's easy to fix, but
> what about the submitter?  Should we let it time out/simulate an error
> properly returned by daemon/something else?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 17:08       ` Al Viro
  2016-01-22 17:40         ` Mike Marshall
@ 2016-01-22 17:43         ` Al Viro
  2016-01-22 18:17           ` Mike Marshall
  1 sibling, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-01-22 17:43 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Jan 22, 2016 at 05:08:38PM +0000, Al Viro wrote:

> BTW, what should happen to requests that got a buggered response in
> orangefs_devreq_write_iter()?  As it is, you just free them and to hell
> with whatever might've been waiting on them; that's easy to fix, but
> what about the submitter?  Should we let it time out/simulate an error
> properly returned by daemon/something else?

FWIW, here's what I have in mind for lifetime rules:
	1) op_alloc() creates them with refcount 1
	2) when we decide to pass the sucker to daemon, we bump the refcount
before taking the thing out of request list.  After successful copying
to userland, we check if the submitter has given up on it.  If it hasn't,
we move the sucker to hash and decrement the refcount (it can't be the
last reference there).  If it has, we do put_op().  After _failed_ copying
to userland, we also check if it's been given up.  If it hasn't, we move
it back to request list and decrement the refcount.  If it has, we do
put_op().
	3) when we get a response from daemon, we bump the refcount before
taking the thing out of hash.  Once we are done, we call put_op() - in *all*
cases.  Malformed response is treated like a well-formed error.
	4) if submitter decides to give up upon a request, it sets a flag in
->op_state.  From that point on nobody else is going to touch op->list.
	5) once submitter is done, it does put_op() - *NOT* op_release().
Normally it ends up with request freed, but if we'd raced with interaction
with daemon we'll get it freed by put_op() done in (2) or (3).  No
        /*
         * tell the device file owner waiting on I/O that this read has
         * completed and it can return now.  in this exact case, on
         * wakeup the daemon will free the op, so we *cannot* touch it
         * after this.
         */
        wake_up_daemon_for_return(new_op);
        new_op = NULL;
anymore - just wake it up and do put_op() regardless.  In that case daemon
has already done get_op() and will free the sucker once it does its put_op().

Objections, comments?

BTW, it really looks like we might be better off without atomic_t here -
op->lock is held for all non-trivial transitions and we might as well have
put_op() require op->lock held and do something like
	if (--op->ref_count) {
		spin_unlock(op->lock);
		return;
	}
	spin_unlock(op->lock);
	op_release(op);

Speaking of op->lock:
                        /*
                         * let process sleep for a few seconds so shared
                         * memory system can be initialized.
                         */
                        spin_lock_irqsave(&op->lock, irqflags);
                        prepare_to_wait(&orangefs_bufmap_init_waitq,
                                        &wait_entry,
                                        TASK_INTERRUPTIBLE);
                        spin_unlock_irqrestore(&op->lock, irqflags);
What do you need to block interrupts for?

Oh, and why do you call orangefs_op_initialize(orangefs_op) from op_release(),
when you immediately follow it with freeing orangefs_op and do it on
allocation anyway?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 17:43         ` Al Viro
@ 2016-01-22 18:17           ` Mike Marshall
  2016-01-22 18:37             ` Al Viro
  2016-01-22 19:50             ` Al Viro
  0 siblings, 2 replies; 111+ messages in thread
From: Mike Marshall @ 2016-01-22 18:17 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

>> Objections, comments?

I have no objections so far to your suggestions, I'm trying to keep
my co-workers looking in on this discussion... most other days
we're all lined up in adjacent cubes...

Your attention and fixes will almost certainly all be improvements...
I found a few fixes you made to the old
version of the writev code that were off base, but that
code made it look like impossible situations needed
to be handled...

I'll compile your repo again and run the quick tests
when you think your additions are stable...

-Mike

On Fri, Jan 22, 2016 at 12:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 05:08:38PM +0000, Al Viro wrote:
>
>> BTW, what should happen to requests that got a buggered response in
>> orangefs_devreq_write_iter()?  As it is, you just free them and to hell
>> with whatever might've been waiting on them; that's easy to fix, but
>> what about the submitter?  Should we let it time out/simulate an error
>> properly returned by daemon/something else?
>
> FWIW, here's what I have in mind for lifetime rules:
>         1) op_alloc() creates them with refcount 1
>         2) when we decide to pass the sucker to daemon, we bump the refcount
> before taking the thing out of request list.  After successful copying
> to userland, we check if the submitter has given up on it.  If it hasn't,
> we move the sucker to hash and decrement the refcount (it can't be the
> last reference there).  If it has, we do put_op().  After _failed_ copying
> to userland, we also check if it's been given up.  If it hasn't, we move
> it back to request list and decrement the refcount.  If it has, we do
> put_op().
>         3) when we get a response from daemon, we bump the refcount before
> taking the thing out of hash.  Once we are done, we call put_op() - in *all*
> cases.  Malformed response is treated like a well-formed error.
>         4) if submitter decides to give up upon a request, it sets a flag in
> ->op_state.  From that point on nobody else is going to touch op->list.
>         5) once submitter is done, it does put_op() - *NOT* op_release().
> Normally it ends up with request freed, but if we'd raced with interaction
> with daemon we'll get it freed by put_op() done in (2) or (3).  No
>         /*
>          * tell the device file owner waiting on I/O that this read has
>          * completed and it can return now.  in this exact case, on
>          * wakeup the daemon will free the op, so we *cannot* touch it
>          * after this.
>          */
>         wake_up_daemon_for_return(new_op);
>         new_op = NULL;
> anymore - just wake it up and do put_op() regardless.  In that case daemon
> has already done get_op() and will free the sucker once it does its put_op().
>
> Objections, comments?
>
> BTW, it really looks like we might be better off without atomic_t here -
> op->lock is held for all non-trivial transitions and we might as well have
> put_op() require op->lock held and do something like
>         if (--op->ref_count) {
>                 spin_unlock(op->lock);
>                 return;
>         }
>         spin_unlock(op->lock);
>         op_release(op);
>
> Speaking of op->lock:
>                         /*
>                          * let process sleep for a few seconds so shared
>                          * memory system can be initialized.
>                          */
>                         spin_lock_irqsave(&op->lock, irqflags);
>                         prepare_to_wait(&orangefs_bufmap_init_waitq,
>                                         &wait_entry,
>                                         TASK_INTERRUPTIBLE);
>                         spin_unlock_irqrestore(&op->lock, irqflags);
> What do you need to block interrupts for?
>
> Oh, and why do you call orangefs_op_initialize(orangefs_op) from op_release(),
> when you immediately follow it with freeing orangefs_op and do it on
> allocation anyway?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 18:17           ` Mike Marshall
@ 2016-01-22 18:37             ` Al Viro
  2016-01-22 19:07               ` Mike Marshall
  2016-01-22 19:50             ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-01-22 18:37 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
> >> Objections, comments?
> 
> I have no objections so far to your suggestions, I'm trying to keep
> my co-workers looking in on this discussion... most other days
> we're all lined up in adjacent cubes...

email and IRC exist for a purpose...  BTW, another thing: if you've managed
to get to orangefs_exit() with anything still in request list or hash,
you are seriously fscked.  Whatever had allocated them is presumably still
around and it's running in that module's .text...

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 18:37             ` Al Viro
@ 2016-01-22 19:07               ` Mike Marshall
  2016-01-22 19:21                 ` Mike Marshall
  2016-01-22 19:54                 ` Al Viro
  0 siblings, 2 replies; 111+ messages in thread
From: Mike Marshall @ 2016-01-22 19:07 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

Martin's the only other one subscribed to fs-devel, the
rest are looking in here today:

http://marc.info/?l=linux-fsdevel&w=2&r=1&s=orangefs&q=b

Walt thinks you deserve an Orangefs Contributor's Jacket...

-Mike

On Fri, Jan 22, 2016 at 1:37 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
>> >> Objections, comments?
>>
>> I have no objections so far to your suggestions, I'm trying to keep
>> my co-workers looking in on this discussion... most other days
>> we're all lined up in adjacent cubes...
>
> email and IRC exist for a purpose...  BTW, another thing: if you've managed
> to get to orangefs_exit() with anything still in request list or hash,
> you are seriously fscked.  Whatever had allocated them is presumably still
> around and it's running in that module's .text...

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 19:07               ` Mike Marshall
@ 2016-01-22 19:21                 ` Mike Marshall
  2016-01-22 20:04                   ` Al Viro
  2016-01-22 19:54                 ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-01-22 19:21 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

Al...

I think that my commit f987f4c28
"don't trigger copy_attributes_to_inode from d_revalidate."
doesn't really do the job... I think I at least need to
write code that checks some of the attributes and
fails the revalidate if they got changed via some
external process... does that sound right?

-Mike

On Fri, Jan 22, 2016 at 2:07 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Martin's the only other one subscribed to fs-devel, the
> rest are looking in here today:
>
> http://marc.info/?l=linux-fsdevel&w=2&r=1&s=orangefs&q=b
>
> Walt thinks you deserve an Orangefs Contributor's Jacket...
>
> -Mike
>
> On Fri, Jan 22, 2016 at 1:37 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
>>> >> Objections, comments?
>>>
>>> I have no objections so far to your suggestions, I'm trying to keep
>>> my co-workers looking in on this discussion... most other days
>>> we're all lined up in adjacent cubes...
>>
>> email and IRC exist for a purpose...  BTW, another thing: if you've managed
>> to get to orangefs_exit() with anything still in request list or hash,
>> you are seriously fscked.  Whatever had allocated them is presumably still
>> around and it's running in that module's .text...

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 18:17           ` Mike Marshall
  2016-01-22 18:37             ` Al Viro
@ 2016-01-22 19:50             ` Al Viro
  1 sibling, 0 replies; 111+ messages in thread
From: Al Viro @ 2016-01-22 19:50 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
> >> Objections, comments?
> 
> I have no objections so far to your suggestions, I'm trying to keep
> my co-workers looking in on this discussion... most other days
> we're all lined up in adjacent cubes...

FWIW, I do see a problem there.  This "submitter has given up" flag would
need to be cleared on retries ;-/

When can we end up retrying?  We'd need
	* wait_for_cancellation_downcall/wait_for_matching_downcall finished
	* op_state_serviced(op) false
	* op->downcall.status set to -EAGAIN
wait_for_matching_downcall() either returns with op_state_serviced() or
returns one -ETIMEDOUT, -EINTR, -EAGAIN or -EIO, with its return value
ending up in op->downcall.status.  So if it was wait_for_matching_downcall(),
we'd have to have returned -EAGAIN.  Which can happen only with
op_state_purged(op), i.e. with character device having been closed
after the sucker had been placed on the lists...

wait_for_cancellation_downcall() similar, but it never returns -EAGAIN.
IOW, for it retries are irrelevant.

Damnit, what's the point of picking a purged request from the request
list in orangefs_devreq_read()?  Ditto for orangefs_devreq_write_iter()
and request hash...

Note that purge *can't* happen during either of those - ->release() can't
be called while something is in ->read() or ->write_iter().  So if we would
ignore purged requests in those, the problem with retries clearing the
"I gave up" flag wouldn't exist...  It's a race anyway - submitter had been
woken up when we'd been marking the request as purged, and it's going to
remove the sucker from whatever list it's on very soon after getting woken
up.

Do you see any problems with skipping such request list/hash elements in
orangefs_devreq_read()/orangefs_devreq_write_iter() resp.?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 19:07               ` Mike Marshall
  2016-01-22 19:21                 ` Mike Marshall
@ 2016-01-22 19:54                 ` Al Viro
  1 sibling, 0 replies; 111+ messages in thread
From: Al Viro @ 2016-01-22 19:54 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Jan 22, 2016 at 02:07:23PM -0500, Mike Marshall wrote:
> Martin's the only other one subscribed to fs-devel, the
> rest are looking in here today:
> 
> http://marc.info/?l=linux-fsdevel&w=2&r=1&s=orangefs&q=b

Umm...  Why not Cc to your maillist (pvfs2-developers@beowulf-underground.org?)
if it's unmoderated?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 19:21                 ` Mike Marshall
@ 2016-01-22 20:04                   ` Al Viro
  2016-01-22 20:30                     ` Mike Marshall
  2016-01-22 20:51                     ` Orangefs ABI documentation Mike Marshall
  0 siblings, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-01-22 20:04 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
> Al...
> 
> I think that my commit f987f4c28
> "don't trigger copy_attributes_to_inode from d_revalidate."
> doesn't really do the job... I think I at least need to
> write code that checks some of the attributes and
> fails the revalidate if they got changed via some
> external process... does that sound right?

Won't your orangefs_revalidate_lookup() spot the changed handle?  Anyway,
some sanity checks in there might be a good idea - at least "has the
object type somehow changed without handle going stale?"...

Said that, what should be picking e.g. chmod/chown by external source?
You don't have ->permission() instances in there, so inode->i_mode is
used directly by generic_inode_permission()...

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 20:04                   ` Al Viro
@ 2016-01-22 20:30                     ` Mike Marshall
  2016-01-23  0:12                       ` Al Viro
  2016-01-22 20:51                     ` Orangefs ABI documentation Mike Marshall
  1 sibling, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-01-22 20:30 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

The userspace daemon (client-core) that reads/writes to the device
restarts automatically if it stops for some reason... I believe active
ops are marked "purged" when this happens, and when client-core
restarts "purged" ops are retried (once)... see the comment
in waitqueue.c "if the operation was purged in the meantime..."

I've tried to rattle Walt and Becky's chains to see if they
can describe it better...

-Mike

On Fri, Jan 22, 2016 at 3:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
>> Al...
>>
>> I think that my commit f987f4c28
>> "don't trigger copy_attributes_to_inode from d_revalidate."
>> doesn't really do the job... I think I at least need to
>> write code that checks some of the attributes and
>> fails the revalidate if they got changed via some
>> external process... does that sound right?
>
> Won't your orangefs_revalidate_lookup() spot the changed handle?  Anyway,
> some sanity checks in there might be a good idea - at least "has the
> object type somehow changed without handle going stale?"...
>
> Said that, what should be picking e.g. chmod/chown by external source?
> You don't have ->permission() instances in there, so inode->i_mode is
> used directly by generic_inode_permission()...

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 20:04                   ` Al Viro
  2016-01-22 20:30                     ` Mike Marshall
@ 2016-01-22 20:51                     ` Mike Marshall
  2016-01-22 23:53                       ` Mike Marshall
  1 sibling, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-01-22 20:51 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

Nope... I guess that's why the original developer called
copy_attributes_to_inode every time... I need to at least check
permissions... if a file's permissions change via an external source,
say from 755 to 700, the other guy can still cat the file once...

-Mike

On Fri, Jan 22, 2016 at 3:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
>> Al...
>>
>> I think that my commit f987f4c28
>> "don't trigger copy_attributes_to_inode from d_revalidate."
>> doesn't really do the job... I think I at least need to
>> write code that checks some of the attributes and
>> fails the revalidate if they got changed via some
>> external process... does that sound right?
>
> Won't your orangefs_revalidate_lookup() spot the changed handle?  Anyway,
> some sanity checks in there might be a good idea - at least "has the
> object type somehow changed without handle going stale?"...
>
> Said that, what should be picking e.g. chmod/chown by external source?
> You don't have ->permission() instances in there, so inode->i_mode is
> used directly by generic_inode_permission()...

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 20:51                     ` Orangefs ABI documentation Mike Marshall
@ 2016-01-22 23:53                       ` Mike Marshall
  0 siblings, 0 replies; 111+ messages in thread
From: Mike Marshall @ 2016-01-22 23:53 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

I built and tested orangefs-untested ...

Orangefs still seems to work OK... I built it as a module
and didn't trigger the BUG_ON when I unloaded it.

-Mike

On Fri, Jan 22, 2016 at 3:51 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Nope... I guess that's why the original developer called
> copy_attributes_to_inode every time... I need to at least check
> permissions... if a file's permissions change via an external source,
> say from 755 to 700, the other guy can still cat the file once...
>
> -Mike
>
> On Fri, Jan 22, 2016 at 3:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
>>> Al...
>>>
>>> I think that my commit f987f4c28
>>> "don't trigger copy_attributes_to_inode from d_revalidate."
>>> doesn't really do the job... I think I at least need to
>>> write code that checks some of the attributes and
>>> fails the revalidate if they got changed via some
>>> external process... does that sound right?
>>
>> Won't your orangefs_revalidate_lookup() spot the changed handle?  Anyway,
>> some sanity checks in there might be a good idea - at least "has the
>> object type somehow changed without handle going stale?"...
>>
>> Said that, what should be picking e.g. chmod/chown by external source?
>> You don't have ->permission() instances in there, so inode->i_mode is
>> used directly by generic_inode_permission()...

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-22 20:30                     ` Mike Marshall
@ 2016-01-23  0:12                       ` Al Viro
  2016-01-23  1:28                         ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-01-23  0:12 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Jan 22, 2016 at 03:30:02PM -0500, Mike Marshall wrote:
> The userspace daemon (client-core) that reads/writes to the device
> restarts automatically if it stops for some reason... I believe active
> ops are marked "purged" when this happens, and when client-core
> restarts "purged" ops are retried (once)... see the comment
> in waitqueue.c "if the operation was purged in the meantime..."
> 
> I've tried to rattle Walt and Becky's chains to see if they
> can describe it better...

What I mean is the following sequence:

Syscall: puts op into request list, sleeps in wait_for_matching_downcall()
Daemon: exits, marks it purged, wakes Syscall up
Daemon gets restarted
Daemon calls read(), finds op still on the list
Syscall: finally gets the timeslice, removes op from the list, decides to
	resubmit

This is very hard to hit - normally by the time we get around to read()
from restarted daemon the waiter had already been woken up and
already removed the purged op from the list.  So in practice you probably
had never hit that case.  However, it is theoretically possible.

What I propose to do is to have purged requests that are still in the lists
to be skipped by orangefs_devreq_read() and orangefs_devreq_remove_op().
IOW, pretend that the race had been won by whatever had been waiting on
that request and got woken up when it had been purged.

Note that by the time it gets resubmitted, it already has the 'purged' flag
removed - set_op_state_waiting(op) is done when we are inserting into
request list and it leaves no trace of OP_VFS_STATE_PURGED.  So I'm not
talking about the resubmitted stuff; just the one that had been in queue
since before the daemon restart and hadn't been removed from there yet.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-23  0:12                       ` Al Viro
@ 2016-01-23  1:28                         ` Al Viro
  2016-01-23  2:54                           ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-01-23  1:28 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Sat, Jan 23, 2016 at 12:12:02AM +0000, Al Viro wrote:
> On Fri, Jan 22, 2016 at 03:30:02PM -0500, Mike Marshall wrote:
> > The userspace daemon (client-core) that reads/writes to the device
> > restarts automatically if it stops for some reason... I believe active
> > ops are marked "purged" when this happens, and when client-core
> > restarts "purged" ops are retried (once)... see the comment
> > in waitqueue.c "if the operation was purged in the meantime..."
> > 
> > I've tried to rattle Walt and Becky's chains to see if they
> > can describe it better...
> 
> What I mean is the following sequence:
> 
> Syscall: puts op into request list, sleeps in wait_for_matching_downcall()
> Daemon: exits, marks it purged, wakes Syscall up
> Daemon gets restarted
> Daemon calls read(), finds op still on the list
> Syscall: finally gets the timeslice, removes op from the list, decides to
> 	resubmit
> 
> This is very hard to hit - normally by the time we get around to read()
> from restarted daemon the waiter had already been woken up and
> already removed the purged op from the list.  So in practice you probably
> had never hit that case.  However, it is theoretically possible.
> 
> What I propose to do is to have purged requests that are still in the lists
> to be skipped by orangefs_devreq_read() and orangefs_devreq_remove_op().
> IOW, pretend that the race had been won by whatever had been waiting on
> that request and got woken up when it had been purged.
> 
> Note that by the time it gets resubmitted, it already has the 'purged' flag
> removed - set_op_state_waiting(op) is done when we are inserting into
> request list and it leaves no trace of OP_VFS_STATE_PURGED.  So I'm not
> talking about the resubmitted stuff; just the one that had been in queue
> since before the daemon restart and hadn't been removed from there yet.

OK, aforementioned locking/refcounting scheme implemented (and completely
untested).  See #orangefs-untested.

Rules:

*  refcounting is real refcounting - objects are created with refcount 1,
get_op() increments refcount, op_release() decrements and frees when zero.

* daemon interaction (read/write_iter) is refcount-neutral - it grabs
a reference when picking request from list/hash and always drops it in
the end.  Submitters are always releasing the reference acquired when
allocating request.

* when submitter decides to give up upon request (for any reason - timeout,
signal, daemon disconnect) it marks it with new bit - OP_VFS_STATE_GIVEN_UP.
Once that is done, nobody else is allowed to touch its ->list.

* request is inserted into hash only when we'd succeeded copying to daemon's
memory (and only if request hadn't been given up while we'd been copying it).
If copying fails, we only have to refile to the request list (and only if
it hadn't been given up).

* when copying a response from daemon, request is only marked serviced if it
hadn't been given up while we were parsing the response.  Malformed responses,
failed copying of response, etc. are treated as well-formed error response
would be.  Error value is the one we'll be returning to daemon - might or
might not be right from the error recovery standpoint.

* hash lock nests *outside* of op->lock now and all work with the hash is
protected by it.

* daemon interaction skips the requests that had been around since before
the control device had been opened, acting as if the (already woken up)
submitters had already gotten around to removing those from the list/hash.
That's the normal outcome of such race anyway, and it simplifies the analysis.

Again, it's completely untested; might oops, break stuff, etc.  I _think_
it makes sense from the lifetime rules and locking POV, but...

	I still would like to understand what's going on in
wait_for_cancellation_downcall() - it *probably* shouldn't matter for the
correctness of that massage, but there might be dragons.  I don't understand
in which situation wrt pending signals it is supposed to be called and the
comments in there contradict the actual code.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-23  1:28                         ` Al Viro
@ 2016-01-23  2:54                           ` Mike Marshall
  2016-01-23 19:10                             ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-01-23  2:54 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

Well... that all seems awesome, and compiled the first
time and all my quick tests on my dinky vm make
it seem fine... It is Becky that recently spent a
bunch of time fighting the cancellation dragons,
I'll see if I can't get her to weigh in on
wait_for_cancellation_downcall tomorrow.

We have some gnarly tests we were running on
real hardware that helped reproduce the problems
she was seeing in production with Clemson's
Palmetto Cluster, I'll run them, but maybe not
until Monday with the ice storm...

Thanks Al...

-Mike

On Fri, Jan 22, 2016 at 8:28 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jan 23, 2016 at 12:12:02AM +0000, Al Viro wrote:
>> On Fri, Jan 22, 2016 at 03:30:02PM -0500, Mike Marshall wrote:
>> > The userspace daemon (client-core) that reads/writes to the device
>> > restarts automatically if it stops for some reason... I believe active
>> > ops are marked "purged" when this happens, and when client-core
>> > restarts "purged" ops are retried (once)... see the comment
>> > in waitqueue.c "if the operation was purged in the meantime..."
>> >
>> > I've tried to rattle Walt and Becky's chains to see if they
>> > can describe it better...
>>
>> What I mean is the following sequence:
>>
>> Syscall: puts op into request list, sleeps in wait_for_matching_downcall()
>> Daemon: exits, marks it purged, wakes Syscall up
>> Daemon gets restarted
>> Daemon calls read(), finds op still on the list
>> Syscall: finally gets the timeslice, removes op from the list, decides to
>>       resubmit
>>
>> This is very hard to hit - normally by the time we get around to read()
>> from restarted daemon the waiter had already been woken up and
>> already removed the purged op from the list.  So in practice you probably
>> had never hit that case.  However, it is theoretically possible.
>>
>> What I propose to do is to have purged requests that are still in the lists
>> to be skipped by orangefs_devreq_read() and orangefs_devreq_remove_op().
>> IOW, pretend that the race had been won by whatever had been waiting on
>> that request and got woken up when it had been purged.
>>
>> Note that by the time it gets resubmitted, it already has the 'purged' flag
>> removed - set_op_state_waiting(op) is done when we are inserting into
>> request list and it leaves no trace of OP_VFS_STATE_PURGED.  So I'm not
>> talking about the resubmitted stuff; just the one that had been in queue
>> since before the daemon restart and hadn't been removed from there yet.
>
> OK, aforementioned locking/refcounting scheme implemented (and completely
> untested).  See #orangefs-untested.
>
> Rules:
>
> *  refcounting is real refcounting - objects are created with refcount 1,
> get_op() increments refcount, op_release() decrements and frees when zero.
>
> * daemon interaction (read/write_iter) is refcount-neutral - it grabs
> a reference when picking request from list/hash and always drops it in
> the end.  Submitters are always releasing the reference acquired when
> allocating request.
>
> * when submitter decides to give up upon request (for any reason - timeout,
> signal, daemon disconnect) it marks it with new bit - OP_VFS_STATE_GIVEN_UP.
> Once that is done, nobody else is allowed to touch its ->list.
>
> * request is inserted into hash only when we'd succeeded copying to daemon's
> memory (and only if request hadn't been given up while we'd been copying it).
> If copying fails, we only have to refile to the request list (and only if
> it hadn't been given up).
>
> * when copying a response from daemon, request is only marked serviced if it
> hadn't been given up while we were parsing the response.  Malformed responses,
> failed copying of response, etc. are treated as well-formed error response
> would be.  Error value is the one we'll be returning to daemon - might or
> might not be right from the error recovery standpoint.
>
> * hash lock nests *outside* of op->lock now and all work with the hash is
> protected by it.
>
> * daemon interaction skips the requests that had been around since before
> the control device had been opened, acting as if the (already woken up)
> submitters had already gotten around to removing those from the list/hash.
> That's the normal outcome of such race anyway, and it simplifies the analysis.
>
> Again, it's completely untested; might oops, break stuff, etc.  I _think_
> it makes sense from the lifetime rules and locking POV, but...
>
>         I still would like to understand what's going on in
> wait_for_cancellation_request() - it *probably* shouldn't matter for the
> correctness of that massage, but there might be dragons.  I don't understand
> in which situation wrt pending signals it is supposed to be called and the
> comments in there contradict the actual code.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-23  2:54                           ` Mike Marshall
@ 2016-01-23 19:10                             ` Al Viro
  2016-01-23 19:24                               ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-01-23 19:10 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Jan 22, 2016 at 09:54:48PM -0500, Mike Marshall wrote:
> Well... that all seems awesome, and compiled the first
> time and all my quick tests on my dinky vm make
> it seem fine... It is Becky that recently spent a
> bunch of time fighting the cancellation dragons,
> I'll see if I can't get her to weigh in on
> wait_for_cancellation_downcall tomorrow.
> 
> We have some gnarly tests we were running on
> real hardware that helped reproduce the problems
> she was seeing in production with Clemson's
> Palmetto Cluster, I'll run them, but maybe not
> until Monday with the ice storm...

OK, several more pushed.  The most interesting part is probably switch
to real completions - you'd been open-coding them for no good reason
(and as always with reinventing locking primitives, asking for trouble).

New bits just as untested as the earlier ones, of course...


* Re: Orangefs ABI documentation
  2016-01-23 19:10                             ` Al Viro
@ 2016-01-23 19:24                               ` Mike Marshall
  2016-01-23 21:35                                 ` Mike Marshall
  2016-01-23 21:40                                 ` Al Viro
  0 siblings, 2 replies; 111+ messages in thread
From: Mike Marshall @ 2016-01-23 19:24 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

OK, I'll get them momentarily...

I merged your other patches, and there was a merge
conflict I had to work around... you're working from
an orangefs tree that lacks one commit I had made
last week... my linux-next tree has all your patches
through yesterday in it now...

I am setting up "the gnarly test" (at home from a VM,
though) that should cause a bunch of cancellations,
I want to see if I can get
wait_for_cancellation_downcall to ever
flow past that "if (signal_pending(current)) {"
block... if it does, that demonstrates where
the comments conflict with the code, right?

-Mike

On Sat, Jan 23, 2016 at 2:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 09:54:48PM -0500, Mike Marshall wrote:
>> Well... that all seems awesome, and compiled the first
>> time and all my quick tests on my dinky vm make
>> it seem fine... It is Becky that recently spent a
>> bunch of time fighting the cancellation dragons,
>> I'll see if I can't get her to weigh in on
>> wait_for_cancellation_downcall tomorrow.
>>
>> We have some gnarly tests we were running on
>> real hardware that helped reproduce the problems
>> she was seeing in production with Clemson's
>> Palmetto Cluster, I'll run them, but maybe not
>> until Monday with the ice storm...
>
> OK, several more pushed.  The most interesting part is probably switch
> to real completions - you'd been open-coding them for no good reason
> (and as always with reinventing locking primitives, asking for trouble).
>
> New bits just as untested as the earlier ones, of course...


* Re: Orangefs ABI documentation
  2016-01-23 19:24                               ` Mike Marshall
@ 2016-01-23 21:35                                 ` Mike Marshall
  2016-01-23 22:05                                   ` Al Viro
  2016-01-23 21:40                                 ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-01-23 21:35 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

I compiled and tested the new patches,
they seem to work more than great, unless
it is just my imagination that the kernel
module is much faster now. I'll measure it
with more than seat-of-the-pants to see
for sure. The patches are pushed to
the for-next branch.

My "gnarly test" can get the code to flow
into wait_for_cancellation_downcall, but
never would flow past the
"if (signal_pending(current)) {" block,
though that doesn't prove anything...

I had to look at the wiki page for "cargo culting" <g>...
When Becky was working on the cancellation
problem I alluded to earlier, we talked about and
suspected the spin_lock_irqsaves in
service_operation were not appropriate...

Thanks again Al...

-Mike


On Sat, Jan 23, 2016 at 2:24 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> OK, I'll get them momentarily...
>
> I merged your other patches, and there was a merge
> conflict I had to work around... you're working from
> an orangefs tree that lacks one commit I had made
> last week... my linux-next tree has all your patches
> through yesterday in it now...
>
> I am setting up "the gnarly test" (at home from a VM,
> though) that should cause a bunch of cancellations,
> I want to see if I can get
> wait_for_cancellation_downcall to ever
> flow past that "if (signal_pending(current)) {"
> block... if it does, that demonstrates where
> the comments conflict with the code, right?
>
> -Mike
>
> On Sat, Jan 23, 2016 at 2:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Jan 22, 2016 at 09:54:48PM -0500, Mike Marshall wrote:
>>> Well... that all seems awesome, and compiled the first
>>> time and all my quick tests on my dinky vm make
>>> it seem fine... It is Becky that recently spent a
>>> bunch of time fighting the cancellation dragons,
>>> I'll see if I can't get her to weigh in on
>>> wait_for_cancellation_downcall tomorrow.
>>>
>>> We have some gnarly tests we were running on
>>> real hardware that helped reproduce the problems
>>> she was seeing in production with Clemson's
>>> Palmetto Cluster, I'll run them, but maybe not
>>> until Monday with the ice storm...
>>
>> OK, several more pushed.  The most interesting part is probably switch
>> to real completions - you'd been open-coding them for no good reason
>> (and as always with reinventing locking primitives, asking for trouble).
>>
>> New bits just as untested as the earlier ones, of course...


* Re: Orangefs ABI documentation
  2016-01-23 19:24                               ` Mike Marshall
  2016-01-23 21:35                                 ` Mike Marshall
@ 2016-01-23 21:40                                 ` Al Viro
  2016-01-23 22:36                                   ` Mike Marshall
  2016-01-23 22:46                                   ` write() semantics (Re: Orangefs ABI documentation) Al Viro
  1 sibling, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-01-23 21:40 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Sat, Jan 23, 2016 at 02:24:51PM -0500, Mike Marshall wrote:
> OK, I'll get them momentarily...
> 
> I merged your other patches, and there was a merge
> conflict I had to work around... you're working from
> an orangefs tree that lacks one commit I had made
> last week... my linux-next tree has all your patches
> through yesterday in it now...
> 
> I am setting up "the gnarly test" (at home from a VM,
> though) that should cause a bunch of cancellations,
> I want to see if I can get
> wait_for_cancellation_downcall to ever
> flow past that "if (signal_pending(current)) {"
> block... if it does, that demonstrates where
> the comments conflict with the code, right?

Yes...  BTW, speaking of that codepath - how can the second caller of
handle_io_error() ever get !op_state_serviced(new_op)?  That failure,
after all, had been in postcopy_buffers(), so the daemon is sitting
in its write_iter() waiting until we finish copying the data out of
bufmap; it's too late for sending cancel anyway, is it not?  IOW, would
the following do the right thing?  That would've left us with only
one caller of handle_io_error()...

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index c585063d..86ba1df 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -245,15 +245,8 @@ populate_shared_memory:
 				       buffer_index,
 				       iter,
 				       new_op->downcall.resp.io.amt_complete);
-		if (ret < 0) {
-			/*
-			 * put error codes in downcall so that handle_io_error()
-			 * preserves it properly
-			 */
-			new_op->downcall.status = ret;
-			handle_io_error();
-			goto out;
-		}
+		if (ret < 0)
+			goto done_copying;
 	}
 	gossip_debug(GOSSIP_FILE_DEBUG,
 	    "%s(%pU): Amount written as returned by the sys-io call:%d\n",
@@ -263,6 +256,7 @@ populate_shared_memory:
 
 	ret = new_op->downcall.resp.io.amt_complete;
 
+done_copying:
 	/*
 	 * tell the device file owner waiting on I/O that this read has
 	 * completed and it can return now.


* Re: Orangefs ABI documentation
  2016-01-23 21:35                                 ` Mike Marshall
@ 2016-01-23 22:05                                   ` Al Viro
  0 siblings, 0 replies; 111+ messages in thread
From: Al Viro @ 2016-01-23 22:05 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Sat, Jan 23, 2016 at 04:35:32PM -0500, Mike Marshall wrote:
> I compiled and tested the new patches,
> they seem to work more than great, unless
> it is just my imagination that the kernel
> module is much faster now. I'll measure it
> with more than seat-of-the-pants to see
> for sure. The patches are pushed to
> the for-next branch.

As long as the speedup is not due to not actually doing the IO... ;-)

> My "gnarly test" can get the code to flow
> into wait_for_cancellation_downcall, but
> never would flow past the
> "if (signal_pending(current)) {" block,
> though that doesn't prove anything...

AFAICS, it is possible to get there without a signal - you need the first
attempt of file_read / file_write to fail on disconnect, be retried and
that retry to fail on timeout.  _Then_ you get to sending a cancel without
any signals ever being sent to you.  

Said that, I'm not sure that this "we don't wait at all if the signals
are pending" is right - consider e.g. a read getting killed after it had
been picked by daemon and before the daemon gets a chance to reply.
We submit a cancel, but unless daemon picks it immediately, we don't
even bother waiting - we see that cancel hadn't been serviced yet, that
a signal is pending and we remove the cancel from request list.  In that
scenario daemon doesn't get a chance to see the cancel request at all...

In which situations do we want cancel to be given to daemon and do we
need to wait for its response?

> I had to look at the wiki page for "cargo culting" <g>...
> When Becky was working on the cancellation
> problem I alluded to earlier, we talked about and
> suspected the spin_lock_irqsaves in
> service_operation were not appropriate...

It isn't - first of all, prepare_to_wait() does spin_lock_irqsave() itself
(on the queue lock), so interrupt-disabling bits are there anyway.  And
the exclusion on op->lock is completely pointless, obviously - you are waiting
for bufmap initialization, which has nothing to do with specific op.
Note that by that point op is not on any lists, so there's nobody who could
see it, let alone grab the same spinlock...


* Re: Orangefs ABI documentation
  2016-01-23 21:40                                 ` Al Viro
@ 2016-01-23 22:36                                   ` Mike Marshall
  2016-01-24  0:16                                     ` Al Viro
  2016-01-23 22:46                                   ` write() semantics (Re: Orangefs ABI documentation) Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-01-23 22:36 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

AV> IOW, would the following do the right thing?
AV> That would've left us with only one caller of
AV> handle_io_error()...

It works.  With your simplified code all
the needed things still happen: complete and
bufmap_put...

I've never had an error there unless I forgot
to turn on the client-core...

You must be looking for a way to get rid of
another macro <g>...

-Mike

On Sat, Jan 23, 2016 at 4:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jan 23, 2016 at 02:24:51PM -0500, Mike Marshall wrote:
>> OK, I'll get them momentarily...
>>
>> I merged your other patches, and there was a merge
>> conflict I had to work around... you're working from
>> an orangefs tree that lacks one commit I had made
>> last week... my linux-next tree has all your patches
>> through yesterday in it now...
>>
>> I am setting up "the gnarly test" (at home from a VM,
>> though) that should cause a bunch of cancellations,
>> I want to see if I can get
>> wait_for_cancellation_downcall to ever
>> flow past that "if (signal_pending(current)) {"
>> block... if it does, that demonstrates where
>> the comments conflict with the code, right?
>
> Yes...  BTW, speaking of that codepath - how can the second caller of
> handle_io_error() ever get !op_state_serviced(new_op)?  That failure,
> after all, had been in postcopy_buffers(), so the daemon is sitting
> in its write_iter() waiting until we finish copying the data out of
> bufmap; it's too late for sending cancel anyway, is it not?  IOW, would
> the following do the right thing?  That would've left us with only
> one caller of handle_io_error()...
>
> diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
> index c585063d..86ba1df 100644
> --- a/fs/orangefs/file.c
> +++ b/fs/orangefs/file.c
> @@ -245,15 +245,8 @@ populate_shared_memory:
>                                        buffer_index,
>                                        iter,
>                                        new_op->downcall.resp.io.amt_complete);
> -               if (ret < 0) {
> -                       /*
> -                        * put error codes in downcall so that handle_io_error()
> -                        * preserves it properly
> -                        */
> -                       new_op->downcall.status = ret;
> -                       handle_io_error();
> -                       goto out;
> -               }
> +               if (ret < 0)
> +                       goto done_copying;
>         }
>         gossip_debug(GOSSIP_FILE_DEBUG,
>             "%s(%pU): Amount written as returned by the sys-io call:%d\n",
> @@ -263,6 +256,7 @@ populate_shared_memory:
>
>         ret = new_op->downcall.resp.io.amt_complete;
>
> +done_copying:
>         /*
>          * tell the device file owner waiting on I/O that this read has
>          * completed and it can return now.


* write() semantics (Re: Orangefs ABI documentation)
  2016-01-23 21:40                                 ` Al Viro
  2016-01-23 22:36                                   ` Mike Marshall
@ 2016-01-23 22:46                                   ` Al Viro
  2016-01-23 23:35                                     ` Linus Torvalds
  1 sibling, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-01-23 22:46 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Sat, Jan 23, 2016 at 09:40:06PM +0000, Al Viro wrote:

> Yes...  BTW, speaking of that codepath - how can the second caller of
> handle_io_error() ever get !op_state_serviced(new_op)?  That failure,
> after all, had been in postcopy_buffers(), so the daemon is sitting
> in its write_iter() waiting until we finish copying the data out of
> bufmap; it's too late for sending cancel anyway, is it not?  IOW, would
> the following do the right thing?  That would've left us with only
> one caller of handle_io_error()...

FWIW, I'm not sure I like the correctness implications of the cancel
thing.  Look: we do large write(), it sends a couple of chunks successfully,
gets to submitting the third one, copies its data to bufmap, tells the
daemon to start writing, then gets a signal, sends cancel and buggers off.

What should we get?  -EINTR, despite having written some data?  That's
what the code does now, but I'm not sure it's what the userland expects.
Two chunks worth of data we'd written?  That's what one would expect
if the third one had hit an unmapped page, but in scenario with a signal
hitting us the daemon might very well have overwritten more of the file
by the time it had seen the cancel.

AFAICS, POSIX flat-out prohibits the current behaviour - what it says for
write(2) is
[EINTR]
    The write operation was terminated due to the receipt of a signal,
and no data was transferred.
^^^^^^^^^^^^^^^^^^^^^^^^^^^

but I'm not sure if "return a short write and to hell with having some
data beyond the returned amount actually written" would be better from
the userland POV.  It would be closer to what e.g. NFS is doing, though...

Linus?


* Re: write() semantics (Re: Orangefs ABI documentation)
  2016-01-23 22:46                                   ` write() semantics (Re: Orangefs ABI documentation) Al Viro
@ 2016-01-23 23:35                                     ` Linus Torvalds
  2016-03-03 22:25                                       ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Linus Torvalds @ 2016-01-23 23:35 UTC (permalink / raw)
  To: Al Viro; +Cc: Mike Marshall, linux-fsdevel

On Sat, Jan 23, 2016 at 2:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What should we get?  -EINTR, despite having written some data?

No, that's not acceptable.

Either all or nothing (which is POSIX) or the NFS 'intr' mount
behavior (partial write return, -EINTR only when nothing was written
at all). And, like NFS, a mount option might be a good thing.

And of course, for the usual reasons, fatal signals are special in
that for them we generally say "screw posix, nobody sees the return
value anyway", but even there the filesystem might as well still
return the partial return value (just to not introduce yet another
special case).

In fact, I think that with our "fatal signals interrupt" behavior,
nobody should likely use the "intr" mount option on NFS. Even if the
semantics may be "better", there are likely simply just too many
programs that don't check the return value of "write()" at all, much
less handle partial writes correctly.

(And yes, our "screw posix" behavior wrt fatal signals is strictly
wrong even _despite_ the fact that nobody sees the return value -
other processes can still obviously see that the whole write wasn't
done. But blocking on a fatal signal is _so_ annoying that it's one of
those things where we just say "posix was wrong on this one, and if we
squint a bit we look _almost_ like we're compliant").

              Linus


* Re: Orangefs ABI documentation
  2016-01-23 22:36                                   ` Mike Marshall
@ 2016-01-24  0:16                                     ` Al Viro
  2016-01-24  4:05                                       ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-01-24  0:16 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Sat, Jan 23, 2016 at 05:36:41PM -0500, Mike Marshall wrote:
> AV> IOW, would the following do the right thing?
> AV> That would've left us with only one caller of
> AV> handle_io_error()...
> 
> It works.  With your simplified code all
> the needed things still happen: complete and
> bufmap_put...
> 
> I've never had an error there unless I forgot
> to turn on the client-core...
> 
> You must be looking for a way to get rid of
> another macro <g>...

That as well, but mostly I want to sort the situation with cancels out and
get a better grasp on when can that code be reached.  BTW, an error at that
spot is trivial to arrange - just pass read() a destination with munmapped
page in the middle and it'll trigger just fine.  IOW,
	p = mmap(NULL, 65536, PROT_READ|PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	munmap(p + 16384, 16384);
	read(fd, p, 65536);
with fd being a file on orangefs should step into that.


* Re: Orangefs ABI documentation
  2016-01-24  0:16                                     ` Al Viro
@ 2016-01-24  4:05                                       ` Al Viro
  2016-01-24 22:12                                         ` Mike Marshall
  2016-01-26 19:52                                         ` Martin Brandenburg
  0 siblings, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-01-24  4:05 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Sun, Jan 24, 2016 at 12:16:15AM +0000, Al Viro wrote:
> On Sat, Jan 23, 2016 at 05:36:41PM -0500, Mike Marshall wrote:
> > AV> IOW, would the following do the right thing?
> > AV> That would've left us with only one caller of
> > AV> handle_io_error()...
> > 
> > It works.  With your simplified code all
> > the needed things still happen: complete and
> > bufmap_put...
> > 
> > I've never had an error there unless I forgot
> > to turn on the client-core...
> > 
> > You must be looking for a way to get rid of
> > another macro <g>...
> 
> That as well, but mostly I want to sort the situation with cancels out and
> get a better grasp on when can that code be reached.  BTW, an error at that
> spot is trivial to arrange - just pass read() a destination with munmapped
> page in the middle and it'll trigger just fine.  IOW,
> 	p = mmap(NULL, 65536, PROT_READ|PROT_WRITE,
> 		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	munmap(p + 16384, 16384);
> 	read(fd, p, 65536);
> with fd being a file on orangefs should step into that.

Hmm...  I've just realized that I don't understand why you are waiting in
the orangefs_devreq_write_iter() at all.  After all, you've reserved the
slot in wait_for_direct_io() and do not release it until right after you've
done complete(&op->done).  Why should server block while your read(2) copies
the data from bufmap to its real destination?  After all, any further
request hitting the same slot won't come until the slot is released anyway,
right?  What (and why) would the server be doing to that slot in the
meanwhile?  It's already copied whatever data it was going to copy as part of
that file_read...

Speaking of your slot allocator - surely it would be better to maintain the
count of unused slots there?  For example, replace that waitq with semaphore,
initialize it to number of slots, and have wait_for_a_slot() do
	if (down_interruptible(slargs->slots_sem)) {
		whine "interrupted"
		return -EINTR;
	}
	/* now we know that there's an empty slot for us */
	spin_lock(slargs->slots_lock);
	n = find_first_zero_bit(slargs->slots_bitmap, slargs->slot_count);
	set_bit(slargs->slots_bitmap, n);
	spin_unlock(slargs->slots_lock);
	// or a lockless variant thereof, for that matter - just need to be
	// careful with barriers
	return n;
with put_back_slot() being
	spin_lock
	clear_bit
	spin_unlock
	up

Wait a minute...  What happens if you have a daemon disconnect while somebody
is holding a bufmap slot?  Each of those (as well as whoever's waiting for
a slot to come free) is holding a reference to bufmap.  devreq_mutex won't
do a damn thing, contrary to the comment in ->release() - it's _not_ held
across the duration of wait_for_direct_io().

We'll just decrement the refcount of bufmap, do nothing since it hasn't
reached zero, proceed to mark all ops as purged, wake each service_operation()
up and sod off.  Now, the holders of those slots will call
orangefs_get_bufmap_init(), get 1 (since we hadn't dropped the last reference
yet - can it *ever* see 0 there, actually?) and return -EAGAIN.  With
wait_for_direct_io() noticing that, freeing the slot and going into restart.
And if there was the only one, we are fine, but what if there were several?

Suppose there had been two read() going on at the time of disconnect.
The first one drops the slot.  Good, refcount of bufmap is down to 1 now.
And we go back to orangefs_bufmap_get().  Which calls orangefs_bufmap_ref().
Which sees that __orangefs_bufmap is still non-NULL.  And promptly regains
the reference and allocates the slot in that sucker.  Then it's back to
service_operation()...  Which will check is_daemon_in_service(), possibly
getting "yes, it is" (if the new one got already started).  Ouch.

The fun part in all of that is that new daemon won't be able to get new
bufmap in place until all users of the old one run down.  What a mess...

Ho-hum...  So we need to
	a) have ->release() wait for all slots to get freed
	b) have orangefs_bufmap_get() recognize that situation and
_not_ get new reference to bufmap that is already going down.
	c) move the "wait for new daemon to come and install a new bufmap"
to some point past the release of the slot - service_operation() is too
early for that to work.

Or am I missing something simple that makes the scenario above not go that
way?  Confused...


* Re: Orangefs ABI documentation
  2016-01-24  4:05                                       ` Al Viro
@ 2016-01-24 22:12                                         ` Mike Marshall
  2016-01-30 17:22                                           ` Al Viro
  2016-01-26 19:52                                         ` Martin Brandenburg
  1 sibling, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-01-24 22:12 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

Al...

I read your description of waiting in orangefs_devreq_write_iter()
and it seems to make sense, so I built a version with it removed,
and I also still have the bit in file.c with handle_io_error removed.
I put a git diff at the end of this message in case it is
not clear what I did... that I GOT HERE I put in there was to help
me see the results of running the mmap test you showed me...

I use dbench as a tester, maybe too much, but when I break
the code I can usually see that it is broken when I run dbench.

So... without the waiting in orangefs_devreq_write_iter, I ran
three separate dbenches like this at the same time:

/home/hubcap/dbench  -c client.txt 10 -t 300

That's 30 threads and tons of concurrent IO, both reading and writing,
and all three dbenches complete properly... so... looking
at the code makes it seem reasonble to get rid of the wait,
and tests make it seem reasonable too.

On to the restarting of client-core... they tell me it works
in production with our frankensteined works-on-everything
out-of-tree kernel module, I'm not sure.

But in my tests, if I kill the client-core bad things happen...
sometimes the client-core doesn't restart, and the kernel gets
sick (hangs or slows way down but no oops). When the client-core
does restart, the activity I had going on (dbench again) fizzles out,
and the filesystem is corrupted...

[root@be1 d.dbench1]# pwd
/pvfsmnt/d.dbench1
[root@be1 d.dbench1]# ls
clients  client.txt
[root@be1 d.dbench1]# rm -rf clients
rm: cannot remove ‘clients/client5/~dmtmp/PWRPNT’: Directory not empty

Once when I stopped the client-core, the kernel did oops:


Jan 24 11:17:39 be1 kernel: [ 4284.798158] Call Trace:
Jan 24 11:17:39 be1 kernel: [ 4284.799274]  [<ffffffff8110547c>] ?
print_time.part.9+0x6c/0x90
Jan 24 11:17:39 be1 kernel: [ 4284.801904]  [<ffffffff811a265d>] ?
irq_work_queue+0xd/0x80
Jan 24 11:17:39 be1 kernel: [ 4284.804376]  [<ffffffff81106c32>] ?
wake_up_klogd+0x32/0x40
Jan 24 11:17:39 be1 kernel: [ 4284.806850]  [<ffffffff810f63f4>]
lock_acquire+0xc4/0x150
Jan 24 11:17:39 be1 kernel: [ 4284.809274]  [<ffffffff813471ab>] ?
put_back_slot+0x1b/0x70
Jan 24 11:17:39 be1 kernel: [ 4284.811790]  [<ffffffff817db871>]
_raw_spin_lock+0x31/0x40
Jan 24 11:17:39 be1 kernel: [ 4284.814230]  [<ffffffff813471ab>] ?
put_back_slot+0x1b/0x70
Jan 24 11:17:39 be1 kernel: [ 4284.816750]  [<ffffffff813471ab>]
put_back_slot+0x1b/0x70
Jan 24 11:17:39 be1 kernel: [ 4284.819180]  [<ffffffff813479ab>]
orangefs_readdir_index_put+0x4b/0x70
Jan 24 11:17:39 be1 kernel: [ 4284.822081]  [<ffffffff81346f22>]
orangefs_readdir+0xd42/0xd50
Jan 24 11:17:39 be1 kernel: [ 4284.824672]  [<ffffffff810f344d>] ?
trace_hardirqs_on+0xd/0x10
Jan 24 11:17:39 be1 kernel: [ 4284.827214]  [<ffffffff81256f7f>]
iterate_dir+0x9f/0x130
Jan 24 11:17:39 be1 kernel: [ 4284.829567]  [<ffffffff81257491>]
SyS_getdents+0xa1/0x140
Jan 24 11:17:39 be1 kernel: [ 4284.832022]  [<ffffffff81257010>] ?
iterate_dir+0x130/0x130
Jan 24 11:17:39 be1 kernel: [ 4284.834456]  [<ffffffff817dc332>]
entry_SYSCALL_64_fastpath+0x12/0x76

I turned on the kernel crash dump stuff so that I could see more, but
it hasn't crashed again...

Anyhow, I don't think the "restart the client-core" code is up to snuff <g>.

I'll look closer at how the out-of-tree module works, maybe it really
does work and we've broken it with our massive changes to the
upstream version over the last few years. I see that the client (whose
job it is to restart the client-core) and the client-core implement
signal handling with signal(2), whose man page says to use
sigaction(2) instead...

# pwd
/home/hubcap/linux
[root@be1 linux]# git diff
diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
index 812844f..d8e138f 100644
--- a/fs/orangefs/devorangefs-req.c
+++ b/fs/orangefs/devorangefs-req.c
@@ -421,6 +421,7 @@ wakeup:
         * application reading/writing this device to return until
         * the buffers are done being used.
         */
+/*
        if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO) {
                long n = wait_for_completion_interruptible_timeout(&op->done,
                                                        op_timeout_secs * HZ);
@@ -434,6 +435,7 @@ wakeup:
                                __func__);
                }
        }
+*/
 out:
        op_release(op);
        return ret;
diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index c585063..208f0ee 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -245,15 +245,9 @@ populate_shared_memory:
                                       buffer_index,
                                       iter,
                                       new_op->downcall.resp.io.amt_complete);
-               if (ret < 0) {
-                       /*
-                        * put error codes in downcall so that handle_io_error()
-                        * preserves it properly
-                        */
-                       new_op->downcall.status = ret;
-                       handle_io_error();
-                       goto out;
-               }
+gossip_err("%s: I GOT HERE, ret:%zd:\n", __func__, ret);
+               if (ret < 0)
+                       goto done_copying;
        }
        gossip_debug(GOSSIP_FILE_DEBUG,
            "%s(%pU): Amount written as returned by the sys-io call:%d\n",
@@ -263,6 +257,8 @@ populate_shared_memory:

        ret = new_op->downcall.resp.io.amt_complete;

+done_copying:
+
        /*
         * tell the device file owner waiting on I/O that this read has
         * completed and it can return now.

On Sat, Jan 23, 2016 at 11:05 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Jan 24, 2016 at 12:16:15AM +0000, Al Viro wrote:
>> On Sat, Jan 23, 2016 at 05:36:41PM -0500, Mike Marshall wrote:
>> > AV> IOW, would the following do the right thing?
>> > AV> That would've left us with only one caller of
>> > AV> handle_io_error()...
>> >
>> > It works.  With your simplified code all
>> > the needed things still happen: complete and
>> > bufmap_put...
>> >
>> > I've never had an error there unless I forgot
>> > to turn on the client-core...
>> >
>> > You must be looking for a way to get rid of
>> > another macro <g>...
>>
>> That as well, but mostly I want to sort the situation with cancels out and
>> get a better grasp on when can that code be reached.  BTW, an error at that
>> spot is trivial to arrange - just pass read() a destination with munmapped
>> page in the middle and it'll trigger just fine.  IOW,
>>       p = mmap(NULL, 65536, PROT_READ|PROT_WRITE,
>>                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>       munmap(p + 16384, 16384);
>>       read(fd, p, 65536);
>> with fd being a file on orangefs should step into that.
>
> Hmm...  I've just realized that I don't understand why you are waiting in
> the orangefs_devreq_write_iter() at all.  After all, you've reserved the
> slot in wait_for_direct_io() and do not release it until right after you've
> done complete(&op->done).  Why should server block while your read(2) copies
> the data from bufmap to its real destination?  After all, any further
> request hitting the same slot won't come until the slot is released anyway,
> right?  What (and why) would the server be doing to that slot in the
> meanwhile?  It's already copied whatever data it was going to copy as part of
> that file_read...
>
> Speaking of your slot allocator - surely it would be better to maintain the
> count of unused slots there?  For example, replace that waitq with semaphore,
> initialize it to number of slots, and have wait_for_a_slot() do
>         if (down_interruptible(slargs->slots_sem)) {
>                 whine "interrupted"
>                 return -EINTR;
>         }
>         /* now we know that there's an empty slot for us */
>         spin_lock(slargs->slots_lock);
>         n = find_first_zero_bit(slargs->slots_bitmap, slargs->slot_count);
>         set_bit(slargs->slots_bitmap, n);
>         spin_unlock(slargs->slots_lock);
>         // or a lockless variant thereof, for that matter - just need to be
>         // carefull with barriers
>         return n;
> with put_back_slot() being
>         spin_lock
>         clear_bit
>         spin_unlock
>         up
>
> Wait a minute...  What happens if you have a daemon disconnect while somebody
> is holding a bufmap slot?  Each of those (as well as whoever's waiting for
> a slot to come free) is holding a reference to bufmap.  devreq_mutex won't
> do a damn thing, contrary to the comment in ->release() - it's _not_ held
> across the duration of wait_for_direct_io().
>
> We'll just decrement the refcount of bufmap, do nothing since it hasn't
> reached zero, proceed to mark all ops as purged, wake each service_operation()
> up and sod off.  Now, the holders of those slots will call
> orangefs_get_bufmap_init(), get 1 (since we hadn't dropped the last reference
> yet - can it *ever* see 0 there, actually?) and return -EAGAIN.  With
> wait_for_direct_io() noticing that, freeing the slot and going into restart.
> And if there was the only one, we are fine, but what if there were several?
>
> Suppose there had been two read() going on at the time of disconnect.
> The first one drops the slot.  Good, refcount of bufmap is down to 1 now.
> And we go back to orangefs_bufmap_get().  Which calls orangefs_bufmap_ref().
> Which sees that __orangefs_bufmap is still non-NULL.  And promptly regains
> the reference and allocates the slot in that sucker.  Then it's back to
> service_operations()...  Which will check is_daemon_in_service(), possibly
> getting "yes, it is" (if the new one got already started).  Ouch.
>
> The fun part in all of that is that new daemon won't be able to get new
> bufmap in place until all users of the old one run down.  What a mess...
>
> Ho-hum...  So we need to
>         a) have ->release() wait for all slots to get freed
>         b) have orangefs_bufmap_get() recognize that situation and
> _not_ get new reference to bufmap that is already going down.
>         c) move the "wait for new daemon to come and install a new bufmap"
> to some point past the release of the slot - service_operation() is too
> early for that to work.
>
> Or am I missing something simple that makes the scenario above not go that
> way?  Confused...

^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-24  4:05                                       ` Al Viro
  2016-01-24 22:12                                         ` Mike Marshall
@ 2016-01-26 19:52                                         ` Martin Brandenburg
  2016-01-30 17:34                                           ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Martin Brandenburg @ 2016-01-26 19:52 UTC (permalink / raw)
  To: Al Viro; +Cc: Mike Marshall, Linus Torvalds, linux-fsdevel

On 1/23/16, Al Viro <viro@zeniv.linux.org.uk> wrote:
> We'll just decrement the refcount of bufmap, do nothing since it hasn't
> reached zero, proceed to mark all ops as purged, wake each service_operation()
> up and sod off.  Now, the holders of those slots will call
> orangefs_get_bufmap_init(), get 1 (since we hadn't dropped the last reference
> yet - can it *ever* see 0 there, actually?) and return -EAGAIN.  With
> wait_for_direct_io() noticing that, freeing the slot and going into restart.
> And if there was the only one, we are fine, but what if there were several?

The answer here is yes. Otherwise a malicious client could not set up
the bufmap then crash the kernel by attempting to use it.

-- Martin

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-24 22:12                                         ` Mike Marshall
@ 2016-01-30 17:22                                           ` Al Viro
  0 siblings, 0 replies; 111+ messages in thread
From: Al Viro @ 2016-01-30 17:22 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Sun, Jan 24, 2016 at 05:12:30PM -0500, Mike Marshall wrote:
> But in my tests, if I kill the client-core bad things happen...
> sometimes the client-core doesn't restart, and the kernel gets
> sick (hangs or slows way down but no oops). When the client-core
> does restart, the activity I had going on (dbench again) fizzles out,
> and the filesystem is corrupted...

> Anyhow, I don't think the "restart the client-core" code is up to snuff <g>.
> 
> I'll look closer at how the out-of-tree module works, maybe it really
> does work and we've broken it with our massive changes to the
> upstream version over the last few years. I see that the client (whose
> job it is to restart the client-core) and the client-core implement
> signal handling with signal(2), whose man page says to use
> sigaction(2) instead...

Could you try this and see if either WARN_ON() actually triggers?

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index c585063d..e2ab0d4 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -246,10 +246,7 @@ populate_shared_memory:
 				       iter,
 				       new_op->downcall.resp.io.amt_complete);
 		if (ret < 0) {
-			/*
-			 * put error codes in downcall so that handle_io_error()
-			 * preserves it properly
-			 */
+			WARN_ON(!op_state_serviced(new_op));
 			new_op->downcall.status = ret;
 			handle_io_error();
 			goto out;
diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c
index cdbf57b..191d886 100644
--- a/fs/orangefs/waitqueue.c
+++ b/fs/orangefs/waitqueue.c
@@ -205,6 +205,7 @@ retry_servicing:
 
 		/* op uses shared memory */
 		if (orangefs_get_bufmap_init() == 0) {
+			WARN_ON(1);
 			/*
 			 * This operation uses the shared memory system AND
 			 * the system is not yet ready. This situation occurs

^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-26 19:52                                         ` Martin Brandenburg
@ 2016-01-30 17:34                                           ` Al Viro
  2016-01-30 18:27                                             ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-01-30 17:34 UTC (permalink / raw)
  To: Martin Brandenburg; +Cc: Mike Marshall, Linus Torvalds, linux-fsdevel

On Tue, Jan 26, 2016 at 02:52:23PM -0500, Martin Brandenburg wrote:
> On 1/23/16, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > We'll just decrement the refcount of bufmap, do nothing since it hasn't
> > reached zero, proceed to mark all ops as purged, wake each service_operation()
> > up and sod off.  Now, the holders of those slots will call
> > orangefs_get_bufmap_init(), get 1 (since we hadn't dropped the last reference
> > yet - can it *ever* see 0 there, actually?) and return -EAGAIN.  With
> > wait_for_direct_io() noticing that, freeing the slot and going into restart.
> > And if there was the only one, we are fine, but what if there were several?
> 
> The answer here is yes. Otherwise a malicious client could not set up
> the bufmap then crash the kernel by attempting to use it.

Umm...  The question was "what happens if there was more than one slot
in use when we hit
        if (ret == -EAGAIN && op_state_purged(new_op)) {
                orangefs_bufmap_put(bufmap, buffer_index);
                gossip_debug(GOSSIP_FILE_DEBUG,
                             "%s:going to repopulate_shared_memory.\n",
                             __func__);
                goto populate_shared_memory;
        }
in wait_for_direct_io()?"  If the answer is "yes", I'd like to see more
detailed version, if possible...

Note that __orangefs_bufmap won't become NULL until all slots are freed,
so getting to that place with more than one slot in use will have us
go to populate_shared_memory, where we'll grab a new reference to the
same old bufmap and allocate a slot _there_...

And again, how could
                /* op uses shared memory */
                if (orangefs_get_bufmap_init() == 0) {
in service_operation() possibly be true, when we have
	* op->uses_shared_memory just checked to be set
	* all callers that set it (orangefs_readdir() and wait_for_direct_io()
having allocated a slot before calling service_operation() and not releasing
it until service_operation() returns
	* __orangefs_bufmap not becoming NULL until all slots are freed and
	* orangefs_get_bufmap_init() returning 1 unless __orangefs_bufmap is
NULL?

AFAICS, that code (waiting for daemon to be restarted) is provably never
executed.  What am I missing?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-30 17:34                                           ` Al Viro
@ 2016-01-30 18:27                                             ` Al Viro
  2016-02-04 23:30                                               ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-01-30 18:27 UTC (permalink / raw)
  To: Martin Brandenburg; +Cc: Mike Marshall, Linus Torvalds, linux-fsdevel

On Sat, Jan 30, 2016 at 05:34:13PM +0000, Al Viro wrote:

> And again, how could
>                 /* op uses shared memory */
>                 if (orangefs_get_bufmap_init() == 0) {
> in service_operation() possibly be true, when we have
> 	* op->uses_shared_memory just checked to be set
> 	* all callers that set it (orangefs_readdir() and wait_for_direct_io()
> having allocated a slot before calling service_operation() and not releasing
> it until service_operation() returns
> 	* __orangefs_bufmap not becoming NULL until all slots are freed and
> 	* orangefs_get_bufmap_init() returning 1 unless __orangefs_bufmap is
> NULL?
> 
> AFAICS, that code (waiting for daemon to be restarted) is provably never
> executed.  What am I missing?

While we are at it, what happens to original code (without refcount changes,
etc. from my pile - the problem remains with those, but let's look at the
code at merge from v4.4) if something does read() and gets a signal
just as the daemon gets to
        n = copy_from_iter(&op->downcall, downcall_size, iter);
in ->write_iter(), reporting that it has finished that read?  We were in
wait_for_matching_downcall(), interruptibly sleeping.  We got woken up by
signal delivery.  Checked op->op_state; it's still not marked as serviced.
We check signal_pending() and find it true.  We hit
orangefs_clean_up_interrupted_operation(), which doesn't do anything to
op->op_state, and we return -EINTR to service_operation().  Which returns
without waiting for anything to wait_for_direct_io() and we go into
sending a cancel.  Now the daemon regains the timeslice.  Marks op as
serviced.  And proceeds to wait on op->io_completion_waitq.

Who's going to wake it up?  orangefs_cancel_op_in_progress() sure as hell
won't - it has no way to find op, nevermind doing wakeups.  The rest of
wait_for_direct_io() also doesn't wake the daemon up.  How is that supposed
to work?  Moreover, we issue a cancel; when is it supposed to be processed
and how do we tell if it's already been processed?

Who should be waiting for what in case of cancel being issued just as the
daemon gets around to reporting success of original operation?  On the
protocol level, that is.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-01-30 18:27                                             ` Al Viro
@ 2016-02-04 23:30                                               ` Mike Marshall
  2016-02-06 19:42                                                 ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-04 23:30 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

Hi Al...

Sorry I didn't get to your WARN_ONs right away, I've been trying to get
a better handle on the crashing-the-kernel-when-the-client-core-restarts
problem. I've gotten the kdump stuff working and my feeble crash-foo
has improved a little...

I did some testing with the out-of-tree version of the kernel module,
and it is actually able to recover when the client-core is stopped pretty
well. I looked closer at the client-core code and talked with some
of the others who've been around longer than me. The idea behind
the restart is rooted in "keeping things going" in production if
the client-core was to crash for some reason. On closer inspection
I see that the signal handling code in the client-core that simulates
a crash on SIGSEGV and SIGABRT is indeed written with sigaction
like the signal(2) man page says it should be.

As for the WARN_ONs, the waitqueue one is easy to hit when the
client-core stops and restarts, you can see here where precopy_buffers
started whining about the client-core, you can see that the client-core
restarted when the debug mask got sent back over, and then
the WARN_ON in waitqueue gets hit:

[ 1239.198976] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.198979] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.198983] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.199175] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199177] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199180] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.199601] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199602] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199604] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.248239] dispatch_ioctl_command: client debug mask has been been received :0: :0:
[ 1239.248257] dispatch_ioctl_command: client debug array string has been received.
[ 1239.307842] ------------[ cut here ]------------
[ 1239.307847] WARNING: CPU: 0 PID: 1347 at fs/orangefs/waitqueue.c:208 service_operation+0x59f/0x9b0()
[ 1239.307848] Modules linked in: bnep bluetooth ip6t_rpfilter rfkill ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw ppdev parport_pc virtio_balloon virtio_console parport 8139too serio_raw pvpanic i2c_piix4 uinput qxl drm_kms_helper ttm drm 8139cp i2c_core virtio_pci ata_generic virtio virtio_ring mii pata_acpi
[ 1239.307870] CPU: 0 PID: 1347 Comm: dbench Not tainted 4.4.0-161988-g237f828-dirty #49
[ 1239.307871] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1239.307872]  0000000000000000 0000000011eac412 ffff88003bf27cd0 ffffffff8139c84d
[ 1239.307874]  ffff88003bf27d08 ffffffff8108e510 ffff880010968000 ffff88001096c1d0
[ 1239.307876]  ffff88001096c188 00000000fffffff5 0000000000000000 ffff88003bf27d18
[ 1239.307877] Call Trace:
[ 1239.307881]  [<ffffffff8139c84d>] dump_stack+0x19/0x1c
[ 1239.307884]  [<ffffffff8108e510>] warn_slowpath_common+0x80/0xc0
[ 1239.307886]  [<ffffffff8108e65a>] warn_slowpath_null+0x1a/0x20
[ 1239.307887]  [<ffffffff812fe73f>] service_operation+0x59f/0x9b0
[ 1239.307889]  [<ffffffff810c28b0>] ? prepare_to_wait_event+0x100/0x100
[ 1239.307891]  [<ffffffff810c28b0>] ? prepare_to_wait_event+0x100/0x100
[ 1239.307893]  [<ffffffff812fbd12>] orangefs_readdir+0x172/0xd50
[ 1239.307895]  [<ffffffff810c9a2d>] ? trace_hardirqs_on+0xd/0x10
[ 1239.307898]  [<ffffffff8120e4bf>] iterate_dir+0x9f/0x130
[ 1239.307899]  [<ffffffff8120e9d0>] SyS_getdents+0xa0/0x140
[ 1239.307900]  [<ffffffff8120e550>] ? iterate_dir+0x130/0x130
[ 1239.307903]  [<ffffffff8178302f>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 1239.307904] ---[ end trace 66a9a15ad78b3dea ]---


On the next restart, the kernel crashed. The client-core is restarted
(restarting?) and orangefs_readdir is dealing with the wreckage and
trying to give up the slot it was using:

[ 1255.683226] dispatch_ioctl_command: client debug mask has been been received :0: :0:
[ 1255.683245] dispatch_ioctl_command: client debug array string has been received.
[ 1255.711036] BUG: unable to handle kernel paging request at ffffffff810b0d68
[ 1255.711911] IP: [<ffffffff810cad14>] __lock_acquire+0x1a4/0x1e50
[ 1255.712605] PGD 1c13067 PUD 1c14063 PMD 10001e1
[ 1255.713147] Oops: 0003 [#1]
[ 1255.713522] Modules linked in: bnep bluetooth ip6t_rpfilter rfkill ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw ppdev parport_pc virtio_balloon virtio_console parport 8139too serio_raw pvpanic i2c_piix4 uinput qxl drm_kms_helper ttm drm 8139cp i2c_core virtio_pci ata_generic virtio virtio_ring mii pata_acpi
[ 1255.720073] CPU: 0 PID: 1347 Comm: dbench Tainted: G        W       4.4.0-161988-g237f828-dirty #49
[ 1255.721160] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1255.721845] task: ffff880036894d00 ti: ffff88003bf24000 task.ti: ffff88003bf24000
[ 1255.722674] RIP: 0010:[<ffffffff810cad14>]  [<ffffffff810cad14>] __lock_acquire+0x1a4/0x1e50
[ 1255.723669] RSP: 0018:ffff88003bf27c28  EFLAGS: 00010082
[ 1255.724289] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810b0bd0
[ 1255.725143] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880003067d58
[ 1255.726002] RBP: ffff88003bf27ce8 R08: 0000000000000001 R09: 0000000000000000
[ 1255.726852] R10: ffff880036894d00 R11: 0000000000000000 R12: ffff880003067d58
[ 1255.727731] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1255.728585] FS:  00007fbe5f7d7740(0000) GS:ffffffff81c28000(0000) knlGS:0000000000000000
[ 1255.729593] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1255.730288] CR2: ffffffff810b0d68 CR3: 00000000160da000 CR4: 00000000000006f0
[ 1255.731145] Stack:
[ 1255.731393]  0000000000000296 0000000011eac412 ffff880010968000 ffff88003bf27c70
[ 1255.732414]  00000001000ecc8c ffff88003bf27d18 ffffffff81781b3a ffff88003bf27c90
[ 1255.733340]  0000000000000282 dead000000000200 0000000000000000 00000001000ecc8c
[ 1255.734267] Call Trace:
[ 1255.734567]  [<ffffffff81781b3a>] ? schedule_timeout+0x18a/0x2a0
[ 1255.735296]  [<ffffffff810e82b0>] ? trace_event_raw_event_tick_stop+0x100/0x100
[ 1255.736176]  [<ffffffff810cca82>] lock_acquire+0xc2/0x150
[ 1255.736814]  [<ffffffff812fcbfb>] ? put_back_slot+0x1b/0x70
[ 1255.737467]  [<ffffffff817825b1>] _raw_spin_lock+0x31/0x40
[ 1255.738159]  [<ffffffff812fcbfb>] ? put_back_slot+0x1b/0x70
[ 1255.738838]  [<ffffffff812fcbfb>] put_back_slot+0x1b/0x70
[ 1255.739494]  [<ffffffff812fd36b>] orangefs_readdir_index_put+0x4b/0x70
[ 1255.740287]  [<ffffffff812fbccd>] orangefs_readdir+0x12d/0xd50
[ 1255.740998]  [<ffffffff810c9a2d>] ? trace_hardirqs_on+0xd/0x10
[ 1255.741688]  [<ffffffff8120e4bf>] iterate_dir+0x9f/0x130
[ 1255.742339]  [<ffffffff8120e9d0>] SyS_getdents+0xa0/0x140
[ 1255.743017]  [<ffffffff8120e550>] ? iterate_dir+0x130/0x130
[ 1255.743678]  [<ffffffff8178302f>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 1255.744417] Code: c3 49 81 3c 24 c0 6b ef 81 b8 00 00 00 00 44 0f 44 c0 83 fb 01 0f 87 f0 fe ff ff 89 d8 49 8b 4c c4 08 48 85 c9 0f 84 e0 fe ff ff <ff> 81 98 01 00 00 44 8b 1d 97 42 ac 01 41 8b 9a 40 06 00 00 45
[ 1255.747699] RIP  [<ffffffff810cad14>] __lock_acquire+0x1a4/0x1e50
[ 1255.748450]  RSP <ffff88003bf27c28>
[ 1255.748898] CR2: ffffffff810b0d68
[ 1255.749293] ---[ end trace 66a9a15ad78b3deb ]---


There's a lot of timing that has to be gotten right across one of these
restarts, and all the changes we have made to the upstream kernel module
vs. the out-of-tree version have, I guess, gotten some of that timing out
of whack. I'm glad you're helping <g>...

One of the tests I made in the last few days was to basically give up on
every op that showed up in service_operation while is_daemon_in_service()
was failing. That kept the kernel happy, but was obviously more
brutal to the processes (dbench in my test case, people doing work in
real life) than managing to wait on as many of the ops as possible.

Anyhow... there's several more patches from both me and Martin in get-next
as of today. There's still a problem with his permissions patch, but I
guess it is good that we have that call-out now, and we'll get it working
properly. And still no patch to fix the as yet official follow_link
change - I hope I'm not giving Stephen Rothwell heartburn over that. He
said I could just merge in your fceef393a538 commit and then
change/get-rid-of our follow_link code and that would be OK...

-Mike

On Sat, Jan 30, 2016 at 1:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jan 30, 2016 at 05:34:13PM +0000, Al Viro wrote:
>
>> And again, how could
>>                 /* op uses shared memory */
>>                 if (orangefs_get_bufmap_init() == 0) {
>> in service_operation() possibly be true, when we have
>>       * op->uses_shared_memory just checked to be set
>>       * all callers that set it (orangefs_readdir() and wait_for_direct_io()
>> having allocated a slot before calling service_operation() and not releasing
>> it until service_operation() returns
>>       * __orangefs_bufmap not becoming NULL until all slots are freed and
>>       * orangefs_get_bufmap_init() returning 1 unless __orangefs_bufmap is
>> NULL?
>>
>> AFAICS, that code (waiting for daemon to be restarted) is provably never
>> executed.  What am I missing?
>
> While we are at it, what happens to original code (without refcount changes,
> etc. from my pile - the problem remains with those, but let's look at the
> code at merge from v4.4) if something does read() and gets a signal
> just as the daemon gets to
>         n = copy_from_iter(&op->downcall, downcall_size, iter);
> in ->write_iter(), reporting that it has finished that read?  We were in
> wait_for_matching_downcall(), interruptibly sleeping.  We got woken up by
> signal delivery.  Checked op->op_state; it's still not marked as serviced.
> We check signal_pending() and find it true.  We hit
> orangefs_clean_up_interrupted_operation(), which doesn't do anything to
> op->op_state, and we return -EINTR to service_operation().  Which returns
> without waiting for anything to wait_for_direct_io() and we go into
> sending a cancel.  Now the daemon regains the timeslice.  Marks op as
> serviced.  And proceeds to wait on op->io_completion_waitq.
>
> Who's going to wake it up?  orangefs_cancel_op_in_progress() sure as hell
> won't - it has no way to find op, nevermind doing wakeups.  The rest of
> wait_for_direct_io() also doesn't wake the daemon up.  How is that supposed
> to work?  Moreover, we issue a cancel; when is it supposed to be processed
> and how do we tell if it's already been processed?
>
> Who should be waiting for what in case of cancel being issued just as the
> daemon gets around to reporting success of original operation?  On the
> protocol level, that is.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-04 23:30                                               ` Mike Marshall
@ 2016-02-06 19:42                                                 ` Al Viro
  2016-02-07  1:38                                                   ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-06 19:42 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Thu, Feb 04, 2016 at 06:30:26PM -0500, Mike Marshall wrote:
> As for the WARN_ONs, the waitqueue one is easy to hit when the
> client-core stops and restarts, you can see here where precopy_buffers
> started whining about the client-core, you can see that the client
> core restarted when the debug mask got sent back over, and then
> the WARN_ON in waitqueue gets hit:
> 
> [ 1239.198976] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
> [ 1239.198979] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
> [ 1239.198983] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
> [ 1239.199175] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
> [ 1239.199177] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
> [ 1239.199180] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
> [ 1239.199601] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
> [ 1239.199602] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
> [ 1239.199604] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
> [ 1239.248239] dispatch_ioctl_command: client debug mask has been been received :0: :0:
> [ 1239.248257] dispatch_ioctl_command: client debug array string has been received.
> [ 1239.307842] ------------[ cut here ]------------
> [ 1239.307847] WARNING: CPU: 0 PID: 1347 at fs/orangefs/waitqueue.c:208 service_operation+0x59f/0x9b0()
> [ 1239.307848] Modules linked in: bnep bluetooth ip6t_rpfilter rfkill ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw ppdev parport_pc virtio_balloon virtio_console parport 8139too serio_raw pvpanic i2c_piix4 uinput qxl drm_kms_helper ttm drm 8139cp i2c_core virtio_pci ata_generic virtio virtio_ring mii pata_acpi
> [ 1239.307870] CPU: 0 PID: 1347 Comm: dbench Not tainted 4.4.0-161988-g237f828-dirty #49
> [ 1239.307871] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [ 1239.307872]  0000000000000000 0000000011eac412 ffff88003bf27cd0 ffffffff8139c84d
> [ 1239.307874]  ffff88003bf27d08 ffffffff8108e510 ffff880010968000 ffff88001096c1d0
> [ 1239.307876]  ffff88001096c188 00000000fffffff5 0000000000000000 ffff88003bf27d18
> [ 1239.307877] Call Trace:
> [ 1239.307881]  [<ffffffff8139c84d>] dump_stack+0x19/0x1c
> [ 1239.307884]  [<ffffffff8108e510>] warn_slowpath_common+0x80/0xc0
> [ 1239.307886]  [<ffffffff8108e65a>] warn_slowpath_null+0x1a/0x20
> [ 1239.307887]  [<ffffffff812fe73f>] service_operation+0x59f/0x9b0
> [ 1239.307889]  [<ffffffff810c28b0>] ? prepare_to_wait_event+0x100/0x100
> [ 1239.307891]  [<ffffffff810c28b0>] ? prepare_to_wait_event+0x100/0x100
> [ 1239.307893]  [<ffffffff812fbd12>] orangefs_readdir+0x172/0xd50
> [ 1239.307895]  [<ffffffff810c9a2d>] ? trace_hardirqs_on+0xd/0x10
> [ 1239.307898]  [<ffffffff8120e4bf>] iterate_dir+0x9f/0x130
> [ 1239.307899]  [<ffffffff8120e9d0>] SyS_getdents+0xa0/0x140
> [ 1239.307900]  [<ffffffff8120e550>] ? iterate_dir+0x130/0x130
> [ 1239.307903]  [<ffffffff8178302f>] entry_SYSCALL_64_fastpath+0x12/0x76
> [ 1239.307904] ---[ end trace 66a9a15ad78b3dea ]---

Bloody hell.  OK, something's seriously fishy here.  You have
service_operation() called from orangefs_readdir().  It had been
immediately preceded by
        ret = orangefs_readdir_index_get(&bufmap, &buffer_index);
returning a success.  Moreover, *all* paths after return from
service_operation() pass through some instance of
	orangefs_readdir_index_put(bufmap, buffer_index);
(possibly hidden behind readdir_handle_dtor(bufmap, &rhandle)).
We'd bloody better _not_ have had bufmap freed by that point, right?

OTOH, this service_operation() has just seen orangefs_get_bufmap_init()
returning 0.  Since
int orangefs_get_bufmap_init(void) 
{ 
        return __orangefs_bufmap ? 1 : 0;
}
we have observed NULL __orangefs_bufmap at that point.  __orangefs_bufmap
is static, its address is never taken and the only assignments to it are
static void orangefs_bufmap_unref(struct orangefs_bufmap *bufmap)
{
        if (atomic_dec_and_lock(&bufmap->refcnt, &orangefs_bufmap_lock)) {
                __orangefs_bufmap = NULL;
                spin_unlock(&orangefs_bufmap_lock);

                orangefs_bufmap_unmap(bufmap);
                orangefs_bufmap_free(bufmap);
        }
}
and
        bufmap = orangefs_bufmap_alloc(user_desc);
        if (!bufmap)
                goto out;

        ret = orangefs_bufmap_map(bufmap, user_desc);
        if (ret)
                goto out_free_bufmap;

        spin_lock(&orangefs_bufmap_lock);
        if (__orangefs_bufmap) {
                spin_unlock(&orangefs_bufmap_lock);
                gossip_err("orangefs: error: bufmap already initialized.\n");
                ret = -EALREADY;
                goto out_unmap_bufmap;
        }
        __orangefs_bufmap = bufmap;
        spin_unlock(&orangefs_bufmap_lock);

The latter is definitely *not* making it NULL - that's the daemon asking to
install a new one.  And the former is unconditionally followed by
kfree(bufmap) (via orangefs_bufmap_free()).

IOW, if this WARN_ON() is triggered, you've already gotten to freeing the
damn thing.  And subsequent crash appears to match that theory.

The really weird thing is that orangefs_readdir_index_get() increments the
refcount, so having hit that __orangefs_bufmap = NULL; while we sat in
service_operation() means that refcounting of bufmap got broken somehow...

orangefs_bufmap_alloc() creates it with refcount 1 and is very shortly followed
either by orangefs_bufmap_free() or by __orangefs_bufmap changing from NULL to
the pointer to created instance.

orangefs_bufmap_ref() either returns NULL or non-NULL value equal to
__orangefs_bufmap, in the latter case having bumped its refcount.

orangefs_bufmap_unref() drops refcount on its argument and if it has
reached 0, clears __orangefs_bufmap and frees the argument.  That's
a bit of a red flag - either we are always called with argument equal to
__orangefs_bufmap, or we might end up with a leak...

Nothing else changes their refcounts.  In orangefs_bufmap_{size,shift}_query()
a successful orangefs_bufmap_ref() is immediately followed by
orangefs_bufmap_unref() (incidentally, both might as well have just
grabbed orangefs_bufmap_lock, checked __orangefs_bufmap and picked its
->desc_{size,shift} before dropping the lock - no need to mess with refcounts
in those two).

Remaining callers of orangefs_bufmap_ref() are orangefs_bufmap_get() and
orangefs_readdir_index_get().  Those begin with grabbing a reference (and
failing if __orangefs_bufmap was NULL) and eventually either returning a
zero and storing the acquired pointer in *mapp, or dropping the reference
and returning non-zero.  More serious red flag, BTW - *mapp is still set
in the latter case, which might end up with confused callers...

Let's see... bufmap_get is called in wait_for_direct_io() and returning
a negative is followed by buggering off to
        if (buffer_index >= 0) {
                orangefs_bufmap_put(bufmap, buffer_index);
                gossip_debug(GOSSIP_FILE_DEBUG,
                             "%s(%pU): PUT buffer_index %d\n",
                             __func__, handle, buffer_index);
                buffer_index = -1;
        }
Uh-oh...
	1) can that thing return a positive value?
	2) can it end up storing something in buffer_index and still
failing?
	3) can we get there with non-negative buffer_index?

OK, (1) and (2) translate into the same question for wait_for_a_slot(),
which returns 0, -EINTR or -ETIMEDOUT and only overwrites buffer_index in
case when it returns 0.  And (3)...  Fuck. (3) is possible.  Look:
        if (ret == -EAGAIN && op_state_purged(new_op)) {
                orangefs_bufmap_put(bufmap, buffer_index);
                gossip_debug(GOSSIP_FILE_DEBUG,
                             "%s:going to repopulate_shared_memory.\n",
                             __func__);
                goto populate_shared_memory;
        }
in wait_for_direct_io() sends us to populate_shared_memory with buffer_index
unchanged despite the fact that we'd done orangefs_bufmap_put() on it.  And
there we hit
        /* get a shared buffer index */
        ret = orangefs_bufmap_get(&bufmap, &buffer_index);
        if (ret < 0) {
                gossip_debug(GOSSIP_FILE_DEBUG,
                             "%s: orangefs_bufmap_get failure (%ld)\n",
                             __func__, (long)ret);
                goto out;
with failure followed by _another_ orangefs_bufmap_put() around out:

OK, there's your buggered refcounting.  The minimal fix is to slap
buffer_index = -1; right after the orangefs_bufmap_put() in there.

The orangefs_readdir_index_get() caller simply buggers off on failure.  No problem
there, and longer term that's what I'd suggest doing in wait_for_direct_io()
as well.  Anyway, could you try this on top of your for-next and see if you
can reproduce either WARN_ON?

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index d865b58..40b3805 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -210,6 +210,7 @@ populate_shared_memory:
 	 */
 	if (ret == -EAGAIN && op_state_purged(new_op)) {
 		orangefs_bufmap_put(bufmap, buffer_index);
+		buffer_index = -1;
 		gossip_debug(GOSSIP_FILE_DEBUG,
 			     "%s:going to repopulate_shared_memory.\n",
 			     __func__);


* Re: Orangefs ABI documentation
  2016-02-06 19:42                                                 ` Al Viro
@ 2016-02-07  1:38                                                   ` Al Viro
  2016-02-07  3:53                                                     ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-07  1:38 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

> > As for the WARN_ONs, the waitqueue one is easy to hit when the
> > client-core stops and restarts, you can see here where precopy_buffers
> > started whining about the client-core, you can see that the client
> > core restarted when the debug mask got sent back over, and then
> > the WARN_ON in waitqueue gets hit:

> > [ 1239.198976] precopy_buffers: Failed to copy-in buffers. Please make
> > sure that  the pvfs2-client is running. -14

Very interesting...

Looks like there's another bug in restart handling.  Namely, restart happening
on write() tries to fetch more data from iter, without bothering to rewind to
where it used to be.  That's where those -EFAULT are coming from.  Easy to fix,
fortunately - on top of the double-free fix, apply the following:

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index 40b3805..c767ec7 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -133,6 +133,7 @@ static ssize_t wait_for_direct_io(enum ORANGEFS_io_type type, struct inode *inod
 	struct orangefs_khandle *handle = &orangefs_inode->refn.khandle;
 	struct orangefs_bufmap *bufmap = NULL;
 	struct orangefs_kernel_op_s *new_op = NULL;
+	struct iov_iter saved = *iter;
 	int buffer_index = -1;
 	ssize_t ret;
 
@@ -211,6 +212,8 @@ populate_shared_memory:
 	if (ret == -EAGAIN && op_state_purged(new_op)) {
 		orangefs_bufmap_put(bufmap, buffer_index);
 		buffer_index = -1;
+		if (type == ORANGEFS_IO_WRITE)
+			*iter = saved;
 		gossip_debug(GOSSIP_FILE_DEBUG,
 			     "%s:going to repopulate_shared_memory.\n",
 			     __func__);


* Re: Orangefs ABI documentation
  2016-02-07  1:38                                                   ` Al Viro
@ 2016-02-07  3:53                                                     ` Al Viro
  2016-02-07 20:01                                                       ` [RFC] bufmap-related wait logics (Re: Orangefs ABI documentation) Al Viro
  2016-02-08 22:26                                                       ` Orangefs ABI documentation Mike Marshall
  0 siblings, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-02-07  3:53 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Sun, Feb 07, 2016 at 01:38:35AM +0000, Al Viro wrote:
> > > As for the WARN_ONs, the waitqueue one is easy to hit when the
> > > client-core stops and restarts, you can see here where precopy_buffers
> > > started whining about the client-core, you can see that the client
> > > core restarted when the debug mask got sent back over, and then
> > > the WARN_ON in waitqueue gets hit:
> 
> > > [ 1239.198976] precopy_buffers: Failed to copy-in buffers. Please make
> > > sure that  the pvfs2-client is running. -14
> 
> Very interesting...
> 
> Looks like there's another bug in restart handling.  Namely, restart happening
> on write() tries to fetch more data from iter, without bothering to rewind to
> where it used to be.  That's where those -EFAULT are coming from.  Easy to fix,
> fortunately - on top of the double-free fix, apply the following:


BTW, could you try to reproduce that WARN_ON with these two patches added
and with bufmap debugging turned on?  Both double-free and lack of rewinding
are real; I can see scenarios where they would trigger, and I'm pretty sure
that the latter is triggering in your reproducer.  Moreover, I'm absolutely
sure that spurious dropping of bufmap references is happening there; what I'm
not sure is whether it was on this double-free or on something else...


* [RFC] bufmap-related wait logics (Re: Orangefs ABI documentation)
  2016-02-07  3:53                                                     ` Al Viro
@ 2016-02-07 20:01                                                       ` Al Viro
  2016-02-08 22:26                                                       ` Orangefs ABI documentation Mike Marshall
  1 sibling, 0 replies; 111+ messages in thread
From: Al Viro @ 2016-02-07 20:01 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Sun, Feb 07, 2016 at 03:53:31AM +0000, Al Viro wrote:

> BTW, could you try to reproduce that WARN_ON with these two patches added
> and with bufmap debugging turned on?  Both double-free and lack of rewinding
> are real; I can see scenarios where they would trigger, and I'm pretty sure
> that the latter is triggering in your reproducer.  Moreover, I'm absolutely
> sure that spurious dropping of bufmap references is happening there; what I'm
> not sure is whether it was on this double-free or on something else...

AFAICS, with bufmap we have 6 kinds of events -
	1) daemon installs a bufmap
	2) daemon shuts down
	3) wait_for_direct_io() requests a read/write slot
	4) orangefs_readdir() requests a readdir slot
	5) wait_for_direct_io() releases a slot
	6) orangefs_readdir() releases a slot
and the whole thing can be described via two counters and two waitqueues.
Rules:

Initially C1 = C2 = -1
(1)	if C1 >= 0
		sod off, we'd already installed that thing
	else
		C1 = number of read/write slots
		wake up that many of those who wait on Q1
		C2 = number of readdir slots
		wake up that many of those who wait on Q2

(2)	C1 -= number of read/write slots + 1
	C2 -= number of readdir slots + 1
	wait on Q1 for C1 == -1
	wait on Q2 for C2 == -1

(3)	if C1 <= 0
		end = now + 15 minutes
		while true
			if C1 < 0
				interruptibly wait on Q1 for (C1 > 0)
						up to min(end - now, 30s)
				if C1 < 0
					return -ETIMEDOUT
			else
				interruptibly wait on Q1 for (C1 > 0)
						up to end - now, exclusive
			if C1 > 0
				break
			if signal arrived
				return -EINTR
			if now after end
				return -ETIMEDOUT
	C1--, and grab a slot in read/write slots bitmap

(5)	release a slot in bitmap; C1++; wake up Q1

(4,6)	same as (3,5) with s/C1/C2/, s/Q1/Q2/

I'd probably use Q1.lock for serializing C1 and Q2.lock for C2; the only
obstacle is the lack of timeout versions of
	wait_event_interruptible{,_exclusive}_locked()
(and obscene identifier length of such beasts, of course).

The really annoying thing is that it's very similar to a couple of counting
semaphores; home-grown wait primitive is almost always a Bad Idea(tm) and
if somebody sees a sane way to cobble that out of higher-level ones, I'd
very much prefer that.  Suggestions?


* Re: Orangefs ABI documentation
  2016-02-07  3:53                                                     ` Al Viro
  2016-02-07 20:01                                                       ` [RFC] bufmap-related wait logics (Re: Orangefs ABI documentation) Al Viro
@ 2016-02-08 22:26                                                       ` Mike Marshall
  2016-02-08 23:35                                                         ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-08 22:26 UTC (permalink / raw)
  To: Al Viro, Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

Hey Al...

I studied the relevant parts of the code in the context of your
several mail messages from this weekend so I could get the
most benefit from them... thanks...

Then I applied the patches you suggested and ran some tests,
things are much better now...

I can't make the kernel crash, or get the WARN_ON to trigger.

The way I run my test (dbench) there's a warmup phase which
involves file and directory creation, and then an execute phase,
which also does some reading of the created files.

My impression is that dbench is more likely to fail ungracefully
if I signal the client-core to abort during the execute phase, and
more likely to complete normally if I signal the client-core to abort
during warmup (or cleanup, which removes the directory tree
built during warmup).

I'll do more tests tomorrow with more debug turned on, and see if
I can get some idea of what makes dbench so ill... the most important
thing is that the kernel doesn't crash, but it would be gravy if user
processes could better withstand a client-core recycle.

Here's the bufmap debug output, I didn't want to send 700k
of mostly "orangefs_bufmap_copy_from_iovec" to the list:

http://myweb.clemson.edu/~hubcap/out

grepping for "finalize" in all that noise is a good way to see
where client-core restarts happened. I ran dbench numerous
times, and managed to signal the client-core to restart during
the same run several times...

-Mike

On Sat, Feb 6, 2016 at 10:53 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Feb 07, 2016 at 01:38:35AM +0000, Al Viro wrote:
>> > > As for the WARN_ONs, the waitqueue one is easy to hit when the
>> > > client-core stops and restarts, you can see here where precopy_buffers
>> > > started whining about the client-core, you can see that the client
>> > > core restarted when the debug mask got sent back over, and then
>> > > the WARN_ON in waitqueue gets hit:
>>
>> > > [ 1239.198976] precopy_buffers: Failed to copy-in buffers. Please make
>> > > sure that  the pvfs2-client is running. -14
>>
>> Very interesting...
>>
>> Looks like there's another bug in restart handling.  Namely, restart happening
>> on write() tries to fetch more data from iter, without bothering to rewind to
>> where it used to be.  That's where those -EFAULT are coming from.  Easy to fix,
>> fortunately - on top of the double-free fix, apply the following:
>
>
> BTW, could you try to reproduce that WARN_ON with these two patches added
> and with bufmap debugging turned on?  Both double-free and lack of rewinding
> are real; I can see scenarios where they would trigger, and I'm pretty sure
> that the latter is triggering in your reproducer.  Moreover, I'm absolutely
> sure that spurious dropping of bufmap references is happening there; what I'm
> not sure is whether it was on this double-free or on something else...


* Re: Orangefs ABI documentation
  2016-02-08 22:26                                                       ` Orangefs ABI documentation Mike Marshall
@ 2016-02-08 23:35                                                         ` Al Viro
  2016-02-09  3:32                                                           ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-08 23:35 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Mon, Feb 08, 2016 at 05:26:53PM -0500, Mike Marshall wrote:

> My impression is that dbench is more likely to fail ungracefully
> if I signal the client-core to abort during the execute phase, and
> more likely to complete normally if I signal the client-core to abort
> during warmup (or cleanup, which removes the directory tree
> built during warmup).
> 
> I'll do more tests tomorrow with more debug turned on, and see if
> I can get some idea of what makes dbench so ill... the most important
> thing is that the kernel doesn't crash, but it would be gravy if user
> processes could better withstand a client-core recycle.
> 
> Here's the bufmap debug output, I didn't want to send 700k
> of mostly "orangefs_bufmap_copy_from_iovec" to the list:
> 
> http://myweb.clemson.edu/~hubcap/out

From the look of it, quite a few are of the "getattr after create has failed
due to restart" - AFAICS, that has nothing to do with slot-related issues,
but readv/writev failures with -EINTR *are* slot-related (in addition to
the problem with EINTR handling in callers of wait_for_direct_io() mentioned
upthread)...

> grepping for "finalize" in all that noise is a good way to see
> where client-core restarts happened. I ran dbench numerous
> times, and managed to signal the client-core to restart during
> the same run several times...

FWIW, restart treatment is definitely still not right - we *do* want to
wait for bufmap to run down before trying to claim the slot on restart.
The problems with the current code are:
	* more than one process holding slots during the daemon restart =>
possible for the first one to get through dropping the slot and into
requesting the new one before the second one manages to drop its slot.
Restarted daemon won't be able to insert its bufmap in that case.
	* one process holding a slot during restart => very likely EIO,
since it ends up dropping the last reference to old bufmap and finds
NULL __orangefs_bufmap if it gets there before the new daemon.
	* lack of wait for daemon restart anyplace reachable (the one in
service_operation() can be reached only if we are about to do double-free,
already having bufmap refcounts buggered; with that crap fixed, you are
not hitting that logics anymore).

I think I have a solution for the slot side of that, but I'm still not happy
with the use of low-level waitqueue primitives in that sucker, which is almost
always a Very Bad Sign(tm).

I'm not sure what to do with operations that go into restart directly in
service_operation() - it looks like it ought to wait for op_timeout_secs
and if the damn thing isn't back *and* hasn't managed to service the operation
by that time, we are screwed.  Note that this
                if (op_state_purged(op)) {      
                        ret = (op->attempts < ORANGEFS_PURGE_RETRY_COUNT) ?
                                 -EAGAIN :
                                 -EIO;
                        gossip_debug(GOSSIP_WAIT_DEBUG,
                                     "*** %s:"
                                     " operation purged (tag "
                                     "%llu, %p, att %d)\n",
                                     __func__,
                                     llu(op->tag),
                                     op,
                                     op->attempts);
                        orangefs_clean_up_interrupted_operation(op);
                        break;
                }
is hit only if the daemon got stopped after we'd submitted the operation;
if it had been stopped before entering service_operation() and new one hadn't
managed to catch up with requests in that interval, no attempts to resubmit
are made.

And AFAICS
        if (is_daemon_in_service() < 0) {
                /*
                 * By incrementing the per-operation attempt counter, we
                 * directly go into the timeout logic while waiting for
                 * the matching downcall to be read
                 */
                gossip_debug(GOSSIP_WAIT_DEBUG,
                             "%s:client core is NOT in service(%d).\n",
                             __func__,
                             is_daemon_in_service());
                op->attempts++;
        }
is inherently racy.  Suppose the daemon gets stopped during the window
between that check and the moment request is added to the queue.  If it had
been stopped past that window, op would've been purged; if it had been
stopped before the check, we get ->attempts bumped, but if it happens _within_
that window, in wait_for_matching_downcall() you'll be getting to
                /*
                 * if this was our first attempt and client-core
                 * has not purged our operation, we are happy to
                 * simply wait
                 */
                if (op->attempts == 0 && !op_state_purged(op)) {
                        spin_unlock(&op->lock);
                        schedule();
with nothing to wake you up until the new daemon gets started.  Sure, it's
an interruptible sleep, so you can always kill -9 the sucker, but it still
looks wrong.


* Re: Orangefs ABI documentation
  2016-02-08 23:35                                                         ` Al Viro
@ 2016-02-09  3:32                                                           ` Al Viro
  2016-02-09 14:34                                                             ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-09  3:32 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Mon, Feb 08, 2016 at 11:35:35PM +0000, Al Viro wrote:

> And AFAICS
>         if (is_daemon_in_service() < 0) {
>                 /*
>                  * By incrementing the per-operation attempt counter, we
>                  * directly go into the timeout logic while waiting for
>                  * the matching downcall to be read
>                  */
>                 gossip_debug(GOSSIP_WAIT_DEBUG,
>                              "%s:client core is NOT in service(%d).\n",
>                              __func__,
>                              is_daemon_in_service());
>                 op->attempts++;
>         }
> is inherently racy.  Suppose the daemon gets stopped during the window
> between that check and the moment request is added to the queue.  If it had
> been stopped past that window, op would've been purged; if it had been
> stopped before the check, we get ->attempts bumped, but if it happens _within_
> that window, in wait_for_matching_downcall() you'll be getting to
>                 /*
>                  * if this was our first attempt and client-core
>                  * has not purged our operation, we are happy to
>                  * simply wait
>                  */
>                 if (op->attempts == 0 && !op_state_purged(op)) {
>                         spin_unlock(&op->lock);
>                         schedule();
> with nothing to wake you up until the new daemon gets started.  Sure, it's
> an interruptible sleep, so you can always kill -9 the sucker, but it still
> looks wrong.

BTW, what is wait_for_matching_downcall() trying to do if the purge happens
just as it's being entered?  We put ourselves on op->waitq.  Then we check
op_state_serviced() - OK, it hadn't been.  We check for signals; normally
there won't be any.  Then we see that the state is purged and proceed to
do schedule_timeout(op_timeout_secs * HZ).  Which will be a plain and simple
"sleep for 10 seconds", since the wakeup on op->waitq has happened on purge.
Before we'd done prepare_to_wait().  That'll be followed by returning
-ETIMEDOUT and failing service_operation() with no attempts to retry.
OTOH, if the purge happens a few cycles later, after we'd entered
the loop and done prepare_to_wait(), we'll end up in retry logics.
Looks bogus...

FWIW, what all these bugs show is that manual use of wait queues is very
easy to get wrong.  If nothing else, I'd be tempted to replace op->waitq
with another struct completion and kill those loops in wait_for_..._downcall()
entirely - wait_for_completion_interruptible_timeout() *ONCE* and then
look at the state.  With complete() done in op_set_state_{purged,serviced}...
Objections?


* Re: Orangefs ABI documentation
  2016-02-09  3:32                                                           ` Al Viro
@ 2016-02-09 14:34                                                             ` Mike Marshall
  2016-02-09 17:40                                                               ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-09 14:34 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

 > Objections?

Heck no... I've been trying to keep from changing the protocol so as to
avoid making a whole nother project out of keeping the out-of-tree
Frankenstein version of the kernel module going, but getting this version
of the kernel module upstream and getting it infused with ideas from you
depth-of-knowledge folks is the real goal here.

You're talking about changing orangefs_kernel_op_s (pvfs2_kernel_op_t
out of tree) and it doesn't cross the boundary into userspace... even if
it did, that "completion" structure looks like it has been around
as long as any of the Linux versions we try to run on...

-Mike

On Mon, Feb 8, 2016 at 10:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Feb 08, 2016 at 11:35:35PM +0000, Al Viro wrote:
>
>> And AFAICS
>>         if (is_daemon_in_service() < 0) {
>>                 /*
>>                  * By incrementing the per-operation attempt counter, we
>>                  * directly go into the timeout logic while waiting for
>>                  * the matching downcall to be read
>>                  */
>>                 gossip_debug(GOSSIP_WAIT_DEBUG,
>>                              "%s:client core is NOT in service(%d).\n",
>>                              __func__,
>>                              is_daemon_in_service());
>>                 op->attempts++;
>>         }
>> is inherently racy.  Suppose the daemon gets stopped during the window
>> between that check and the moment request is added to the queue.  If it had
>> been stopped past that window, op would've been purged; if it had been
>> stopped before the check, we get ->attempts bumped, but if it happens _within_
>> that window, in wait_for_matching_downcall() you'll be getting to
>>                 /*
>>                  * if this was our first attempt and client-core
>>                  * has not purged our operation, we are happy to
>>                  * simply wait
>>                  */
>>                 if (op->attempts == 0 && !op_state_purged(op)) {
>>                         spin_unlock(&op->lock);
>>                         schedule();
>> with nothing to wake you up until the new daemon gets started.  Sure, it's
>> an interruptible sleep, so you can always kill -9 the sucker, but it still
>> looks wrong.
>
> BTW, what is wait_for_matching_downcall() trying to do if the purge happens
> just as it's being entered?  We put ourselves on op->waitq.  Then we check
> op_state_serviced() - OK, it hadn't been.  We check for signals; normally
> there won't be any.  Then we see that the state is purged and proceed to
> do schedule_timeout(op_timeout_secs * HZ).  Which will be a plain and simple
> "sleep for 10 seconds", since the wakeup on op->waitq has happened on purge.
> Before we'd done prepare_to_wait().  That'll be followed by returning
> -ETIMEDOUT and failing service_operation() with no attempts to retry.
> OTOH, if the purge happens a few cycles later, after we'd entered
> the loop and done prepare_to_wait(), we'll end up in retry logics.
> Looks bogus...
>
> FWIW, what all these bugs show is that manual use of wait queues is very
> easy to get wrong.  If nothing else, I'd be tempted to replace op->waitq
> with another struct completion and kill those loops in wait_for_..._downcall()
> entirely - wait_for_completion_interruptible_timeout() *ONCE* and then
> look at the state.  With complete() done in op_set_state_{purged,serviced}...
> Objections?


* Re: Orangefs ABI documentation
  2016-02-09 14:34                                                             ` Mike Marshall
@ 2016-02-09 17:40                                                               ` Al Viro
  2016-02-09 21:06                                                                 ` Al Viro
  2016-02-09 22:02                                                                 ` Mike Marshall
  0 siblings, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-02-09 17:40 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Tue, Feb 09, 2016 at 09:34:12AM -0500, Mike Marshall wrote:
>  > Objections?
> 
> Heck no... I've been trying to keep from changing the protocol so as to
> avoid making a whole nother project out of keeping the out-of-tree
> Frankenstein version of the kernel module going, but getting this version
> of the kernel module upstream and getting it infused with ideas from you
> depth-of-knowledge folks is the real goal here.
> 
> You're talking about changing orangefs_kernel_op_s (pvfs2_kernel_op_t
> out of tree) and it doesn't cross the boundary into userspace... even if
> it did, that "completion" structure looks like it has been been around
> as long as any of the Linux versions we try to run on...

OK.  While we are at it...  Remember the question about the need for devreq
->write_iter() to wait for wait_for_direct_io() to finish copying the data
from slots to final destination?  You said that removing that wait ends up
with daemon somehow stomping on those slots and I wonder if that was
another effect of that double-free bug.

Could you try, on top of those fixes, comment the entire
        if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO) {
                long n = wait_for_completion_interruptible_timeout(&op->done,
                                                        op_timeout_secs * HZ);
                if (unlikely(n < 0)) {
                        gossip_debug(GOSSIP_DEV_DEBUG,
                                "%s: signal on I/O wait, aborting\n",
                                __func__);
                } else if (unlikely(n == 0)) {
                        gossip_debug(GOSSIP_DEV_DEBUG,
                                "%s: timed out.\n",
                                __func__);
                }
        }
in orangefs_devreq_write_iter() out and see if the corruption happens?


* Re: Orangefs ABI documentation
  2016-02-09 17:40                                                               ` Al Viro
@ 2016-02-09 21:06                                                                 ` Al Viro
  2016-02-09 22:25                                                                   ` Mike Marshall
  2016-02-11 23:36                                                                   ` Mike Marshall
  2016-02-09 22:02                                                                 ` Mike Marshall
  1 sibling, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-02-09 21:06 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Tue, Feb 09, 2016 at 05:40:49PM +0000, Al Viro wrote:

> Could you try, on top of those fixes, comment the entire
>         if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO) {
>                 long n = wait_for_completion_interruptible_timeout(&op->done,
>                                                         op_timeout_secs * HZ);
>                 if (unlikely(n < 0)) {
>                         gossip_debug(GOSSIP_DEV_DEBUG,
>                                 "%s: signal on I/O wait, aborting\n",
>                                 __func__);
>                 } else if (unlikely(n == 0)) {
>                         gossip_debug(GOSSIP_DEV_DEBUG,
>                                 "%s: timed out.\n",
>                                 __func__);
>                 }
>         }
> in orangefs_devreq_write_iter() out and see if the corruption happens?

Another thing: what's the protocol rules regarding the cancels?  The current
code looks very odd - if we get a hit by a signal after the daemon has
picked e.g. read request but before it had replied, we will call
orangefs_cancel_op_in_progress(), which will call service_operation() with
ORANGEFS_OP_CANCELLATION.  And that'll insert the cancel request
into list and practically immediately notice that we have a pending signal,
remove the cancel request from the list and bugger off.  With daemon almost
certainly *not* getting to see it at all.

I've asked that before; if anybody has explained that, I've missed the reply.
How the fuck is that supposed to work?  Forget the kernel-side implementation
details, what should the daemon see in such situation?

I would expect something like "you can't reuse a slot until operation has
been either completed or purged or a cancel had been sent and ACKed by
the daemon".  Is that what is intended?  If so, the handling of cancels might
be better off asynchronous - let the slot freeing be done after the cancel
had been ACKed and _not_ in the context of original syscall...

There are some traces of AIO support in that thing; could this be a victim of
trimming async parts for submission into the mainline?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-09 17:40                                                               ` Al Viro
  2016-02-09 21:06                                                                 ` Al Viro
@ 2016-02-09 22:02                                                                 ` Mike Marshall
  2016-02-09 22:16                                                                   ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-09 22:02 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

Yes... I remember... I think you are referring to my reply in
Message-ID: CAOg9mSSH=LuKyGiVthVajZFc6d=hGWGeLE8G9Y9d5B+g1-2sEg@mail.gmail.com
in this thread...

I just commented those lines out again, and ran tests...
both with and without signaling the client-core to restart.

dbench never complained and completed normally across
restarts every time except the last, where it failed and the
"Failed to allocate orangefs file inode" error was emitted from
orangefs_create.

Until recently I ran everything, including the server, on the same
VM. Currently I am mounting my Orangefs filesystem from a
four-server setup on other VMs... it can be pretty bad
news for a userspace filesystem when the kernel crashes on
the machine it is running on <g>...

-Mike


On Tue, Feb 9, 2016 at 12:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 09, 2016 at 09:34:12AM -0500, Mike Marshall wrote:
>>  > Objections?
>>
>> Heck no... I've been trying to keep from changing the protocol so as to
>> avoid making a whole nother project out of keeping the out-of-tree
>> Frankenstein version of the kernel module going, but getting this version
>> of the kernel module upstream and getting it infused with ideas from you
>> depth-of-knowledge folks is the real goal here.
>>
>> You're talking about changing orangefs_kernel_op_s (pvfs2_kernel_op_t
>> out of tree) and it doesn't cross the boundary into userspace... even if
>> it did, that "completion" structure looks like it has been been around
>> as long as any of the Linux versions we try to run on...
>
> OK.  While we are at it...  Remember the question about the need for devreq
> ->write_iter() to wait wait_for_direct_io() to finish copying the data
> from slots to final destination?  You said that removing that wait ends up
> with daemon somehow stomping on those slots and I wonder if that was
> another effect of that double-free bug.
>
> Could you try, on top of those fixes, comment the entire
>         if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO) {
>                 long n = wait_for_completion_interruptible_timeout(&op->done,
>                                                         op_timeout_secs * HZ);
>                 if (unlikely(n < 0)) {
>                         gossip_debug(GOSSIP_DEV_DEBUG,
>                                 "%s: signal on I/O wait, aborting\n",
>                                 __func__);
>                 } else if (unlikely(n == 0)) {
>                         gossip_debug(GOSSIP_DEV_DEBUG,
>                                 "%s: timed out.\n",
>                                 __func__);
>                 }
>         }
> in orangefs_devreq_write_iter() out and see if the corruption happens?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-09 22:02                                                                 ` Mike Marshall
@ 2016-02-09 22:16                                                                   ` Al Viro
  2016-02-09 22:40                                                                     ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-09 22:16 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Tue, Feb 09, 2016 at 05:02:59PM -0500, Mike Marshall wrote:
> Yes... I remember... I think you are referring to my reply in
> Message-ID: CAOg9mSSH=LuKyGiVthVajZFc6d=hGWGeLE8G9Y9d5B+g1-2sEg@mail.gmail.com
> in this thread...
> 
> I just commented those lines out again, and ran tests...
> both with and without signaling the client-core to restart.
> 
> dbench never complained and completed normally across
> restarts every time except the last, where it failed and the
> "Failed to allocate orangefs file inode" error was emitted from
> orangefs_create.
> 
> Until recently I ran everything, including the server, on the same
> VM. Currently I am mounting my Orangefs filesystem from a
> four-server setup on other VMs... it can be pretty bad
> news for a userspace filesystem when the kernel crashes on
> the machine it is running on <g>...

OK...  Then the plan is
	* sort out the cancel semantics
	* replace op->waitq with completion and kill the loops in
wait_for_..._reply()
	* deal with slot allocations vs. bufmap removals
	* kill op->done (along with waiting in devreq write_iter and
complete(&op->done) in file.c)
	* sort out the treatment of signal interrupting file IO
	* test the living hell out of it, including the data corruption
checks, etc.

If that works out, we'll be in sane shape wrt wait-related stuff.  It still
leaves input validation and related fun, but those are at least synchronous
and single-threaded issues.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-09 21:06                                                                 ` Al Viro
@ 2016-02-09 22:25                                                                   ` Mike Marshall
  2016-02-11 23:36                                                                   ` Mike Marshall
  1 sibling, 0 replies; 111+ messages in thread
From: Mike Marshall @ 2016-02-09 22:25 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

 > what should the daemon see in such situation?

I'll work to see if I can get an opinion on that from
some of the others...

As to the traces of AIO... I'm not sure how that ever worked.
The out of tree kernel module never had the
address_space_operations direct_IO call-out; I think
AIO, the way it works in modern kernels, requires that?

I fooled with it a couple of years ago when I had
Christoph to mentor me, but implementing it and
getting it to work right were hard and didn't
seem as important as other stuff. How the
stuff that was in there was supposed to work seems
kind of lost-to-history. Perhaps the stuff that was
there was designed to work with that libaio userspace
library, but not the io_setup, io_destroy, io_submit
etc. stuff that is in modern kernels?

-Mike

On Tue, Feb 9, 2016 at 4:06 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 09, 2016 at 05:40:49PM +0000, Al Viro wrote:
>
>> Could you try, on top of those fixes, comment the entire
>>         if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO) {
>>                 long n = wait_for_completion_interruptible_timeout(&op->done,
>>                                                         op_timeout_secs * HZ);
>>                 if (unlikely(n < 0)) {
>>                         gossip_debug(GOSSIP_DEV_DEBUG,
>>                                 "%s: signal on I/O wait, aborting\n",
>>                                 __func__);
>>                 } else if (unlikely(n == 0)) {
>>                         gossip_debug(GOSSIP_DEV_DEBUG,
>>                                 "%s: timed out.\n",
>>                                 __func__);
>>                 }
>>         }
>> in orangefs_devreq_write_iter() out and see if the corruption happens?
>
> Another thing: what's the protocol rules regarding the cancels?  The current
> code looks very odd - if we get a hit by a signal after the daemon has
> picked e.g. read request but before it had replied, we will call
> orangefs_cancel_op_in_progress(), which will call service_operation() with
> ORANGEFS_OP_CANCELLATION which will.  And that'll insert the cancel request
> into list and practically immediately notice that we have a pending signal,
> remove the cancel request from the list and bugger off.  With daemon almost
> certainly *not* getting to see it at all.
>
> I've asked that before if anybody has explained that, I've missed that reply.
> How the fuck is that supposed to work?  Forget the kernel-side implementation
> details, what should the daemon see in such situation?
>
> I would expect something like "you can't reuse a slot until operation has
> been either completed or purged or a cancel had been sent and ACKed by
> the daemon".  Is that what is intended?  If so, the handling of cancels might
> be better off asynchronous - let the slot freeing be done after the cancel
> had been ACKed and _not_ in the context of original syscall...
>
> There are some traces of AIO support in that thing; could this be a victim of
> trimming async parts for submission into the mainline?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-09 22:16                                                                   ` Al Viro
@ 2016-02-09 22:40                                                                     ` Al Viro
  2016-02-09 23:13                                                                       ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-09 22:40 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Tue, Feb 09, 2016 at 10:16:23PM +0000, Al Viro wrote:
> 	* sort out the cancel semantics

OK, it's definitely bogus right now.  Look: read() grabs a slot and
issues read request.  Daemon picks it and starts processing, copying
the data into the corresponding part of shared memory.  read() gets
interrupted and tries to issue a cancel, which fails to even be seen
by the daemon, since wait_for_cancel_downcall() sees the pending signal
and rips the cancel request out of the list.  read() marks the slot
free and buggers off.  write() is called by another process, picks the
same slot and starts copying the userland data into the same part of
shared memory.  Where it's overwritten by daemon still processing the
read request - it hadn't seen any indications of things going wrong.

This is obviously wrong - killed read() should *NOT* end up with the
data it would've returned being silently mixed into the data being
written by write() in unrelated process on unrelated file.

And the version in orangefs-2.9.3.tar.gz (your Frankenstein module?) is
vulnerable to the same race.  2.8.1 isn't - it ignores signals on the
cancel, but that means waiting for cancel to be processed (or timed out)
on any interrupted read() before we return to userland.  We can return
to that behaviour, of course, but I suspect that offloading it to something
async (along with freeing the slot used by original operation) would be
better from QoI point of view.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-09 22:40                                                                     ` Al Viro
@ 2016-02-09 23:13                                                                       ` Al Viro
  2016-02-10 16:44                                                                         ` Al Viro
  2016-02-11  0:44                                                                         ` Al Viro
  0 siblings, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-02-09 23:13 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Tue, Feb 09, 2016 at 10:40:50PM +0000, Al Viro wrote:

> And the version in orangefs-2.9.3.tar.gz (your Frankenstein module?) is
> vulnerable to the same race.  2.8.1 isn't - it ignores signals on the
> cancel, but that means waiting for cancel to be processed (or timed out)
> on any interrupted read() before we return to userland.  We can return
> to that behaviour, of course, but I suspect that offloading it to something
> async (along with freeing the slot used by original operation) would be
> better from QoI point of view.

That breakage had been introduced between 2.8.5 and 2.8.6 (at some point
during the spring of 2012).  AFAICS, all versions starting with 2.8.6 are
vulnerable...

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-09 23:13                                                                       ` Al Viro
@ 2016-02-10 16:44                                                                         ` Al Viro
  2016-02-10 21:26                                                                           ` Al Viro
  2016-02-11 23:54                                                                           ` Mike Marshall
  2016-02-11  0:44                                                                         ` Al Viro
  1 sibling, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-02-10 16:44 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Tue, Feb 09, 2016 at 11:13:28PM +0000, Al Viro wrote:
> On Tue, Feb 09, 2016 at 10:40:50PM +0000, Al Viro wrote:
> 
> > And the version in orangefs-2.9.3.tar.gz (your Frankenstein module?) is
> > vulnerable to the same race.  2.8.1 isn't - it ignores signals on the
> > cancel, but that means waiting for cancel to be processed (or timed out)
> > on any interrupted read() before we return to userland.  We can return
> > to that behaviour, of course, but I suspect that offloading it to something
> > async (along with freeing the slot used by original operation) would be
> > better from QoI point of view.
> 
> That breakage had been introduced between 2.8.5 and 2.8.6 (at some point
> during the spring of 2012).  AFAICS, all versions starting with 2.8.6 are
> vulnerable...

BTW, what about kill -9 delivered to readdir in progress?  There's no
cancel for those (and AFAICS the daemon will reject cancel on anything
other than FILE_IO), so what's to stop another thread from picking the
same readdir slot and getting (daemon-side) two of them spewing into
the same area of shared memory?  Is it simply that daemon-side the shared
memory on readdir is touched only upon request completion in completely
serialized process_vfs_requests()?  That doesn't seem to be enough -
suppose the second readdir request completes (daemon-side) first, its results
get packed into shared memory slot and it is reported to kernel, which
proceeds to repack and copy that data to userland.  In the meanwhile,
daemon completes the _earlier_ readdir and proceeds to pack its results into
the same slot of shared memory.  Sure, the kernel won't take that (the
op with the matching tag has been gone already), but the data is stored
into shared memory *before* writev() on the control device that would pass
the response to the kernel, so it still gets overwritten.  Right under
decoding readdir()...

Or is there something in the daemon that would guarantee readdir responses
to happen in the same order in which it had picked the requests?  I'm not
familiar enough with that beast (and overall control flow in there is, er,
not the most transparent one I've seen), so I might be missing something,
but I don't see anything obvious that would guarantee such ordering.

Please, clarify.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-10 16:44                                                                         ` Al Viro
@ 2016-02-10 21:26                                                                           ` Al Viro
  2016-02-11 23:54                                                                           ` Mike Marshall
  1 sibling, 0 replies; 111+ messages in thread
From: Al Viro @ 2016-02-10 21:26 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Wed, Feb 10, 2016 at 04:44:36PM +0000, Al Viro wrote:
> > That breakage had been introduced between 2.8.5 and 2.8.6 (at some point
> > during the spring of 2012).  AFAICS, all versions starting with 2.8.6 are
> > vulnerable...
> 
> BTW, what about kill -9 delivered to readdir in progress?  There's no
> cancel for those (and AFAICS the daemon will reject cancel on anything
> other than FILE_IO), so what's to stop another thread from picking the
> same readdir slot and getting (daemon-side) two of them spewing into
> the same area of shared memory?  Is it simply that daemon-side the shared
> memory on readdir is touched only upon request completion in completely
> serialized process_vfs_requests()?  That doesn't seem to be enough -
> suppose the second readdir request completes (daemon-side) first, its results
> get packed into shared memory slot and it is reported to kernel, which
> proceeds to repack and copy that data to userland.  In the meanwhile,
> daemon completes the _earlier_ readdir and proceeds to pack its results into
> the same slot of shared memory.  Sure, the kernel won't take that (the
> op with the matching tag has been gone already), but the data is stored
> into shared memory *before* writev() on the control device that would pass
> the response to the kernel, so it still gets overwritten.  Right under
> decoding readdir()...
> 
> Or is there something in the daemon that would guarantee readdir responses
> to happen in the same order in which it had picked the requests?  I'm not
> familiar enough with that beast (and overall control flow in there is, er,
> not the most transparent one I've seen), so I might be missing something,
> but I don't see anything obvious that would guarantee such ordering.
> 
> Please, clarify.

Two more questions:
	* why do we need cancel to be held back while we are going through
ORANGEFS_DEV_REMOUNT_ALL?  IOW, why do we need to take request_mutex for
them?
	* your ->kill_sb() starts with telling daemon that fs is gone,
then proceeds to evict dentries/inodes.  Sure, you don't have page cache
(or that would've been instantly fatal - dirty pages would need to be
written out, for one thing), but why do it in this order?  IOW, why not
_start_ with kill_anon_super(), then do the rest of the work?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-09 23:13                                                                       ` Al Viro
  2016-02-10 16:44                                                                         ` Al Viro
@ 2016-02-11  0:44                                                                         ` Al Viro
  2016-02-11  3:22                                                                           ` Mike Marshall
  1 sibling, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-11  0:44 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Tue, Feb 09, 2016 at 11:13:28PM +0000, Al Viro wrote:
> On Tue, Feb 09, 2016 at 10:40:50PM +0000, Al Viro wrote:
> 
> > And the version in orangefs-2.9.3.tar.gz (your Frankenstein module?) is
> > vulnerable to the same race.  2.8.1 isn't - it ignores signals on the
> > cancel, but that means waiting for cancel to be processed (or timed out)
> > on any interrupted read() before we return to userland.  We can return
> > to that behaviour, of course, but I suspect that offloading it to something
> > async (along with freeing the slot used by original operation) would be
> > better from QoI point of view.
> 
> That breakage had been introduced between 2.8.5 and 2.8.6 (at some point
> during the spring of 2012).  AFAICS, all versions starting with 2.8.6 are
> vulnerable...

Matter of fact, older versions are _also_ broken, but it's much harder
to trigger; you need the daemon to stall long enough for read() to time
out, then everything as before.  After 2.8.5 all it takes on the read()
side is a SIGKILL to read() caller just after the daemon has picked your
request...

TBH, it's tempting to kill the "wait for cancel to be finished" logics
completely, mark cancel requests with a flag and stash the slot number
into them.  Then, if flag is present, let set_op_state_purged() and
set_op_state_serviced() callers release the slot and drop the request.
And have wait_for_direct_io() treat the "need to cancel" case as "fire
and forget" - no waiting for anything, no releasing the slots, just free
the original op and bugger off.

The set_op_state_purged() part is where it gets delicate - we want to remove
the sucker from list/hash before dropping it.  AFAICS, it's doable - there's
nothing in the "release slot and drop the request" that would not be allowed
under the list/hash spinlocks...

In principle, readdir problem could also be handled in a similar way, but
there we'd need to have the interrupted readdir itself marked with that
new flag and left in in-progress hash, so that when the response finally
arrives it would be found and slot freeing would be handled ;-/  Doable,
but not fun...

If there is (or at least supposed to be) something that prevents completions
of readdir requests (on unrelated directories, by different processes, etc.)
out of order, PLEASE SAY SO.  I would really prefer not to have to fight
the readdir side of that mess; cancels are already bad enough ;-/

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-11  0:44                                                                         ` Al Viro
@ 2016-02-11  3:22                                                                           ` Mike Marshall
  2016-02-12  4:27                                                                             ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-11  3:22 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

> If there is (or at least supposed to be) something that prevents completions
> of readdir requests (on unrelated directories, by different processes, etc.)
> out of order, PLEASE SAY SO.  I would really prefer not to have to fight
> the readdir side of that mess; cancels are already bad enough ;-/

Hi Al... your ideas sound good to me, I'll try to get you good
answers on stuff like the above sometime tomorrow...

Thanks!

-Mike

On Wed, Feb 10, 2016 at 7:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 09, 2016 at 11:13:28PM +0000, Al Viro wrote:
>> On Tue, Feb 09, 2016 at 10:40:50PM +0000, Al Viro wrote:
>>
>> > And the version in orangefs-2.9.3.tar.gz (your Frankenstein module?) is
>> > vulnerable to the same race.  2.8.1 isn't - it ignores signals on the
>> > cancel, but that means waiting for cancel to be processed (or timed out)
>> > on any interrupted read() before we return to userland.  We can return
>> > to that behaviour, of course, but I suspect that offloading it to something
>> > async (along with freeing the slot used by original operation) would be
>> > better from QoI point of view.
>>
>> That breakage had been introduced between 2.8.5 and 2.8.6 (at some point
>> during the spring of 2012).  AFAICS, all versions starting with 2.8.6 are
>> vulnerable...
>
> Matter of fact, older versions are _also_ broken, but it's much harder
> to trigger; you need the daemon to stall long enough for read() to time
> out, then everything as before.  After 2.8.5 all it takes on the read()
> side is a SIGKILL to read() caller just after the daemon has picked your
> request...
>
> TBH, it's tempting to kill the "wait for cancel to be finished" logics
> completely, mark cancel requests with a flag and stash the slot number
> into them.  Then, if flag is present, let set_op_state_purged() and
> set_op_state_serviced() callers release the slot and drop the request.
> And have wait_for_direct_io() treat the "need to cancel" case as "fire
> and forget" - no waiting for anything, no releasing the slots, just free
> the original op and bugger off.
>
> The set_op_state_purged() part is where it gets delicate - we want to remove
> the sucker from list/hash before dropping it.  AFAICS, it's doable - there's
> nothing in the "release slot and drop the request" that would not be allowed
> under the list/hash spinlocks...
>
> In principle, readdir problem could also be handled in a similar way, but
> there we'd need to have the interrupted readdir itself marked with that
> new flag and left in in-progress hash, so that when the response finally
> arrives it would be found and slot freeing would be handled ;-/  Doable,
> but not fun...
>
> If there is (or at least supposed to be) something that prevents completions
> of readdir requests (on unrelated directories, by different processes, etc.)
> out of order, PLEASE SAY SO.  I would really prefer not to have to fight
> the readdir side of that mess; cancels are already bad enough ;-/

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-09 21:06                                                                 ` Al Viro
  2016-02-09 22:25                                                                   ` Mike Marshall
@ 2016-02-11 23:36                                                                   ` Mike Marshall
  1 sibling, 0 replies; 111+ messages in thread
From: Mike Marshall @ 2016-02-11 23:36 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

 >  what should the daemon see in such situation?

I agree that it looks like the client-core doesn't notice that a
process doing IO was cancelled. I don't think the client-core
keeps track of what slots are in use, it just trusts that the
buffer-index in any IO upcall is safe to use.  I believe that
wait_for_cancellation_downcall has its roots in the old
AIO code, and ended up at some point getting used
outside of that context.

-Mike





On Tue, Feb 9, 2016 at 4:06 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 09, 2016 at 05:40:49PM +0000, Al Viro wrote:
>
>> Could you try, on top of those fixes, comment the entire
>>         if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO) {
>>                 long n = wait_for_completion_interruptible_timeout(&op->done,
>>                                                         op_timeout_secs * HZ);
>>                 if (unlikely(n < 0)) {
>>                         gossip_debug(GOSSIP_DEV_DEBUG,
>>                                 "%s: signal on I/O wait, aborting\n",
>>                                 __func__);
>>                 } else if (unlikely(n == 0)) {
>>                         gossip_debug(GOSSIP_DEV_DEBUG,
>>                                 "%s: timed out.\n",
>>                                 __func__);
>>                 }
>>         }
>> in orangefs_devreq_write_iter() out and see if the corruption happens?
>
> Another thing: what's the protocol rules regarding the cancels?  The current
> code looks very odd - if we get a hit by a signal after the daemon has
> picked e.g. read request but before it had replied, we will call
> orangefs_cancel_op_in_progress(), which will call service_operation() with
> ORANGEFS_OP_CANCELLATION which will.  And that'll insert the cancel request
> into list and practically immediately notice that we have a pending signal,
> remove the cancel request from the list and bugger off.  With daemon almost
> certainly *not* getting to see it at all.
>
> I've asked that before if anybody has explained that, I've missed that reply.
> How the fuck is that supposed to work?  Forget the kernel-side implementation
> details, what should the daemon see in such situation?
>
> I would expect something like "you can't reuse a slot until operation has
> been either completed or purged or a cancel had been sent and ACKed by
> the daemon".  Is that what is intended?  If so, the handling of cancels might
> be better off asynchronous - let the slot freeing be done after the cancel
> had been ACKed and _not_ in the context of original syscall...
>
> There are some traces of AIO support in that thing; could this be a victim of
> trimming async parts for submission into the mainline?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-10 16:44                                                                         ` Al Viro
  2016-02-10 21:26                                                                           ` Al Viro
@ 2016-02-11 23:54                                                                           ` Mike Marshall
  2016-02-12  0:55                                                                             ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-11 23:54 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

> Sure, the kernel won't take that (the
> op with the matching tag has been gone already), but the data is stored
> into shared memory *before* writev() on the control device that would pass
> the response to the kernel, so it still gets overwritten.  Right under
> decoding readdir()...

The readdir buffer isn't a shared buffer like the IO buffer is.
The readdir buffer is preallocated when the client-core starts up
though. The kernel module picks which readdir buffer slot that
the client-core fills, but gets back a copy of that buffer - the
trailer. Unless the kernel module isn't managing the buffer slots
properly, the client core shouldn't have more than one upcall
on-hand that specifies any particular buffer slot. The "kill -9"
on a ls (or whatever) might lead to such mis-management,
but since readdir decoding is happening on a discrete copy
of the buffer slot that was filled by the client-core, it doesn't
seem to me like it could be overwritten during a decode...

I believe there's nothing in userspace that guarantees that
readdirs are replied to in the same order they are received...

-Mike

On Wed, Feb 10, 2016 at 11:44 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 09, 2016 at 11:13:28PM +0000, Al Viro wrote:
>> On Tue, Feb 09, 2016 at 10:40:50PM +0000, Al Viro wrote:
>>
>> > And the version in orangefs-2.9.3.tar.gz (your Frankenstein module?) is
>> > vulnerable to the same race.  2.8.1 isn't - it ignores signals on the
>> > cancel, but that means waiting for cancel to be processed (or timed out)
>> > on any interrupted read() before we return to userland.  We can return
>> > to that behaviour, of course, but I suspect that offloading it to something
>> > async (along with freeing the slot used by original operation) would be
>> > better from QoI point of view.
>>
>> That breakage had been introduced between 2.8.5 and 2.8.6 (at some point
>> during the spring of 2012).  AFAICS, all versions starting with 2.8.6 are
>> vulnerable...
>
> BTW, what about kill -9 delivered to readdir in progress?  There's no
> cancel for those (and AFAICS the daemon will reject cancel on anything
> other than FILE_IO), so what's to stop another thread from picking the
> same readdir slot and getting (daemon-side) two of them spewing into
> the same area of shared memory?  Is it simply that daemon-side the shared
> memory on readdir is touched only upon request completion in completely
> serialized process_vfs_requests()?  That doesn't seem to be enough -
> suppose the second readdir request completes (daemon-side) first, its results
> get packed into shared memory slot and it is reported to kernel, which
> proceeds to repack and copy that data to userland.  In the meanwhile,
> daemon completes the _earlier_ readdir and proceeds to pack its results into
> the same slot of shared memory.  Sure, the kernel won't take that (the
> op with the matching tag has been gone already), but the data is stored
> into shared memory *before* writev() on the control device that would pass
> the response to the kernel, so it still gets overwritten.  Right under
> decoding readdir()...
>
> Or is there something in the daemon that would guarantee readdir responses
> to happen in the same order in which it had picked the requests?  I'm not
> familiar enough with that beast (and overall control flow in there is, er,
> not the most transparent one I've seen), so I might be missing something,
> but I don't see anything obvious that would guarantee such ordering.
>
> Please, clarify.


* Re: Orangefs ABI documentation
  2016-02-11 23:54                                                                           ` Mike Marshall
@ 2016-02-12  0:55                                                                             ` Al Viro
  2016-02-12 12:13                                                                               ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-12  0:55 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Thu, Feb 11, 2016 at 06:54:43PM -0500, Mike Marshall wrote:

> The readdir buffer isn't a shared buffer like the IO buffer is.
> The readdir buffer is preallocated when the client-core starts up
> though. The kernel module picks which readdir buffer slot that
> the client-core fills, but gets back a copy of that buffer - the
> trailer. Unless the kernel module isn't managing the buffer slots
> properly, the client core shouldn't have more than one upcall
> on-hand that specifies any particular buffer slot. The "kill -9"
> on a ls (or whatever) might lead to such mis-management,
> but since readdir decoding is happening on a discrete copy
> of the buffer slot that was filled by the client-core, it doesn't
> seem to me like it could be overwritten during a decode...
> 
> I believe there's nothing in userspace that guarantees that
> readdirs are replied to in the same order they are received...

So... why the hell does readdir need to be aware of those slots?  After
rereading this code I have to agree - it *does* copy that stuff, and no
sharing is happening at all.  But the only thing that buf_index is used
for is "which buffer do we memcpy() to before passing it to writev()?"
in the daemon, and it looks like it would've done just as well using
plain and simple malloc() in
    /* get a buffer for xfer of dirents */
    vfs_request->out_downcall.trailer_buf =
        PINT_dev_get_mapped_buffer(BM_READDIR, s_io_desc,
            vfs_request->in_upcall.req.readdir.buf_index);
in the copy_dirents_to_downcall()...

Why is that even a part of protocol?  Were you planning to do zero-copy
there at some point, but hadn't gotten around to that yet?  If you do
(and decide to do it via shared buffers rather than e.g. splice), please
add cancel for readdir before going there...


* Re: Orangefs ABI documentation
  2016-02-11  3:22                                                                           ` Mike Marshall
@ 2016-02-12  4:27                                                                             ` Al Viro
  2016-02-12 12:26                                                                               ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-12  4:27 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Wed, Feb 10, 2016 at 10:22:40PM -0500, Mike Marshall wrote:
> > If there is (or at least supposed to be) something that prevents completions
> > of readdir requests (on unrelated directories, by different processes, etc.)
> > out of order, PLEASE SAY SO.  I would really prefer not to have to fight
> > the readdir side of that mess; cancels are already bad enough ;-/
> 
> Hi Al... your ideas sound good to me, I'll try to get you good
> answers on stuff like the above sometime tomorrow...

OK, this is really, really completely untested, might chew your data,
bugger your dog, etc.  OTOH, if it somehow fails to do the above, it
ought to deal with cancels properly.

Pushed into #orangefs-untested, along with two wait_for_direct_io() fixes
discussed upthread.  This is _not_ all - it still needs saner "wait for slot"
logics, switching op->waitq to completion/killing loop in
wait_for_matching_downcall(), etc.


* Re: Orangefs ABI documentation
  2016-02-12  0:55                                                                             ` Al Viro
@ 2016-02-12 12:13                                                                               ` Mike Marshall
  0 siblings, 0 replies; 111+ messages in thread
From: Mike Marshall @ 2016-02-12 12:13 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

 I believed there was some value in preallocating
the readdir buffer, sort of the same way it is
valuable to preallocate those slab buffers
and then get them and release them
with kmem_cache_alloc and kmem_cache_free.

By default there are five readdir slots. Stuff has to
wait on an available slot if all five slots are in
use. I've made the waiting kick in a few times
when running my "stupid human" test, and it
seems to work right. If the ability to keep
getting more and more buffers were unbounded
I guess it could go run-away like some student's
fork bomb... I'm pretty sure from talking to others
that the waiting-for-a-slot code hardly ever
kicks in on the production cluster on campus.
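The bounded-slot behaviour described above - a fixed number of slots, with callers blocking until one frees up - amounts to a counting semaphore over the preallocated buffers. A userspace sketch (the slot count and function names are assumptions, not the real client-core API):

```c
#include <assert.h>
#include <pthread.h>

#define NSLOTS 5	/* assumed default readdir slot count */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t slot_free = PTHREAD_COND_INITIALIZER;
static int free_slots = NSLOTS;

/* Block until a readdir slot is available, then claim it. */
static void readdir_slot_get(void)
{
	pthread_mutex_lock(&lock);
	while (free_slots == 0)
		pthread_cond_wait(&slot_free, &lock);
	free_slots--;
	pthread_mutex_unlock(&lock);
}

static void readdir_slot_put(void)
{
	pthread_mutex_lock(&lock);
	if (free_slots++ == 0)
		pthread_cond_signal(&slot_free);	/* wake one waiter */
	pthread_mutex_unlock(&lock);
}
```

Because `free_slots` never exceeds `NSLOTS`, a run-away caller can tie up at most five buffers before it blocks - which is why the fork-bomb scenario stays bounded.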

Martin said he saw commit messages in the
old CVS repository that indicated that the
original developers did indeed plan to make
the readdir buffer a shared buffer.

-Mike

On Thu, Feb 11, 2016 at 7:55 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Feb 11, 2016 at 06:54:43PM -0500, Mike Marshall wrote:
>
>> The readdir buffer isn't a shared buffer like the IO buffer is.
>> The readdir buffer is preallocated when the client-core starts up
>> though. The kernel module picks which readdir buffer slot that
>> the client-core fills, but gets back a copy of that buffer - the
>> trailer. Unless the kernel module isn't managing the buffer slots
>> properly, the client core shouldn't have more than one upcall
>> on-hand that specifies any particular buffer slot. The "kill -9"
>> on a ls (or whatever) might lead to such mis-management,
>> but since readdir decoding is happening on a discrete copy
>> of the buffer slot that was filled by the client-core, it doesn't
>> seem to me like it could be overwritten during a decode...
>>
>> I believe there's nothing in userspace that guarantees that
>> readdirs are replied to in the same order they are received...
>
> So... why the hell does readdir need to be aware of those slots?  After
> rereading this code I have to agree - it *does* copy that stuff, and no
> sharing is happening at all.  But the only thing that buf_index is used
> for is "which buffer do we memcpy() to before passing it to writev()?"
> in the daemon, and it looks like it would've done just as well using
> plain and simple malloc() in
>     /* get a buffer for xfer of dirents */
>     vfs_request->out_downcall.trailer_buf =
>         PINT_dev_get_mapped_buffer(BM_READDIR, s_io_desc,
>             vfs_request->in_upcall.req.readdir.buf_index);
> in the copy_dirents_to_downcall()...
>
> Why is that even a part of protocol?  Were you planning to do zero-copy
> there at some point, but hadn't gotten around to that yet?  If you do
> (and decide to do it via shared buffers rather than e.g. splice), please
> add cancel for readdir before going there...


* Re: Orangefs ABI documentation
  2016-02-12  4:27                                                                             ` Al Viro
@ 2016-02-12 12:26                                                                               ` Mike Marshall
  2016-02-12 18:00                                                                                 ` Martin Brandenburg
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-12 12:26 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell

I'll get the patches today... I have about five small patches
that aren't pushed out to github or kernel.org yet, some
cosmetic patches and a couple of things you suggested
in mail messages... if they get in a fight with your
new patches I'll just ditch them and re-do whichever
ones of them are still needed after I've got your
new stuff tested.

Thanks!

-Mike

On Thu, Feb 11, 2016 at 11:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Feb 10, 2016 at 10:22:40PM -0500, Mike Marshall wrote:
>> > If there is (or at least supposed to be) something that prevents completions
>> > of readdir requests (on unrelated directories, by different processes, etc.)
>> > out of order, PLEASE SAY SO.  I would really prefer not to have to fight
>> > the readdir side of that mess; cancels are already bad enough ;-/
>>
>> Hi Al... your ideas sound good to me, I'll try to get you good
>> answers on stuff like the above sometime tomorrow...
>
> OK, this is really, really completely untested, might chew your data,
> bugger your dog, etc.  OTOH, if it somehow fails to do the above, it
> ought to deal with cancels properly.
>
> Pushed into #orangefs-untested, along with two wait_for_direct_io() fixes
> discussed upthread.  This is _not_ all - it still needs saner "wait for slot"
> logics, switching op->waitq to completion/killing loop in
> wait_for_matching_downcall(), etc.


* Re: Orangefs ABI documentation
  2016-02-12 12:26                                                                               ` Mike Marshall
@ 2016-02-12 18:00                                                                                 ` Martin Brandenburg
  2016-02-13 17:18                                                                                   ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Martin Brandenburg @ 2016-02-12 18:00 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Al Viro, Linus Torvalds, linux-fsdevel, Stephen Rothwell

I have some patches for the kernel and our userspace code which
eliminate the useless readdir buffers. They're a few months old at
this point.

The problem is that this is already part of the protocol. Unless we
decide to change it, we can't very well get out of supporting this.
Personally I want to clean this up while we still have the chance. We
already plan to only support this module from the latest OrangeFS and
up.

In either case there's no reason it needs to be so confusing and imply
it's shared.

Mike, there wouldn't be an unlimited number of buffers. It's still
limited by the number of ops which are pre-allocated.

-- Martin

On 2/12/16, Mike Marshall <hubcap@omnibond.com> wrote:
> I'll get the patches today... I have about five small patches
> that aren't pushed out to github or kernel.org yet, some
> cosmetic patches and a couple of things you suggested
> in mail messages... if they get in a fight with your
> new patches I'll just ditch them and re-do whichever
> ones of them are still needed after I've got your
> new stuff tested.
>
> Thanks!
>
> -Mike
>
> On Thu, Feb 11, 2016 at 11:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Wed, Feb 10, 2016 at 10:22:40PM -0500, Mike Marshall wrote:
>>> > If there is (or at least supposed to be) something that prevents
>>> > completions
>>> > of readdir requests (on unrelated directories, by different processes,
>>> > etc.)
>>> > out of order, PLEASE SAY SO.  I would really prefer not to have to
>>> > fight
>>> > the readdir side of that mess; cancels are already bad enough ;-/
>>>
>>> Hi Al... your ideas sound good to me, I'll try to get you good
>>> answers on stuff like the above sometime tomorrow...
>>
>> OK, this is really, really completely untested, might chew your data,
>> bugger your dog, etc.  OTOH, if it somehow fails to do the above, it
>> ought to deal with cancels properly.
>>
>> Pushed into #orangefs-untested, along with two wait_for_direct_io() fixes
>> discussed upthread.  This is _not_ all - it still needs saner "wait for
>> slot"
>> logics, switching op->waitq to completion/killing loop in
>> wait_for_matching_downcall(), etc.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* Re: Orangefs ABI documentation
  2016-02-12 18:00                                                                                 ` Martin Brandenburg
@ 2016-02-13 17:18                                                                                   ` Mike Marshall
  2016-02-13 17:47                                                                                     ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-13 17:18 UTC (permalink / raw)
  To: Martin Brandenburg
  Cc: Al Viro, Linus Torvalds, linux-fsdevel, Stephen Rothwell

I added the patches, and ran a bunch of tests.

Stuff works fine when left unbothered, and also
when wrenches are thrown into the works.

I had multiple userspace things going on at the
same time, dbench, ls -R, find... kill -9 or control-C on
any of them is handled well. When I killed both
the client-core and its restarter, the kernel
dealt with a swarm of ops that had nowhere
to go... the WARN_ON in service_operation
was hit.

Feb 12 16:19:12 be1 kernel: [ 3658.167544] orangefs: please confirm
that pvfs2-client daemon is running.
Feb 12 16:19:12 be1 kernel: [ 3658.167547] fs/orangefs/dir.c line 264:
orangefs_readdir: orangefs_readdir_index_get() failure (-5)
Feb 12 16:19:12 be1 kernel: [ 3658.170741] ------------[ cut here ]------------
Feb 12 16:19:12 be1 kernel: [ 3658.170746] WARNING: CPU: 0 PID: 1667
at fs/orangefs/waitqueue.c:203 service_operation+0x4f6/0x7f0()
Feb 12 16:19:12 be1 kernel: [ 3658.170747] Modules linked in:
ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
nf_defrag_ipv6 bnep bluetooth nf_conntrack_ipv4 nf_defrag_ipv4 rfkill
xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc
ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw
ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw
ppdev parport_pc virtio_balloon pvpanic parport serio_raw 8139too
i2c_piix4 virtio_console uinput qxl drm_kms_helper ttm drm 8139cp
i2c_core virtio_pci virtio virtio_ring mii ata_generic pata_acpi
Feb 12 16:19:12 be1 kernel: [ 3658.170770] CPU: 0 PID: 1667 Comm:
dbench Not tainted 4.4.0-161995-gda335c6 #62
Feb 12 16:19:12 be1 kernel: [ 3658.170771] Hardware name: Red Hat KVM,
BIOS 0.5.1 01/01/2011
Feb 12 16:19:12 be1 kernel: [ 3658.170772]  0000000000000000
000000001371c2af ffff88000c89fc08 ffffffff8139c7dd
Feb 12 16:19:12 be1 kernel: [ 3658.170774]  ffff88000c89fc40
ffffffff8108e510 ffff88000cb58000 ffff88000cb5c1d0
Feb 12 16:19:12 be1 kernel: [ 3658.170776]  ffff88000cb5c188
00000000fffffff5 0000000000000000 ffff88000c89fc50
Feb 12 16:19:12 be1 kernel: [ 3658.170778] Call Trace:
Feb 12 16:19:12 be1 kernel: [ 3658.170782]  [<ffffffff8139c7dd>]
dump_stack+0x19/0x1c
Feb 12 16:19:12 be1 kernel: [ 3658.170786]  [<ffffffff8108e510>]
warn_slowpath_common+0x80/0xc0
Feb 12 16:19:12 be1 kernel: [ 3658.170787]  [<ffffffff8108e65a>]
warn_slowpath_null+0x1a/0x20
Feb 12 16:19:12 be1 kernel: [ 3658.170788]  [<ffffffff812fe6a6>]
service_operation+0x4f6/0x7f0
Feb 12 16:19:12 be1 kernel: [ 3658.170791]  [<ffffffff810c28b0>] ?
prepare_to_wait_event+0x100/0x100
Feb 12 16:19:12 be1 kernel: [ 3658.170792]  [<ffffffff810c28b0>] ?
prepare_to_wait_event+0x100/0x100
Feb 12 16:19:12 be1 kernel: [ 3658.170794]  [<ffffffff812f2557>]
wait_for_direct_io+0x157/0x520
Feb 12 16:19:12 be1 kernel: [ 3658.170796]  [<ffffffff812f29d6>]
do_readv_writev+0xb6/0x2a0
Feb 12 16:19:12 be1 kernel: [ 3658.170797]  [<ffffffff812f2c75>]
orangefs_file_write_iter+0xb5/0x1a0
Feb 12 16:19:12 be1 kernel: [ 3658.170801]  [<ffffffff811f962c>]
__vfs_write+0xcc/0x100
Feb 12 16:19:12 be1 kernel: [ 3658.170802]  [<ffffffff811f9c81>]
vfs_write+0xa1/0x190
Feb 12 16:19:12 be1 kernel: [ 3658.170804]  [<ffffffff811fabe7>]
SyS_pwrite64+0x87/0xb0
Feb 12 16:19:12 be1 kernel: [ 3658.170807]  [<ffffffff81782faf>]
entry_SYSCALL_64_fastpath+0x12/0x76
Feb 12 16:19:12 be1 kernel: [ 3658.170808] ---[ end trace 9335703ea9225d7b ]---

I run xfstests a very clunky way... I left it running when I left
the office on Friday. I'll grep through the output on my
terminal <blush> on Monday to see if there's any regressions...

-Mike

On Fri, Feb 12, 2016 at 1:00 PM, Martin Brandenburg <martin@omnibond.com> wrote:
> I have some patches for the kernel and our userspace code which
> eliminates the useless readdir buffers. They're a few months old at
> this point.
>
> The problem is that this is already part of the protocol. Unless we
> decide to change it, we can't very well get out of supporting this.
> Personally I want to clean this up while we still have the chance. We
> already plan to only support this module from the latest OrangeFS and
> up.
>
> In either case there's no reason it needs to be so confusing and imply
> it's shared.
>
> Mike, there wouldn't be an unlimited number of buffers. It's still
> limited by the number of ops which are pre-allocated.
>
> -- Martin
>
> On 2/12/16, Mike Marshall <hubcap@omnibond.com> wrote:
>> I'll get the patches today... I have about five small patches
>> that aren't pushed out to github or kernel.org yet, some
>> cosmetic patches and a couple of things you suggested
>> in mail messages... if they get in a fight with your
>> new patches I'll just ditch them and re-do whichever
>> ones of them are still needed after I've got your
>> new stuff tested.
>>
>> Thanks!
>>
>> -Mike
>>
>> On Thu, Feb 11, 2016 at 11:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>> On Wed, Feb 10, 2016 at 10:22:40PM -0500, Mike Marshall wrote:
>>>> > If there is (or at least supposed to be) something that prevents
>>>> > completions
>>>> > of readdir requests (on unrelated directories, by different processes,
>>>> > etc.)
>>>> > out of order, PLEASE SAY SO.  I would really prefer not to have to
>>>> > fight
>>>> > the readdir side of that mess; cancels are already bad enough ;-/
>>>>
>>>> Hi Al... your ideas sound good to me, I'll try to get you good
>>>> answers on stuff like the above sometime tomorrow...
>>>
>>> OK, this is really, really completely untested, might chew your data,
>>> bugger your dog, etc.  OTOH, if it somehow fails to do the above, it
>>> ought to deal with cancels properly.
>>>
>>> Pushed into #orangefs-untested, along with two wait_for_direct_io() fixes
>>> discussed upthread.  This is _not_ all - it still needs saner "wait for
>>> slot"
>>> logics, switching op->waitq to completion/killing loop in
>>> wait_for_matching_downcall(), etc.
>>


* Re: Orangefs ABI documentation
  2016-02-13 17:18                                                                                   ` Mike Marshall
@ 2016-02-13 17:47                                                                                     ` Al Viro
  2016-02-14  2:56                                                                                       ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-13 17:47 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Sat, Feb 13, 2016 at 12:18:12PM -0500, Mike Marshall wrote:
> I added the patches, and ran a bunch of tests.
> 
> Stuff works fine when left unbothered, and also
> when wrenches are thrown into the works.
> 
> I had multiple userspace things going on at the
> same time, dbench, ls -R, find... kill -9 or control-C on
> any of them is handled well. When I killed both
> the client-core and its restarter, the kernel
> dealt with swarm of ops that had nowhere
> to go... the WARN_ON in service_operation
> was hit.
> 
> Feb 12 16:19:12 be1 kernel: [ 3658.167544] orangefs: please confirm
> that pvfs2-client daemon is running.
> Feb 12 16:19:12 be1 kernel: [ 3658.167547] fs/orangefs/dir.c line 264:
> orangefs_readdir: orangefs_readdir_index_get() failure (-5)

I.e. bufmap is gone.

> Feb 12 16:19:12 be1 kernel: [ 3658.170741] ------------[ cut here ]------------
> Feb 12 16:19:12 be1 kernel: [ 3658.170746] WARNING: CPU: 0 PID: 1667
> at fs/orangefs/waitqueue.c:203 service_operation+0x4f6/0x7f0()

... and we are in wait_for_direct_io(), holding an r/w slot and finding
ourselves with bufmap already gone, despite not having freed that slot
yet.  Bloody wonderful - we still have bufmap refcounting buggered somewhere.

Which tree had that been?  Could you push that tree (having checked that
you don't have any uncommitted changes) in some branch?


* Re: Orangefs ABI documentation
  2016-02-13 17:47                                                                                     ` Al Viro
@ 2016-02-14  2:56                                                                                       ` Al Viro
  2016-02-14  3:46                                                                                         ` [RFC] slot allocator - waitqueue use review needed (Re: Orangefs ABI documentation) Al Viro
  2016-02-14 22:31                                                                                         ` Orangefs ABI documentation Mike Marshall
  0 siblings, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-02-14  2:56 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Sat, Feb 13, 2016 at 05:47:38PM +0000, Al Viro wrote:
> On Sat, Feb 13, 2016 at 12:18:12PM -0500, Mike Marshall wrote:
> > I added the patches, and ran a bunch of tests.
> > 
> > Stuff works fine when left unbothered, and also
> > when wrenches are thrown into the works.
> > 
> > I had multiple userspace things going on at the
> > same time, dbench, ls -R, find... kill -9 or control-C on
> > any of them is handled well. When I killed both
> > the client-core and its restarter, the kernel
> > dealt with swarm of ops that had nowhere
> > to go... the WARN_ON in service_operation
> > was hit.
> > 
> > Feb 12 16:19:12 be1 kernel: [ 3658.167544] orangefs: please confirm
> > that pvfs2-client daemon is running.
> > Feb 12 16:19:12 be1 kernel: [ 3658.167547] fs/orangefs/dir.c line 264:
> > orangefs_readdir: orangefs_readdir_index_get() failure (-5)
> 
> I.e. bufmap is gone.
> 
> > Feb 12 16:19:12 be1 kernel: [ 3658.170741] ------------[ cut here ]------------
> > Feb 12 16:19:12 be1 kernel: [ 3658.170746] WARNING: CPU: 0 PID: 1667
> > at fs/orangefs/waitqueue.c:203 service_operation+0x4f6/0x7f0()
> 
> ... and we are in wait_for_direct_io(), holding an r/w slot and finding
> ourselves with bufmap already gone, despite not having freed that slot
> yet.  Bloody wonderful - we still have bufmap refcounting buggered somewhere.
> 
> Which tree had that been?  Could you push that tree (having checked that
> you don't have any uncommitted changes) in some branch?

OK, at the very least there's this; should be folded into "orangefs: delay
freeing slot until cancel completes"

diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index 41f8bb1f..1e28555 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -261,6 +261,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
 {
 	spin_lock(&op->lock);
 	if (unlikely(op_is_cancel(op))) {
+		list_del(&op->list);
 		spin_unlock(&op->lock);
 		put_cancel(op);
 	} else {


* [RFC] slot allocator - waitqueue use review needed (Re: Orangefs ABI documentation)
  2016-02-14  2:56                                                                                       ` Al Viro
@ 2016-02-14  3:46                                                                                         ` Al Viro
  2016-02-14  4:06                                                                                           ` Al Viro
  2016-02-16  2:12                                                                                           ` Al Viro
  2016-02-14 22:31                                                                                         ` Orangefs ABI documentation Mike Marshall
  1 sibling, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-02-14  3:46 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel,
	Stephen Rothwell, Ingo Molnar

FWIW, I think I have a kinda-sorta solution for bufmap slot allocation/waiting;
if somebody has a better idea, I would love to drop the variant below.  And
I would certainly appreciate review - I hate messing with waitqueue primitives
and I know how easy it is to fuck those up ;-/

Below is a mockup of that thing:

/*  Three possible states: absent, installed and shutting down.
 *  install(map, count, bitmap) sets it up
 *  get(map)/put(map, slot) allocate and free resp.
 *  mark_killed(map) moves to shutdown state - no new allocations succeed until
 *  we reinstall it.
 *  run_down(map) waits for all allocations to be released; in the end, we
 *  are in the "absent" state again.
 *
 *  get() is not allowed to take longer than slot_timeout_secs seconds total;
 *  if the thing gets shut down and reinstalled during the wait, we are OK
 *  as long as reinstall comes within restart_timeout_secs.  For orangefs
 *  those default to 15 minutes and 30 seconds resp...
 */
struct slot_map {
	int c;			// absent -> -1
				// installed and full -> 0
				// installed with n slots free -> n
				// shutting down, with n slots in use -> -1-n
	wait_queue_head_t q;	// q.lock protects everything here.
	int count;
	unsigned long *map;
};

void install(struct slot_map *m, int count, unsigned long *map)
{
	spin_lock(&m->q.lock);
	m->c = m->count = count;
	m->map = map;
	wake_up_all_locked(&m->q);
	spin_unlock(&m->q.lock);
}

void mark_killed(struct slot_map *m)
{
	spin_lock(&m->q.lock);
	m->c -= m->count + 1;
	spin_unlock(&m->q.lock);
}

void run_down(struct slot_map *m)
{
	DEFINE_WAIT(wait);
	spin_lock(&m->q.lock);
#if 0
	// we don't have wait_event_locked(); might be worth adding.
	wait_event_locked(&m->q, m->c == -1);
#else
	// or we can open-code it
	if (m->c != -1) {
		for (;;) {
			if (likely(list_empty(&wait.task_list)))
				__add_wait_queue_tail(&m->q, &wait);
			set_current_state(TASK_UNINTERRUPTIBLE);

			if (m->c == -1)
				break;

			spin_unlock(&m->q.lock);
			schedule();
			spin_lock(&m->q.lock);
		}
		__remove_wait_queue(&m->q, &wait);
		__set_current_state(TASK_RUNNING);
	}
#endif
	m->map = NULL;
	spin_unlock(&m->q.lock);
}

void put(struct slot_map *m, int slot)
{
	int v;
	spin_lock(&m->q.lock);
	__clear_bit(slot, m->map);
	v = ++m->c;
	if (unlikely(v == 1))	/* no free slots -> one free slot */
		wake_up_locked(&m->q);
	else if (unlikely(v == -1))	/* finished dying */
		wake_up_all_locked(&m->q);
	spin_unlock(&m->q.lock);
}

static int wait_for_free(struct slot_map *m)
{
	long left = slot_timeout_secs * HZ;
	DEFINE_WAIT(wait);

#if 0
	// the trouble is, there's no wait_event_interruptible_timeout_locked()
	// might be worth adding...
	do {
		if (m->c > 0)
			break;
		if (m->c < 0) {
			/* we are waiting for map to be installed */
			/* it would better be there soon, or we go away */
			long n = left, t;
			if (n > restart_timeout_secs * HZ)
				n = restart_timeout_secs * HZ;
			t = wait_event_interruptible_timeout_locked(&m->q,
						m->c > 0, n);
			if (unlikely(t < 0) || (!t && m->c < 0))
				left = t;
			else
				left = t + (left - n);
		} else {
			/* just waiting for a slot to come free */
			left = wait_event_interruptible_timeout_locked(&m->q,
						m->c > 0, left);
		}
	} while (left > 0);
#else
	// or we can open-code it
	do {
		if (likely(list_empty(&wait.task_list)))
			__add_wait_queue_tail_exclusive(&m->q, &wait);
		set_current_state(TASK_INTERRUPTIBLE);

		if (m->c > 0)
			break;

		if (m->c < 0) {
			/* we are waiting for map to be installed */
			/* it would better be there soon, or we go away */
			long n = left, t;
			if (n > ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS * HZ)
				n = ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS * HZ;
			spin_unlock(&m->q.lock);
			t = schedule_timeout(n);
			spin_lock(&m->q.lock);
			if (unlikely(t < 0) || (!t && m->c < 0))
				left = t;
			else
				left = t + (left - n);
		} else {
			/* just waiting for a slot to come free */
			spin_unlock(&m->q.lock);
			left = schedule_timeout(left);
			spin_lock(&m->q.lock);
		}
	} while (left > 0);

	__remove_wait_queue(&m->q, &wait);
	__set_current_state(TASK_RUNNING);
#endif
	if (likely(left > 0))
		return 0;

	return left < 0 ? -EINTR : -ETIMEDOUT;
}

int get(struct slot_map *m)
{
	int res = 0;
	spin_lock(&m->q.lock);
	if (unlikely(m->c <= 0))
		res = wait_for_free(m);
	if (likely(!res)) {
		m->c--;
		res = find_first_zero_bit(m->map, m->count);
		__set_bit(res, m->map);
	}
	spin_unlock(&m->q.lock);
	return res;
}
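To make the counter encoding above concrete, here is a hypothetical userspace analogue, with a pthread mutex and condvar standing in for q.lock and the waitqueue. The timeout handling and the exclusive-wakeup detail are deliberately omitted, so this illustrates only the state machine in the `c` field, not a drop-in replacement for the kernel sketch:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct slot_map {
	int c;			/* same encoding as the kernel sketch above */
	int count;
	unsigned long *map;
	pthread_mutex_t lock;
	pthread_cond_t q;
};

void sm_install(struct slot_map *m, int count, unsigned long *map)
{
	pthread_mutex_lock(&m->lock);
	m->c = m->count = count;
	m->map = map;
	pthread_cond_broadcast(&m->q);
	pthread_mutex_unlock(&m->lock);
}

void sm_mark_killed(struct slot_map *m)
{
	pthread_mutex_lock(&m->lock);
	m->c -= m->count + 1;	/* n free -> -1 - (count - n) in-use encoding */
	pthread_mutex_unlock(&m->lock);
}

void sm_run_down(struct slot_map *m)
{
	pthread_mutex_lock(&m->lock);
	while (m->c != -1)	/* wait until the last slot is put back */
		pthread_cond_wait(&m->q, &m->lock);
	m->map = NULL;
	pthread_mutex_unlock(&m->lock);
}

void sm_put(struct slot_map *m, int slot)
{
	pthread_mutex_lock(&m->lock);
	m->map[slot / BITS_PER_LONG] &= ~(1UL << (slot % BITS_PER_LONG));
	int v = ++m->c;
	if (v == 1 || v == -1)	/* a slot came free, or shutdown finished */
		pthread_cond_broadcast(&m->q);	/* kernel version wakes one/all */
	pthread_mutex_unlock(&m->lock);
}

int sm_get(struct slot_map *m)
{
	int res = 0;
	pthread_mutex_lock(&m->lock);
	while (m->c <= 0)	/* absent, full, or dying: wait */
		pthread_cond_wait(&m->q, &m->lock);
	m->c--;
	/* open-coded find_first_zero_bit() */
	while (m->map[res / BITS_PER_LONG] & (1UL << (res % BITS_PER_LONG)))
		res++;
	m->map[res / BITS_PER_LONG] |= 1UL << (res % BITS_PER_LONG);
	pthread_mutex_unlock(&m->lock);
	return res;
}
```

The arithmetic in sm_mark_killed() is what makes the encoding work: with n slots free out of count, c == n, and c -= count + 1 yields -1 - (count - n), i.e. -1 minus the number of slots still in use; each subsequent put() then ticks c toward -1, at which point run_down() can proceed.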


* Re: [RFC] slot allocator - waitqueue use review needed (Re: Orangefs ABI documentation)
  2016-02-14  3:46                                                                                         ` [RFC] slot allocator - waitqueue use review needed (Re: Orangefs ABI documentation) Al Viro
@ 2016-02-14  4:06                                                                                           ` Al Viro
  2016-02-16  2:12                                                                                           ` Al Viro
  1 sibling, 0 replies; 111+ messages in thread
From: Al Viro @ 2016-02-14  4:06 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel,
	Stephen Rothwell, Ingo Molnar

On Sun, Feb 14, 2016 at 03:46:08AM +0000, Al Viro wrote:
> #if 0
> 	// the trouble is, there's no wait_event_interruptible_timeout_locked()

Sorry - it's wait_event_interruptible_timeout_locked_exclusive().  IOW, the
open-coded variant is doing the right thing; the ifdefed-out one needs
s/wait_event_interruptible_timeout_locked/&_exclusive/...


* Re: Orangefs ABI documentation
  2016-02-14  2:56                                                                                       ` Al Viro
  2016-02-14  3:46                                                                                         ` [RFC] slot allocator - waitqueue use review needed (Re: Orangefs ABI documentation) Al Viro
@ 2016-02-14 22:31                                                                                         ` Mike Marshall
  2016-02-14 23:43                                                                                           ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-14 22:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

I added the list_del...

Everything is very resilient, I killed
the client-core over and over while dbench
was running at the same time as  ls -R
was running, and the client-core always
restarted... until finally, it didn't. I guess
related to the state of just what was going on
at the time... Hit the WARN_ON in service_operation,
and then oopsed on the orangefs_bufmap_put
down at the end of wait_for_direct_io...

http://myweb.clemson.edu/~hubcap/after.list_del

-Mike

On Sat, Feb 13, 2016 at 9:56 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Feb 13, 2016 at 05:47:38PM +0000, Al Viro wrote:
>> On Sat, Feb 13, 2016 at 12:18:12PM -0500, Mike Marshall wrote:
>> > I added the patches, and ran a bunch of tests.
>> >
>> > Stuff works fine when left unbothered, and also
>> > when wrenches are thrown into the works.
>> >
>> > I had multiple userspace things going on at the
>> > same time, dbench, ls -R, find... kill -9 or control-C on
>> > any of them is handled well. When I killed both
>> > the client-core and its restarter, the kernel
>> > dealt with swarm of ops that had nowhere
>> > to go... the WARN_ON in service_operation
>> > was hit.
>> >
>> > Feb 12 16:19:12 be1 kernel: [ 3658.167544] orangefs: please confirm
>> > that pvfs2-client daemon is running.
>> > Feb 12 16:19:12 be1 kernel: [ 3658.167547] fs/orangefs/dir.c line 264:
>> > orangefs_readdir: orangefs_readdir_index_get() failure (-5)
>>
>> I.e. bufmap is gone.
>>
>> > Feb 12 16:19:12 be1 kernel: [ 3658.170741] ------------[ cut here ]------------
>> > Feb 12 16:19:12 be1 kernel: [ 3658.170746] WARNING: CPU: 0 PID: 1667
>> > at fs/orangefs/waitqueue.c:203 service_operation+0x4f6/0x7f0()
>>
>> ... and we are in wait_for_direct_io(), holding an r/w slot and finding
>> ourselves with bufmap already gone, despite not having freed that slot
>> yet.  Bloody wonderful - we still have bufmap refcounting buggered somewhere.
>>
>> Which tree had that been?  Could you push that tree (having checked that
>> you don't have any uncommitted changes) in some branch?
>
> OK, at the very least there's this; should be folded into "orangefs: delay
> freeing slot until cancel completes"
>
> diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
> index 41f8bb1f..1e28555 100644
> --- a/fs/orangefs/orangefs-kernel.h
> +++ b/fs/orangefs/orangefs-kernel.h
> @@ -261,6 +261,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
>  {
>         spin_lock(&op->lock);
>         if (unlikely(op_is_cancel(op))) {
> +               list_del(&op->list);
>                 spin_unlock(&op->lock);
>                 put_cancel(op);
>         } else {


* Re: Orangefs ABI documentation
  2016-02-14 22:31                                                                                         ` Orangefs ABI documentation Mike Marshall
@ 2016-02-14 23:43                                                                                           ` Al Viro
  2016-02-15 17:46                                                                                             ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-14 23:43 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Sun, Feb 14, 2016 at 05:31:10PM -0500, Mike Marshall wrote:
> I added the list_del...
> 
> Everything is very resilient, I killed
> the client-core over and over while dbench
> was running at the same time as  ls -R
> was running, and the client-core always
> restarted... until finally, it didn't. I guess
> related to the state of just what was going on
> at the time... Hit the WARN_ON in service_operation,
> and then oopsed on the orangefs_bufmap_put
> down at the end of wait_for_direct_io...

Bloody hell...  I think I see what's going on, and presumably the newer
slot allocator would fix that.  Look: closing control device (== daemon
death) checks if we have a bufmap installed and drops a reference to
it in that case.  The reason why it's conditional is that we might have
not gotten around to installing one (it's done via ioctl on control
device).  But ->release() does *NOT* wait for all references to go away!
In other words, it's possible to restart the daemon while the old bufmap
is still there.  Then have it killed after it has opened control devices
and before the old bufmap has run down.  For ->release() it looks like
we *have* gotten around to installing bufmap, and need the reference dropped.
In reality, the reference acquired when we were installing that one has
already been dropped, so we get double put.  With expected results...
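
If that analysis is right, the sequence reduces to a few lines of userspace C
(names are invented for the sketch; single-threaded, not the driver code):
->release() drops the installation reference whenever a map looks installed,
so a second daemon that dies before installing its own map drops that
reference a second time.

```c
/*
 * Single-threaded model of the suspected double put (hypothetical names;
 * a sketch, not the driver code).  refs counts references to the installed
 * bufmap; installed mimics the "have we got a bufmap?" check in ->release().
 */
#include <assert.h>

static int refs;
static int installed;

static void bufmap_install(void)	/* ioctl on the control device */
{
	installed = 1;
	refs = 1;			/* the installation reference */
}

static void bufmap_get(void)		/* e.g. an I/O path grabbing a slot */
{
	refs++;
}

static int bufmap_put(void)		/* returns the resulting refcount */
{
	return --refs;
}

/* ->release(): drops the installation ref iff a map appears installed. */
static void daemon_release(void)
{
	if (installed)
		bufmap_put();
}
```

Run install, one get from an I/O path, then two releases: the installation
reference goes away twice, and the I/O path's own final put underflows.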

If below ends up fixing the symptoms, analysis above has a good chance to
be correct.  This is no way to wait for rundown, of course - I'm not
suggesting it as the solution, just as a way to narrow down what's going
on.

Incidentally, could you fold the list_del() part into offending commit
(orangefs: delay freeing slot until cancel completes) and repush your
for-next?

diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
index 6a7df12..630246d 100644
--- a/fs/orangefs/devorangefs-req.c
+++ b/fs/orangefs/devorangefs-req.c
@@ -529,6 +529,9 @@ static int orangefs_devreq_release(struct inode *inode, struct file *file)
 	purge_inprogress_ops();
 	gossip_debug(GOSSIP_DEV_DEBUG,
 		     "pvfs2-client-core: device close complete\n");
+	/* VERY CRUDE, NOT FOR MERGE */
+	while (orangefs_get_bufmap_init())
+		schedule_timeout(HZ);
 	open_access_count = 0;
 	mutex_unlock(&devreq_mutex);
 	return 0;
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index 41f8bb1f..1e28555 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -261,6 +261,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
 {
 	spin_lock(&op->lock);
 	if (unlikely(op_is_cancel(op))) {
+		list_del(&op->list);
 		spin_unlock(&op->lock);
 		put_cancel(op);
 	} else {


* Re: Orangefs ABI documentation
  2016-02-14 23:43                                                                                           ` Al Viro
@ 2016-02-15 17:46                                                                                             ` Mike Marshall
  2016-02-15 18:45                                                                                               ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-15 17:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

I pushed the list_del up to the kernel.org for-next branch...

And I've been running tests with the CRUDE bandaid... weird
results...

No oopses, no WARN_ONs... I was running dbench and ls -R
or find and kill-minus-nining different ones of them with no
perceived resulting problems, so I moved on to signalling
the client-core to abort... it restarted numerous times,
and then stuff wedged up differently than I've seen before.

Usually I kill the client-core and it comes back (gets restarted)
as seen by the different PID:

# ps -ef | grep pvfs
root      1292  1185  7 11:39 ?        00:00:01 pvfs2-client-core
--child -a 60000 -n 60000 --logtype file -L /var/log/client.log
# kill -6 1292
# ps -ef | grep pvfs
root      1299  1185  8 11:40 ?        00:00:00 pvfs2-client-core
--child -a 60000 -n 60000 --logtype file -L /var/log/client.log

Until once, it didn't die, and the gorked up unkillable left-over thing's
argv[0] (or wherever this string gets scraped from) was goofy:

# ps -ef | grep pvfs
root      1324  1185  1 11:41 ?        00:00:02 pvfs2-client-core
--child -a 60000 -n 60000 --logtype file -L /var/log/client.log
[root@be1 hubcap]# kill -6 1324
[root@be1 hubcap]# ps -ef | grep pvfs
root      1324  1185  2 11:41 ?        00:00:05 [pvfs2-client-co]

The virtual host was pretty wedged up after that, I couldn't look
at anything interesting, and got a bunch of terminal windows hung
trying:

# strace -f -p 1324
Process 1324 attached
^C

^C^C
                                     .
                     ls -R's output was flowing out here
/pvfsmnt/tdir/z_really_long_disgustingly_long_super_long_file_name52
/pvfsmnt/tdir/z_really_long_disgustingly_long_super_long_file_name53



^C^C^C


[root@logtruck hubcap]# ssh be1
root@be1's password:
Last login: Mon Feb 15 11:33:42 2016 from logtruck.clemson.edu
[root@be1 ~]# df


I still had one functioning window, and looked at dmesg from there,
nothing interesting there... a couple of expected tag WARNINGS while I was
killing finds and dbenches... ioctls that happened during the
successful restarts of the client-core...

[  809.520966] client-core: opening device
[  809.521031] pvfs2-client-core: open device complete (ret = 0)
[  809.521050] dispatch_ioctl_command: client debug mask has been been
received :0: :0:
[  809.521068] dispatch_ioctl_command: client debug array string has
been received.
[  809.521070] orangefs_prepare_debugfs_help_string: start
[  809.521071] orangefs_prepare_cdm_array: start
[  809.521104] orangefs_prepare_cdm_array: rc:50:
[  809.521106] orangefs_prepare_debugfs_help_string: cdm_element_count:50:
[  809.521239] debug_mask_to_string: start
[  809.521242] debug_mask_to_string: string:none:
[  809.521243] orangefs_client_debug_init: start
[  809.521249] orangefs_client_debug_init: rc:0:
[  809.566652] dispatch_ioctl_command: got ORANGEFS_DEV_REMOUNT_ALL
[  809.566667] dispatch_ioctl_command: priority remount in progress
[  809.566668] dispatch_ioctl_command: priority remount complete
[  812.454255] orangefs_debug_open: orangefs_debug_disabled: 0
[  812.454294] orangefs_debug_open: rc: 0
[  812.454320] orangefs_debug_write: kernel-debug
[  812.454323] debug_string_to_mask: start
[  896.410522] WARNING: No one's waiting for tag 15612
[ 1085.339948] WARNING: No one's waiting for tag 127943
[ 1146.820485] orangefs: please confirm that pvfs2-client daemon is running.
[ 1146.820488] fs/orangefs/dir.c line 264: orangefs_readdir:
orangefs_readdir_index_get() failure (-5)
[ 1146.866812] dispatch_ioctl_command: client debug mask has been been
received :0: :0:
[ 1146.866834] dispatch_ioctl_command: client debug array string has
been received.
[ 1175.906800] dispatch_ioctl_command: client debug mask has been been
received :0: :0:
[ 1175.906817] dispatch_ioctl_command: client debug array string has
been received.
[ 1223.915862] dispatch_ioctl_command: client debug mask has been been
received :0: :0:
[ 1223.915880] dispatch_ioctl_command: client debug array string has
been received.
[ 1274.458852] dispatch_ioctl_command: client debug mask has been been
received :0: :0:
[ 1274.458870] dispatch_ioctl_command: client debug array string has
been received.
[root@be1 hubcap]#


ps aux shows every process' state as S except for 1324 which is
racking up time:

[hubcap@be1 ~]$ ps aux | grep pvfs2-client
root      1324 92.4  0.0      0     0 ?        R    11:41  46:29
[pvfs2-client-co]
[hubcap@be1 ~]$ ps aux | grep pvfs2-client
root      1324 92.4  0.0      0     0 ?        R    11:41  46:30
[pvfs2-client-co]

I'll virsh destroy this thing now <g>...

-Mike



On Sun, Feb 14, 2016 at 6:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Feb 14, 2016 at 05:31:10PM -0500, Mike Marshall wrote:
>> I added the list_del...
>>
>> Everything is very resilient, I killed
>> the client-core over and over while dbench
>> was running at the same time as  ls -R
>> was running, and the client-core always
>> restarted... until finally, it didn't. I guess
>> related to the state of just what was going on
>> at the time... Hit the WARN_ON in service_operation,
>> and then oopsed on the orangefs_bufmap_put
>> down at the end of wait_for_direct_io...
>
> Bloody hell...  I think I see what's going on, and presumably the newer
> slot allocator would fix that.  Look: closing control device (== daemon
> death) checks if we have a bufmap installed and drops a reference to
> it in that case.  The reason why it's conditional is that we might have
> not gotten around to installing one (it's done via ioctl on control
> device).  But ->release() does *NOT* wait for all references to go away!
> In other words, it's possible to restart the daemon while the old bufmap
> is still there.  Then have it killed after it has opened control devices
> and before the old bufmap has run down.  For ->release() it looks like
> we *have* gotten around to installing bufmap, and need the reference dropped.
> In reality, the reference acquired when we were installing that one has
> already been dropped, so we get double put.  With expected results...
>
> If below ends up fixing the symptoms, analysis above has a good chance to
> be correct.  This is no way to wait for rundown, of course - I'm not
> suggesting it as the solution, just as a way to narrow down what's going
> on.
>
> Incidentally, could you fold the list_del() part into offending commit
> (orangefs: delay freeing slot until cancel completes) and repush your
> for-next?
>
> diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
> index 6a7df12..630246d 100644
> --- a/fs/orangefs/devorangefs-req.c
> +++ b/fs/orangefs/devorangefs-req.c
> @@ -529,6 +529,9 @@ static int orangefs_devreq_release(struct inode *inode, struct file *file)
>         purge_inprogress_ops();
>         gossip_debug(GOSSIP_DEV_DEBUG,
>                      "pvfs2-client-core: device close complete\n");
> +       /* VERY CRUDE, NOT FOR MERGE */
> +       while (orangefs_get_bufmap_init())
> +               schedule_timeout(HZ);
>         open_access_count = 0;
>         mutex_unlock(&devreq_mutex);
>         return 0;
> diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
> index 41f8bb1f..1e28555 100644
> --- a/fs/orangefs/orangefs-kernel.h
> +++ b/fs/orangefs/orangefs-kernel.h
> @@ -261,6 +261,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
>  {
>         spin_lock(&op->lock);
>         if (unlikely(op_is_cancel(op))) {
> +               list_del(&op->list);
>                 spin_unlock(&op->lock);
>                 put_cancel(op);
>         } else {


* Re: Orangefs ABI documentation
  2016-02-15 17:46                                                                                             ` Mike Marshall
@ 2016-02-15 18:45                                                                                               ` Al Viro
  2016-02-15 22:32                                                                                                 ` Martin Brandenburg
  2016-02-15 22:47                                                                                                 ` Mike Marshall
  0 siblings, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-02-15 18:45 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Mon, Feb 15, 2016 at 12:46:51PM -0500, Mike Marshall wrote:
> I pushed the list_del up to the kernel.org for-next branch...
> 
> And I've been running tests with the CRUDE bandaid... weird
> results...
> 
> No oopses, no WARN_ONs... I was running dbench and ls -R
> or find and kill-minus-nining different ones of them with no
> perceived resulting problems, so I moved on to signalling
> the client-core to abort... it restarted numerous times,
> and then stuff wedged up differently than I've seen before.

There are other problems with that thing (starting with the fact that
retrying readdir/wait_for_direct_io can try to grab a slot despite the
bufmap winding down).  OK, at that point I think we should try to see
if bufmap rewrite works - I've rebased on top of your branch and pushed
(head at 8c3bc9a).  Bufmap rewrite is really completely untested -
it's done pretty much blindly and I'd be surprised as hell if it has no
brainos at the first try.


* Re: Orangefs ABI documentation
  2016-02-15 18:45                                                                                               ` Al Viro
@ 2016-02-15 22:32                                                                                                 ` Martin Brandenburg
  2016-02-15 23:04                                                                                                   ` Al Viro
  2016-02-15 22:47                                                                                                 ` Mike Marshall
  1 sibling, 1 reply; 111+ messages in thread
From: Martin Brandenburg @ 2016-02-15 22:32 UTC (permalink / raw)
  To: Al Viro; +Cc: Mike Marshall, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On 2/15/16, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Feb 15, 2016 at 12:46:51PM -0500, Mike Marshall wrote:
>> I pushed the list_del up to the kernel.org for-next branch...
>>
>> And I've been running tests with the CRUDE bandaid... weird
>> results...
>>
>> No oopses, no WARN_ONs... I was running dbench and ls -R
>> or find and kill-minus-nining different ones of them with no
>> perceived resulting problems, so I moved on to signalling
>> the client-core to abort... it restarted numerous times,
>> and then stuff wedged up differently than I've seen before.
>
> There are other problems with that thing (starting with the fact that
> retrying readdir/wait_for_direct_io can try to grab a slot despite the
> bufmap winding down).  OK, at that point I think we should try to see
> if bufmap rewrite works - I've rebased on top of your branch and pushed
> (head at 8c3bc9a).  Bufmap rewrite is really completely untested -
> it's done pretty much blindly and I'd be surprised as hell if it has no
> brainos at the first try.
>

There's at least one major issue aside from a small typo.

Something that used a slot, such as a reader, would call
service_operation while holding a bufmap. Then the client-core would
crash, and the kernel would get run_down waiting on the slots to be
given up. But the slots are not given up until someone wakes all the
processes waiting in service_operation up, which happens after all the
slots are given up. Then client-core hangs until someone sends a
deadly signal to all the processes waiting in service_operation or
presumably the timeout expires.

This splits finalize and run_down so that orangefs_devreq_release can
mark the slot map as killed, then purge waiting ops, then wait for all
the slots to be released. Meanwhile, processes which were waiting will
get into orangefs_bufmap_get which will see that the slot map is
shutting down and wait for the client-core to come back.
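
The ordering can be reduced to a step model (hypothetical names,
deliberately single-threaded; a sketch of the semantics, not the patch
below): a process holding a slot sits in service_operation until its op is
purged, so run_down() can only finish once the purge has handed the slot
back.

```c
/*
 * Step model of the shutdown ordering (hypothetical names; a sketch, not
 * the patch).  A process holding a slot sits in service_operation until
 * its op is purged; run_down() can only finish once that slot comes back.
 */
#include <assert.h>

enum waiter_state { WAITING_OP, RELEASED };

static enum waiter_state waiter = WAITING_OP;
static int slots_held = 1;	/* the waiter owns one r/w slot */
static int map_killed;

static void mark_killed(void)	/* new slot requests must fail from now on */
{
	map_killed = 1;
}

static int slot_get(void)	/* -1 once the map is shutting down */
{
	return map_killed ? -1 : 0;
}

static void purge_inprogress_ops(void)
{
	waiter = RELEASED;	/* waiter wakes up, errors out... */
	slots_held--;		/* ...and gives its slot back */
}

/* Returns 0 once all slots are free; -1 if it would have blocked forever. */
static int run_down(void)
{
	return slots_held == 0 ? 0 : -1;
}
```

Calling run_down() before purge_inprogress_ops() would block forever in the
model; mark_killed, then purge, then run_down terminates.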

This is all at https://www.github.com/martinbrandenburg/linux.git branch slots.

-- Martin

diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
index d96bcf10..b27ed1c 100644
--- a/fs/orangefs/devorangefs-req.c
+++ b/fs/orangefs/devorangefs-req.c
@@ -513,6 +513,9 @@ static int orangefs_devreq_release(struct inode
*inode, struct file *file)
 	 * them as purged and wake them up
 	 */
 	purge_inprogress_ops();
+
+	orangefs_bufmap_run_down();
+
 	gossip_debug(GOSSIP_DEV_DEBUG,
 		     "pvfs2-client-core: device close complete\n");
 	open_access_count = 0;
diff --git a/fs/orangefs/orangefs-bufmap.c b/fs/orangefs/orangefs-bufmap.c
index 3c6e07c..c544710 100644
--- a/fs/orangefs/orangefs-bufmap.c
+++ b/fs/orangefs/orangefs-bufmap.c
@@ -20,7 +20,7 @@ static struct slot_map rw_map = {
 };
 static struct slot_map readdir_map = {
 	.c = -1,
-	.q = __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q)
+	.q = __WAIT_QUEUE_HEAD_INITIALIZER(readdir_map.q)
 };


@@ -430,6 +430,15 @@ void orangefs_bufmap_finalize(void)
 	gossip_debug(GOSSIP_BUFMAP_DEBUG, "orangefs_bufmap_finalize: called\n");
 	mark_killed(&rw_map);
 	mark_killed(&readdir_map);
+	gossip_debug(GOSSIP_BUFMAP_DEBUG,
+		     "orangefs_bufmap_finalize: exiting normally\n");
+}
+
+void orangefs_bufmap_run_down(void)
+{
+	struct orangefs_bufmap *bufmap = __orangefs_bufmap;
+	if (!bufmap)
+		return;
 	run_down(&rw_map);
 	run_down(&readdir_map);
 	spin_lock(&orangefs_bufmap_lock);
@@ -437,8 +446,6 @@ void orangefs_bufmap_finalize(void)
 	spin_unlock(&orangefs_bufmap_lock);
 	orangefs_bufmap_unmap(bufmap);
 	orangefs_bufmap_free(bufmap);
-	gossip_debug(GOSSIP_BUFMAP_DEBUG,
-		     "orangefs_bufmap_finalize: exiting normally\n");
 }

 /*
diff --git a/fs/orangefs/orangefs-bufmap.h b/fs/orangefs/orangefs-bufmap.h
index ad8d82a..0be62be 100644
--- a/fs/orangefs/orangefs-bufmap.h
+++ b/fs/orangefs/orangefs-bufmap.h
@@ -17,6 +17,8 @@ int orangefs_bufmap_initialize(struct
ORANGEFS_dev_map_desc *user_desc);

 void orangefs_bufmap_finalize(void);

+void orangefs_bufmap_run_down(void);
+
 int orangefs_bufmap_get(struct orangefs_bufmap **mapp, int *buffer_index);

 void orangefs_bufmap_put(int buffer_index);


* Re: Orangefs ABI documentation
  2016-02-15 18:45                                                                                               ` Al Viro
  2016-02-15 22:32                                                                                                 ` Martin Brandenburg
@ 2016-02-15 22:47                                                                                                 ` Mike Marshall
  1 sibling, 0 replies; 111+ messages in thread
From: Mike Marshall @ 2016-02-15 22:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

> Bufmap rewrite is really completely untested -
> it's done pretty much blindly and I'd be surprised as hell if it has no
> brainos at the first try.

You did pretty good, it takes me two tries to get hello world right...

Right off the bat, the kernel crashed, because:

static struct slot_map rw_map = {
        .c = -1,
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q)
};
static struct slot_map readdir_map = {
        .c = -1,
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q)
};                                          ^
                                            |
                                          D'OH!

But after that stuff almost worked...

It can still "sort of" wedge up.

We think that when dbench is running and the client-core is killed, you
can hit orangefs_bufmap_finalize -> mark_killed -> run_down/schedule(),
while those wait_for_completion_* schedules of extant ops in
wait_for_matching_downcall have also given up the processor...

Then... when you interrupt dbench, stuff starts flowing again...

I added a couple of gossip statements inside of mark_killed and
run_down...

Feb 15 16:40:15 be1 kernel: [  349.981597] orangefs_bufmap_finalize: called
Feb 15 16:40:15 be1 kernel: [  349.981600] mark_killed enter
Feb 15 16:40:15 be1 kernel: [  349.981602] mark_killed: leave
Feb 15 16:40:15 be1 kernel: [  349.981603] mark_killed enter
Feb 15 16:40:15 be1 kernel: [  349.981605] mark_killed: leave
Feb 15 16:40:15 be1 kernel: [  349.981606] run_down: enter:-1:
Feb 15 16:40:15 be1 kernel: [  349.981608] run_down: leave
Feb 15 16:40:15 be1 kernel: [  349.981609] run_down: enter:-2:
Feb 15 16:40:15 be1 kernel: [  349.981610] run_down: before schedule:-2:

            stuff just sits here while dbench is still running.
            Then Ctrl-C on dbench and off to the races again.

Feb 15 16:42:28 be1 kernel: [  483.049927] ***
wait_for_matching_downcall: operation interrupted by a signal (tag
16523, op ffff880013418000)
Feb 15 16:42:28 be1 kernel: [  483.049930] Interrupted: Removed op
ffff880013418000 from htable_ops_in_progress
Feb 15 16:42:28 be1 kernel: [  483.049932] orangefs: service_operation
orangefs_inode_getattr returning: -4 for ffff880013418000.
Feb 15 16:42:28 be1 kernel: [  483.050116] ***
wait_for_matching_downcall: operation interrupted by a signal (tag
16518, op ffff8800001a8000)
Feb 15 16:42:28 be1 kernel: [  483.050118] Interrupted: Removed op
ffff8800001a8000 from htable_ops_in_progress
Feb 15 16:42:28 be1 kernel: [  483.050120] orangefs: service_operation
orangefs_inode_getattr returning: -4 for ffff8800001a8000.

Martin already has a patch... What do you think?
I'm headed home for supper...

-Mike

On Mon, Feb 15, 2016 at 1:45 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Feb 15, 2016 at 12:46:51PM -0500, Mike Marshall wrote:
>> I pushed the list_del up to the kernel.org for-next branch...
>>
>> And I've been running tests with the CRUDE bandaid... weird
>> results...
>>
>> No oopses, no WARN_ONs... I was running dbench and ls -R
>> or find and kill-minus-nining different ones of them with no
>> perceived resulting problems, so I moved on to signalling
>> the client-core to abort... it restarted numerous times,
>> and then stuff wedged up differently than I've seen before.
>
> There are other problems with that thing (starting with the fact that
> retrying readdir/wait_for_direct_io can try to grab a slot despite the
> bufmap winding down).  OK, at that point I think we should try to see
> if bufmap rewrite works - I've rebased on top of your branch and pushed
> (head at 8c3bc9a).  Bufmap rewrite is really completely untested -
> it's done pretty much blindly and I'd be surprised as hell if it has no
> brainos at the first try.


* Re: Orangefs ABI documentation
  2016-02-15 22:32                                                                                                 ` Martin Brandenburg
@ 2016-02-15 23:04                                                                                                   ` Al Viro
  2016-02-16 23:15                                                                                                     ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-15 23:04 UTC (permalink / raw)
  To: Martin Brandenburg
  Cc: Mike Marshall, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Mon, Feb 15, 2016 at 05:32:54PM -0500, Martin Brandenburg wrote:

> Something that used a slot, such as a reader, would call
> service_operation while holding a bufmap. Then the client-core would
> crash, and the kernel would get run_down waiting on the slots to be
> given up. But the slots are not given up until someone wakes all the
> processes waiting in service_operation up, which happens after all the
> slots are given up. Then client-core hangs until someone sends a
> deadly signal to all the processes waiting in service_operation or
> presumably the timeout expires.
> 
> This splits finalize and run_down so that orangefs_devreq_release can
> mark the slot map as killed, then purge waiting ops, then wait for all
> the slots to be released. Meanwhile, processes which were waiting will
> get into orangefs_bufmap_get which will see that the slot map is
> shutting down and wait for the client-core to come back.

D'oh.  Yes, that was exactly the point of separating mark_dead and run_down -
the latter should've been done after purging all requests.  Fixes folded,
branch force-pushed.


* Re: [RFC] slot allocator - waitqueue use review needed (Re: Orangefs ABI documentation)
  2016-02-14  3:46                                                                                         ` [RFC] slot allocator - waitqueue use review needed (Re: Orangefs ABI documentation) Al Viro
  2016-02-14  4:06                                                                                           ` Al Viro
@ 2016-02-16  2:12                                                                                           ` Al Viro
  2016-02-16 19:28                                                                                             ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-16  2:12 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel,
	Stephen Rothwell, Ingo Molnar

On Sun, Feb 14, 2016 at 03:46:08AM +0000, Al Viro wrote:
> FWIW, I think I have a kinda-sorta solution for bufmap slot allocation/waiting;
> if somebody has a better idea, I would love to drop the variant below.  And
> I would certainly appreciate review - I hate messing with waitqueue primitives
> and I know how easy it is to fuck those up ;-/

... and in the "easy to fuck up" department, this thing doesn't stop waiting
when it gets a signal *and* is vulnerable to an analogue of the problem
dealt with in commit 777c6c5f1f6e757ae49ecca2ed72d6b1f523c007
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Wed Feb 4 15:12:14 2009 -0800

    wait: prevent exclusive waiter starvation

Suppose we are waiting for a slot when everything's full.  Somebody releases
theirs, and we get woken up, just as we are hit with SIGKILL or time out.
We remove ourselves from the waitqueue and bugger off.  Too bad - everybody
else is going to get stuck until they time out.  If we got the slot, we would've
done wakeup when releasing it.  Since we hadn't, no such wakeup happens...
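
A counting model makes the accounting visible (made-up names; a sketch, not
the waitqueue code): each slot release delivers exactly one wakeup to one
exclusive waiter, and a waiter that eats the wakeup and then bails out on a
signal must pass it along or it is lost.

```c
/*
 * Counting model of the lost wakeup (hypothetical names; a sketch, not the
 * waitqueue code).  Releasing a slot wakes exactly one exclusive waiter;
 * if that waiter bails out on a signal without re-issuing the wakeup, the
 * remaining waiters never run.
 */
#include <assert.h>

#define NWAITERS 3

static int pending_wakeups;	/* wakeups delivered but not yet consumed */
static int asleep = NWAITERS;	/* exclusive waiters still on the queue */

static void release_slot(void)	/* __wake_up with nr_exclusive == 1 */
{
	pending_wakeups++;
}

/*
 * A waiter that consumed a wakeup and is then hit by a signal.  With
 * pass_it_on it behaves like the fixed code and wakes the next waiter.
 */
static void signalled_waiter(int pass_it_on)
{
	if (pending_wakeups > 0) {
		pending_wakeups--;	/* we "ate" the wakeup */
		if (pass_it_on)
			pending_wakeups++;	/* hand it to the next waiter */
	}
	asleep--;			/* we leave the queue either way */
}
```

Without passing it on, the wakeup count drops to zero while waiters are
still asleep; with it, the next exclusive waiter gets the slot, which is
exactly what would have happened had the signal arrived first.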

I wonder if wait_event_interruptible_exclusive_locked{,_irq}() is
vulnerable to the same problem; plain one is used in fuse, irq - in
gadgetfs and the latter looks somewhat fishy in that respect...

orangefs one fixed, folded and pushed, but I hadn't really looked into
fuse and gadgetfs enough to tell if they have similar problems...


* Re: [RFC] slot allocator - waitqueue use review needed (Re: Orangefs ABI documentation)
  2016-02-16  2:12                                                                                           ` Al Viro
@ 2016-02-16 19:28                                                                                             ` Al Viro
  0 siblings, 0 replies; 111+ messages in thread
From: Al Viro @ 2016-02-16 19:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Brandenburg, linux-fsdevel, Stephen Rothwell, Ingo Molnar,
	Mike Marshall

On Tue, Feb 16, 2016 at 02:12:28AM +0000, Al Viro wrote:

> I wonder if wait_event_interruptible_exclusive_locked{,_irq}() is
> vulnerable to the same problem; plain one is used in fuse, irq - in
> gadgetfs and the latter looks somewhat fishy in that respect...
> 
> orangefs one fixed, folded and pushed, but I hadn't really looked into
> fuse and gadgetfs enough to tell if they have similar problems...

fuse_dev_do_read() looks broken - it assumes that there will be an eventual
call of request_end() and we'll issue a wakeup there.  With the switch
to wait_event_interruptible_exclusive_locked() (last summer, AFAICS) that's
no longer true - we'll bugger off with -ERESTARTSYS without waking anyone up,
even if we were the (exclusive) recipient of the wakeup and the signal has
arrived right after that.

I'm looking through gadgetfs, but it seems that the only user of the _irq variant
is also broken in the same way.  It certainly looks like an accident waiting to
happen, especially since the regular variants (sans _locked) have subtly
different semantics there.

Linus, what do you think about the following:

[PATCH] Fix the lost wakeup bug in wait_event_interruptible_exclusive_locked()

wait_event_interruptible_exclusive() used to have an unpleasant problem -
it might have eaten a wakeup, only to be hit by a signal immediately
after that.  In such a case the wakeup wasn't passed on to the next waiter.  That
was fixed in commit 777c6c5 ("wait: prevent exclusive waiter starvation")
back in 2009.  A year later the ..._locked() analogue was introduced with the
exact same problem.

Passing the wakeup further in such a case (exclusive waiter, signal caught after
we'd already eaten a wakeup) is always legitimate - that is exactly what would've
happened had we caught the signal first: __wake_up_common() skips
decrementing nr_exclusive when try_to_wake_up() returns false.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 513b36f..675ff32 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -575,6 +575,9 @@ do {									\
 			__add_wait_queue_tail(&(wq), &__wait);		\
 		set_current_state(TASK_INTERRUPTIBLE);			\
 		if (signal_pending(current)) {				\
+			if (exclusive && list_empty(&__wait.task_list))	\
+				__wake_up_locked_key(&(wq),		\
+					TASK_INTERRUPTIBLE, NULL);	\
 			__ret = -ERESTARTSYS;				\
 			break;						\
 		}							\

* Re: Orangefs ABI documentation
  2016-02-15 23:04                                                                                                   ` Al Viro
@ 2016-02-16 23:15                                                                                                     ` Mike Marshall
  2016-02-16 23:36                                                                                                       ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-16 23:15 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

This thing is invulnerable now!

Nothing hangs when I kill the client-core, and the client-core
always restarts.

Sometimes, if you hit it right with a kill while dbench is running,
a file create will fail.

I've been trying to trace down why all day, in case there's
something that can be done...

Here's what I see:

  orangefs_create
    service_operation
      wait_for_matching_downcall purges op and returns -EAGAIN
      orangefs_clean_up_interrupted_operation
      if (EAGAIN)
        ...
        goto retry_servicing
      wait_for_matching_downcall returns 0
    service_operation returns 0
  orangefs_create has good return value from service_operation

   op->khandle: 00000000-0000-0000-0000-000000000000
   op->fs_id: 0

   subsequent getattr on bogus object fails orangefs_create on EINVAL.

   seems like the second time around, wait_for_matching_downcall
   must have seen op_state_serviced, but I don't see how yet...

I pushed the new patches out to

gitolite.kernel.org:pub/scm/linux/kernel/git/hubcap/linux
for-next

I made a couple of additional patches that make it easier to read
the flow of gossip statements, and also removed a few lines of vestigial
ASYNC code.

-Mike

On Mon, Feb 15, 2016 at 6:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Feb 15, 2016 at 05:32:54PM -0500, Martin Brandenburg wrote:
>
>> Something that used a slot, such as reader, would call
>> service_operation while holding a bufmap. Then the client-core would
>> crash, and the kernel would get run_down waiting on the slots to be
>> given up. But the slots are not given up until someone wakes all the
>> processes waiting in service_operation up, which happens after all the
>> slots are given up. Then client-core hangs until someone sends a
>> deadly signal to all the processes waiting in service_operation or
>> presumably the timeout expires.
>>
>> This splits finalize and run_down so that orangefs_devreq_release can
>> mark the slot map as killed, then purge waiting ops, then wait for all
>> the slots to be released. Meanwhile, processes which were waiting will
>> get into orangefs_bufmap_get which will see that the slot map is
>> shutting down and wait for the client-core to come back.
>
> D'oh.  Yes, that was exactly the point of separating mark_dead and run_down -
> the latter should've been done after purging all requests.  Fixes folded,
> branch force-pushed.

* Re: Orangefs ABI documentation
  2016-02-16 23:15                                                                                                     ` Mike Marshall
@ 2016-02-16 23:36                                                                                                       ` Al Viro
  2016-02-16 23:54                                                                                                         ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-16 23:36 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Tue, Feb 16, 2016 at 06:15:56PM -0500, Mike Marshall wrote:
> Here's what I see:
> 
>   orangefs_create
>     service_operation
>       wait_for_matching_downcall purges op and returns -EAGAIN
>       orangefs_clean_up_interrupted_operation
>       if (EAGAIN)
>         ...
>         goto retry_servicing
>       wait_for_matching_downcall returns 0
>     service_operation returns 0
>   orangefs_create has good return value from service_operation
> 
>    op->khandle: 00000000-0000-0000-0000-000000000000
>    op->fs_id: 0
> 
>    subsequent getattr on bogus object fails orangefs_create on EINVAL.
> 
>    seems like the second time around, wait_for_matching_downcall
>    must have seen op_state_serviced, but I don't see how yet...

I strongly suspect that this is what's missing.  Could you check if it helps?

diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c
index 2539813..36eedd6 100644
--- a/fs/orangefs/waitqueue.c
+++ b/fs/orangefs/waitqueue.c
@@ -244,6 +244,7 @@ static void orangefs_clean_up_interrupted_operation(struct orangefs_kernel_op_s
 		gossip_err("%s: can't get here.\n", __func__);
 		spin_unlock(&op->lock);
 	}
+	reinit_completion(&op->waitq);
 }
 
 /*

* Re: Orangefs ABI documentation
  2016-02-16 23:36                                                                                                       ` Al Viro
@ 2016-02-16 23:54                                                                                                         ` Al Viro
  2016-02-17 19:24                                                                                                           ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-16 23:54 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Tue, Feb 16, 2016 at 11:36:09PM +0000, Al Viro wrote:

> I strongly suspect that this is what's missing.  Could you check if it helps?

BTW, I've pushed #orangefs-untested; the differences are
	* several fixes folded back into relevant commits (no point creating
bisect hazards and making the history harder to read)
	* this thing also folded (and in fact is what you get from diffing
that tree with your for-next)
	* commit message for bufmap rewrite supplied.

Could you (assuming it works, etc.) switch your branch to that?
git checkout for-next, git fetch from vfs.git, git diff FETCH_HEAD to
verify that the delta is as it should be, then git reset --hard FETCH_HEAD
and git push --force <whatever remote you are using> for-next would do it...

* Re: Orangefs ABI documentation
  2016-02-16 23:54                                                                                                         ` Al Viro
@ 2016-02-17 19:24                                                                                                           ` Mike Marshall
  2016-02-17 20:11                                                                                                             ` Al Viro
  2016-02-17 22:40                                                                                                             ` Martin Brandenburg
  0 siblings, 2 replies; 111+ messages in thread
From: Mike Marshall @ 2016-02-17 19:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

It is still busted, I've been trying to find clues as to why...

Maybe this is relevant:

Alloced OP ffff880015698000 <- doomed op for orangefs_create MAILBOX2.CPT
service_operation: orangefs_create op ffff880015698000
ffff880015698000 got past is_daemon_in_service

... lots of stuff ...

w_f_m_d returned -11 for ffff880015698000 <- first op to get EAGAIN

first client core is NOT in service
second op to get EAGAIN
          ...
last client core is NOT in service

... lots of stuff ...

service_operation returns to orangef_create with handle 0 fsid 0 ret 0
for MAILBOX2.CPT

I'm guessing you want me to wait to do the switching of my branch
until we fix this (last?) thing, let me know...

-Mike


On Tue, Feb 16, 2016 at 6:54 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 16, 2016 at 11:36:09PM +0000, Al Viro wrote:
>
>> I strongly suspect that this is what's missing.  Could you check if it helps?
>
> BTW, I've pushed #orangefs-untested; the differences are
>         * several fixes folded back into relevant commits (no point creating
> bisect hazards and making the history harder to read)
>         * this thing also folded (and in fact is what you get from diffing
> that tree with your for-next)
>         * commit message for bufmap rewrite supplied.
>
> Could you (assuming it works, etc.) switch your branch to that?
> git checkout for-next, git fetch from vfs.git, git diff FETCH_HEAD to
> verify that the delta is as it should be, then git reset --hard FETCH_HEAD
> and git push --force <whatever remote you are using> for-next would do it...

* Re: Orangefs ABI documentation
  2016-02-17 19:24                                                                                                           ` Mike Marshall
@ 2016-02-17 20:11                                                                                                             ` Al Viro
  2016-02-17 21:17                                                                                                               ` Al Viro
  2016-02-17 22:40                                                                                                             ` Martin Brandenburg
  1 sibling, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-17 20:11 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Wed, Feb 17, 2016 at 02:24:34PM -0500, Mike Marshall wrote:
> It is still busted, I've been trying to find clues as to why...

With reinit_completion() added?

> Maybe this is relevant:
> 
> Alloced OP ffff880015698000 <- doomed op for orangefs_create MAILBOX2.CPT
> service_operation: orangefs_create op ffff880015698000
> ffff880015698000 got past is_daemon_in_service
> 
> ... lots of stuff ...
> 
> w_f_m_d returned -11 for ffff880015698000 <- first op to get EAGAIN
> 
> first client core is NOT in service
> second op to get EAGAIN
>           ...
> last client core is NOT in service
> 
> ... lots of stuff ...
> 
> service_operation returns to orangef_create with handle 0 fsid 0 ret 0
> for MAILBOX2.CPT
> 
> I'm guessing you want me to wait to do the switching of my branch
> until we fix this (last?) thing, let me know...

What I'd like to check is the value of op->waitq.done at retry_servicing.
If we get there with a non-zero value, we've a problem.  BTW, do you
hit any of gossip_err() in orangefs_clean_up_interrupted_operation()?

* Re: Orangefs ABI documentation
  2016-02-17 20:11                                                                                                             ` Al Viro
@ 2016-02-17 21:17                                                                                                               ` Al Viro
  2016-02-17 22:24                                                                                                                 ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-17 21:17 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Wed, Feb 17, 2016 at 08:11:20PM +0000, Al Viro wrote:
> With reinit_completion() added?
> 
> > Maybe this is relevant:
> > 
> > Alloced OP ffff880015698000 <- doomed op for orangefs_create MAILBOX2.CPT
> > service_operation: orangefs_create op ffff880015698000
> > ffff880015698000 got past is_daemon_in_service
> > 
> > ... lots of stuff ...
> > 
> > w_f_m_d returned -11 for ffff880015698000 <- first op to get EAGAIN
> > 
> > first client core is NOT in service
> > second op to get EAGAIN
> >           ...
> > last client core is NOT in service
> > 
> > ... lots of stuff ...
> > 
> > service_operation returns to orangef_create with handle 0 fsid 0 ret 0
> > for MAILBOX2.CPT
> > 
> > I'm guessing you want me to wait to do the switching of my branch
> > until we fix this (last?) thing, let me know...
> 
> What I'd like to check is the value of op->waitq.done at retry_servicing.
> If we get there with a non-zero value, we've a problem.  BTW, do you
> hit any of gossip_err() in orangefs_clean_up_interrupted_operation()?

Am I right assuming that you are seeing zero from service_operation()
for ORANGEFS_VFS_OP_CREATE with zeroed ->downcall.resp.create.refn?
AFAICS, that can only happen with wait_for_matching_downcall() returning
0, which means op_state_serviced(op).  And prior to that we had
set_op_state_waiting(op), which means set_op_state_serviced(op) done between
those two...

So unless you have something really nasty going on (buggered op refcounting,
memory corruption, etc.), we had orangefs_devreq_write_iter() pick that
op and copy the entire op->downcall from userland.  With op->downcall.status
not set to -EFAULT, or we would've...  Oh, shit - it's fed through
orangefs_normalize_to_errno().  OTOH, it would've hit
gossip_err("orangefs: orangefs_normalize_to_errno: got error code which is not
from ORANGEFS.\n") and presumably you would've noticed that.  And it still
would've returned that -EFAULT intact, actually.  In any case, it's a bug -
we need
	ret = -(ORANGEFS_ERROR_BIT|9);
	goto Broken;
instead of those
	ret = -EFAULT;
	goto Broken;
and
	ret = -(ORANGEFS_ERROR_BIT|8);
	goto Broken;
instead of
	ret = -ENOMEM;
	goto Broken;
in orangefs_devreq_write_iter().

Anyway, it looks like we didn't go through Broken: in there, so copy_from_user()
must've been successful.

Hmm...  Could the daemon have really given that reply?  Relevant test would
be something like
	WARN_ON(op->upcall.type == ORANGEFS_VFS_OP_CREATE &&
	    !op->downcall.resp.create.refn.fsid);
right after
wakeup:
        /*
         * tell the vfs op waiting on a waitqueue
         * that this op is done
         */
        spin_lock(&op->lock);
        if (unlikely(op_state_given_up(op))) {
                spin_unlock(&op->lock);
                goto out;
        }
        set_op_state_serviced(op);

and plain WARN_ON(1) right after Broken:

If that op goes through either of those set_op_state_serviced() with that
zero refn.fsid, we'll see a WARN_ON triggered...

* Re: Orangefs ABI documentation
  2016-02-17 21:17                                                                                                               ` Al Viro
@ 2016-02-17 22:24                                                                                                                 ` Mike Marshall
  0 siblings, 0 replies; 111+ messages in thread
From: Mike Marshall @ 2016-02-17 22:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

> with reinit_completion() added?

In the right place, even...

> Am I right assuming that you are seeing zero from service_operation()
> for ORANGEFS_VFS_OP_CREATE with zeroed ->downcall.resp.create.refn?
> AFAICS, that can only happen with wait_for_matching_downcall() returning
> 0, which means op_state_serviced(op).

Yes sir:

Feb 17 12:25:15 be1 kernel: [  857.901225] service_operation:
wait_for_matching_downcall returned 0 for ffff880015698000
Feb 17 12:25:15 be1 kernel: [  857.901228] orangefs: service_operation
orangefs_create returning: 0 for ffff880015698000.
Feb 17 12:25:15 be1 kernel: [  857.901230] orangefs_create:
MAILBOX2.CPT: handle:00000000-0000-0000-0000-000000000000: fsid:0:
new_op:ffff880015698000: ret:0:

I'll get those WARN_ONs in there and do some more
tests.

Martin thinks he might be on to a solution... he'll
probably send a response soon... I can hear him typing
at 500 characters a minute a few cubes away...

-Mike

On Wed, Feb 17, 2016 at 4:17 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Feb 17, 2016 at 08:11:20PM +0000, Al Viro wrote:
>> With reinit_completion() added?
>>
>> > Maybe this is relevant:
>> >
>> > Alloced OP ffff880015698000 <- doomed op for orangefs_create MAILBOX2.CPT
>> > service_operation: orangefs_create op ffff880015698000
>> > ffff880015698000 got past is_daemon_in_service
>> >
>> > ... lots of stuff ...
>> >
>> > w_f_m_d returned -11 for ffff880015698000 <- first op to get EAGAIN
>> >
>> > first client core is NOT in service
>> > second op to get EAGAIN
>> >           ...
>> > last client core is NOT in service
>> >
>> > ... lots of stuff ...
>> >
>> > service_operation returns to orangef_create with handle 0 fsid 0 ret 0
>> > for MAILBOX2.CPT
>> >
>> > I'm guessing you want me to wait to do the switching of my branch
>> > until we fix this (last?) thing, let me know...
>>
>> What I'd like to check is the value of op->waitq.done at retry_servicing.
>> If we get there with a non-zero value, we've a problem.  BTW, do you
>> hit any of gossip_err() in orangefs_clean_up_interrupted_operation()?
>
> Am I right assuming that you are seeing zero from service_operation()
> for ORANGEFS_VFS_OP_CREATE with zeroed ->downcall.resp.create.refn?
> AFAICS, that can only happen with wait_for_matching_downcall() returning
> 0, which means op_state_serviced(op).  And prior to that we had
> set_op_state_waiting(op), which means set_op_state_serviced(op) done between
> those two...
>
> So unless you have something really nasty going on (buggered op refcounting,
> memory corruption, etc.), we had orangefs_devreq_write_iter() pick that
> op and copy the entire op->downcall from userland.  With op->downcall.status
> not set to -EFAULT, or we would've...  Oh, shit - it's fed through
> orangefs_normalize_to_errno().  OTOH, it would've hit
> gossip_err("orangefs: orangefs_normalize_to_errno: got error code which is not
> from ORANGEFS.\n") and presumably you would've noticed that.  And it still
> would've returned that -EFAULT intact, actually.  In any case, it's a bug -
> we need
>         ret = -(ORANGEFS_ERROR_BIT|9);
>         goto Broken;
> instead of those
>         ret = -EFAULT;
>         goto Broken;
> and
>         ret = -(ORANGEFS_ERROR_BIT|8);
>         goto Broken;
> instead of
>         ret = -ENOMEM;
>         goto Broken;
> in orangefs_devreq_write_iter().
>
> Anyway, it looks like we didn't go through Broken: in there, so copy_from_user()
> must've been successful.
>
> Hmm...  Could the daemon have really given that reply?  Relevant test would
> be something like
>         WARN_ON(op->upcall.type == ORANGEFS_VFS_OP_CREATE &&
>             !op->downcall.resp.create.refn.fsid);
> right after
> wakeup:
>         /*
>          * tell the vfs op waiting on a waitqueue
>          * that this op is done
>          */
>         spin_lock(&op->lock);
>         if (unlikely(op_state_given_up(op))) {
>                 spin_unlock(&op->lock);
>                 goto out;
>         }
>         set_op_state_serviced(op);
>
> and plain WARN_ON(1) right after Broken:
>
> If that op goes through either of those set_op_state_serviced() with that
> zero refn.fsid, we'll see a WARN_ON triggered...

* Re: Orangefs ABI documentation
  2016-02-17 19:24                                                                                                           ` Mike Marshall
  2016-02-17 20:11                                                                                                             ` Al Viro
@ 2016-02-17 22:40                                                                                                             ` Martin Brandenburg
  2016-02-17 23:09                                                                                                               ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Martin Brandenburg @ 2016-02-17 22:40 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Al Viro, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Wed, 17 Feb 2016, Mike Marshall wrote:

> It is still busted, I've been trying to find clues as to why...
> 
> Maybe this is relevant:
> 
> Alloced OP ffff880015698000 <- doomed op for orangefs_create MAILBOX2.CPT
> service_operation: orangefs_create op ffff880015698000
> ffff880015698000 got past is_daemon_in_service
> 
> ... lots of stuff ...
> 
> w_f_m_d returned -11 for ffff880015698000 <- first op to get EAGAIN
> 
> first client core is NOT in service
> second op to get EAGAIN
>           ...
> last client core is NOT in service
> 
> ... lots of stuff ...
> 
> service_operation returns to orangef_create with handle 0 fsid 0 ret 0
> for MAILBOX2.CPT
> 
> I'm guessing you want me to wait to do the switching of my branch
> until we fix this (last?) thing, let me know...
> 
> -Mike

I think I've identified something screwy.

Some process creates a file. Eventually we get into
wait_for_matching_downcall with no client-core. W_f_m_d
returns EAGAIN and op->lock is held. The op is still
waiting and in orangefs_request_list. Service_operation
calls orangefs_clean_up_interrupted_operation, which
attempts to remove the op from orangefs_request_list.

Meanwhile the client-core comes back and does a read.
W_f_m_d has returned EAGAIN, but the op is still in
orangefs_request_list, so it gets passed to the
client-core. Now the op is in service and in
htable_ops_in_progress.

But service_operation is about to retry it under the
impression that it was purged. So it puts the op back
in orangefs_request_list.

Then the client-core returns the op, so it is marked
serviced and returned to orangefs_inode_create.
Meanwhile something or other (great theory right? now
I'm less sure) happens with the second request (they
have the same tag) causing it to become corrupted.

I admit it starts to fall apart at the end, and I don't
have a clear theory on how this produces what we see.

In orangefs_clean_up_interrupted_operation

	if (op_state_waiting(op)) {
		/*
		 * upcall hasn't been read; remove op from upcall request
		 * list.
		 */
		spin_unlock(&op->lock);

		/* HERE */

		spin_lock(&orangefs_request_list_lock);
		list_del(&op->list);
		spin_unlock(&orangefs_request_list_lock);
		gossip_debug(GOSSIP_WAIT_DEBUG,
			     "Interrupted: Removed op %p from request_list\n",
			     op);
	} else if (op_state_in_progress(op)) {

and orangefs_devreq_read

restart:
	/* Get next op (if any) from top of list. */
	spin_lock(&orangefs_request_list_lock);
	list_for_each_entry_safe(op, temp, &orangefs_request_list, list) {
		__s32 fsid;
		/* This lock is held past the end of the loop when we break. */

		/* HERE */

		spin_lock(&op->lock);
		if (unlikely(op_state_purged(op))) {
			spin_unlock(&op->lock);
			continue;
		}

I think both processes can end up working on the same
op.

-- Martin

* Re: Orangefs ABI documentation
  2016-02-17 22:40                                                                                                             ` Martin Brandenburg
@ 2016-02-17 23:09                                                                                                               ` Al Viro
  2016-02-17 23:15                                                                                                                 ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-17 23:09 UTC (permalink / raw)
  To: Martin Brandenburg
  Cc: Mike Marshall, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Wed, Feb 17, 2016 at 05:40:08PM -0500, Martin Brandenburg wrote:

> In orangefs_clean_up_interrupted_operation
> 
> 	if (op_state_waiting(op)) {
> 		/*
> 		 * upcall hasn't been read; remove op from upcall request
> 		 * list.
> 		 */
> 		spin_unlock(&op->lock);
> 
> 		/* HERE */
> 
> 		spin_lock(&orangefs_request_list_lock);
> 		list_del(&op->list);
> 		spin_unlock(&orangefs_request_list_lock);

Hmm...  We'd already marked it as given up, though.  Before dropping op->lock.

> 		gossip_debug(GOSSIP_WAIT_DEBUG,
> 			     "Interrupted: Removed op %p from request_list\n",
> 			     op);
> 	} else if (op_state_in_progress(op)) {
> 
> and orangefs_devreq_read
> 
> restart:
> 	/* Get next op (if any) from top of list. */
> 	spin_lock(&orangefs_request_list_lock);
> 	list_for_each_entry_safe(op, temp, &orangefs_request_list, list) {
> 		__s32 fsid;
> 		/* This lock is held past the end of the loop when we break. */
> 
> 		/* HERE */
> 
> 		spin_lock(&op->lock);
> 		if (unlikely(op_state_purged(op))) {
> 			spin_unlock(&op->lock);
> 			continue;
> 		}
> 
> I think both processes can end up working on the same
> op.

It can be picked up.  And then we'll run into

        if (unlikely(op_state_given_up(cur_op))) {
                spin_unlock(&cur_op->lock);
                spin_unlock(&htable_ops_in_progress_lock);
                op_release(cur_op);
                goto restart;

Oh, I see...  OK, yes - by the time we get to that check, the sucker has
already been resubmitted into the list, so the "given up" flag
is lost.

Hrm...  The obvious approach is to at least avoid taking it off the list
if it's given up - i.e. turn
                __s32 fsid;
                /* This lock is held past the end of the loop when we break. */
                spin_lock(&op->lock);
                if (unlikely(op_state_purged(op))) {
into
                __s32 fsid;
                /* This lock is held past the end of the loop when we break. */
                spin_lock(&op->lock);
                if (unlikely(op_state_purged(op) || op_state_given_up(op))) {
a bit before that point.

However, that doesn't prevent all unpleasantness here - giving up just as
it's being copied to userland and going into restart.  Ho-hum...  How about
the following:
	* move increment of op->attempts into the same place where we
set "given up"
	* in addition to the check for "given up" in the request-picking loop
(as above), fetch op->attempts before dropping op->lock
	* after having retaken op->lock (after copy_to_user()) recheck
op->attempts instead of checking for "given up".

IOW, something like this:

diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
index b27ed1c..1938d55 100644
--- a/fs/orangefs/devorangefs-req.c
+++ b/fs/orangefs/devorangefs-req.c
@@ -109,6 +109,7 @@ static ssize_t orangefs_devreq_read(struct file *file,
 	static __s32 magic = ORANGEFS_DEVREQ_MAGIC;
 	struct orangefs_kernel_op_s *cur_op = NULL;
 	unsigned long ret;
+	int attempts;
 
 	/* We do not support blocking IO. */
 	if (!(file->f_flags & O_NONBLOCK)) {
@@ -133,7 +134,7 @@ restart:
 		__s32 fsid;
 		/* This lock is held past the end of the loop when we break. */
 		spin_lock(&op->lock);
-		if (unlikely(op_state_purged(op))) {
+		if (unlikely(op_state_purged(op) || op_state_given_up(op))) {
 			spin_unlock(&op->lock);
 			continue;
 		}
@@ -207,6 +208,7 @@ restart:
 	list_del_init(&cur_op->list);
 	get_op(op);
 	spin_unlock(&orangefs_request_list_lock);
+	attempts = op->attempts;
 
 	spin_unlock(&cur_op->lock);
 
@@ -227,7 +229,8 @@ restart:
 
 	spin_lock(&htable_ops_in_progress_lock);
 	spin_lock(&cur_op->lock);
-	if (unlikely(op_state_given_up(cur_op))) {
+	if (unlikely(cur_op->attempts != attempts)) {
+		/* given up just as we copied to userland */
 		spin_unlock(&cur_op->lock);
 		spin_unlock(&htable_ops_in_progress_lock);
 		op_release(cur_op);
diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c
index d980240..cc43ac8 100644
--- a/fs/orangefs/waitqueue.c
+++ b/fs/orangefs/waitqueue.c
@@ -139,7 +139,6 @@ retry_servicing:
 	op->downcall.status = ret;
 	/* retry if operation has not been serviced and if requested */
 	if (ret == -EAGAIN) {
-		op->attempts++;
 		timeout = op_timeout_secs * HZ;
 		gossip_debug(GOSSIP_WAIT_DEBUG,
 			     "orangefs: tag %llu (%s)"
@@ -208,6 +207,7 @@ static void orangefs_clean_up_interrupted_operation(struct orangefs_kernel_op_s
 	 * Called with op->lock held.
 	 */
 	op->op_state |= OP_VFS_STATE_GIVEN_UP;
+	op->attempts++;
 
 	if (op_state_waiting(op)) {
 		/*

* Re: Orangefs ABI documentation
  2016-02-17 23:09                                                                                                               ` Al Viro
@ 2016-02-17 23:15                                                                                                                 ` Al Viro
  2016-02-18  0:04                                                                                                                   ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-17 23:15 UTC (permalink / raw)
  To: Martin Brandenburg
  Cc: Mike Marshall, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Wed, Feb 17, 2016 at 11:09:00PM +0000, Al Viro wrote:

> However, that doesn't prevent all unpleasantness here - giving up just as
> it's being copied to userland and going into restart.  Ho-hum...  How about
> the following:
> 	* move increment of op->attempts into the same place where we
> set "given up"
> 	* in addition to the check for "given up" in the request-picking loop
> (as above), fetch op->attempts before dropping op->lock
> 	* after having retaken op->lock (after copy_to_user()) recheck
> op->attempts instead of checking for "given up".

	Crap...  There's a similar problem on the other end - in
orangefs_devreq_write_iter() between the time when op has been fetched
from the hash and the time we finish copying reply from userland.  Same
kind of "what it clear_... gets all way through and we resubmit it before
we get around to checking the given_up flag" problem...

	Let me think a bit...

* Re: Orangefs ABI documentation
  2016-02-17 23:15                                                                                                                 ` Al Viro
@ 2016-02-18  0:04                                                                                                                   ` Al Viro
  2016-02-18 11:11                                                                                                                     ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-18  0:04 UTC (permalink / raw)
  To: Martin Brandenburg
  Cc: Mike Marshall, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Wed, Feb 17, 2016 at 11:15:24PM +0000, Al Viro wrote:
> On Wed, Feb 17, 2016 at 11:09:00PM +0000, Al Viro wrote:
> 
> > However, that doesn't prevent all unpleasantness here - giving up just as
> > it's being copied to userland and going into restart.  Ho-hum...  How about
> > the following:
> > 	* move increment of op->attempts into the same place where we
> > set "given up"
> > 	* in addition to the check for "given up" in the request-picking loop
> > (as above), fetch op->attempts before dropping op->lock
> > 	* after having retaken op->lock (after copy_to_user()) recheck
> > op->attempts instead of checking for "given up".
> 
> 	Crap...  There's a similar problem on the other end - in
> orangefs_devreq_write_iter() between the time when op has been fetched
> from the hash and the time we finish copying reply from userland.  Same
> kind of "what if clear_... gets all the way through and we resubmit it before
> we get around to checking the given_up flag" problem...
> 
> 	Let me think a bit...

Looks like the right approach is to have orangefs_clean_... hitting the
sucker being copied to/from daemon to wait until that's finished (and
discarded).  That, BTW, would have an extra benefit of making life simpler
for refcounting.

So...  We need to have them marked as "being copied" for the duration, instead
of bumping the refcount.  That setting and dropping that flag should happen
under op->lock.  Setting it should happen only if it's not given up (that would
be interpreted as "not found").  Cleaning, OTOH, would recheck the "given up"
and do complete(&op->waitq) in case it's been given up...

How about this (instead of the previous variant, includes a fix for
errno bogosity spotted a bit upthread; if it works, it'll need a bit of
splitup)

diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
index b27ed1c..bb7ff9b 100644
--- a/fs/orangefs/devorangefs-req.c
+++ b/fs/orangefs/devorangefs-req.c
@@ -58,9 +58,10 @@ static struct orangefs_kernel_op_s *orangefs_devreq_remove_op(__u64 tag)
 				 next,
 				 &htable_ops_in_progress[index],
 				 list) {
-		if (op->tag == tag && !op_state_purged(op)) {
+		if (op->tag == tag && !op_state_purged(op) &&
+		    !op_state_given_up(op)) {
 			list_del_init(&op->list);
-			get_op(op); /* increase ref count. */
+			op->op_state |= OP_VFS_STATE_COPYING;
 			spin_unlock(&htable_ops_in_progress_lock);
 			return op;
 		}
@@ -133,7 +134,7 @@ restart:
 		__s32 fsid;
 		/* This lock is held past the end of the loop when we break. */
 		spin_lock(&op->lock);
-		if (unlikely(op_state_purged(op))) {
+		if (unlikely(op_state_purged(op) || op_state_given_up(op))) {
 			spin_unlock(&op->lock);
 			continue;
 		}
@@ -205,7 +206,7 @@ restart:
 		return -EAGAIN;
 	}
 	list_del_init(&cur_op->list);
-	get_op(op);
+	op->op_state |= OP_VFS_STATE_COPYING;
 	spin_unlock(&orangefs_request_list_lock);
 
 	spin_unlock(&cur_op->lock);
@@ -227,10 +228,11 @@ restart:
 
 	spin_lock(&htable_ops_in_progress_lock);
 	spin_lock(&cur_op->lock);
+	cur_op->op_state &= ~OP_VFS_STATE_COPYING;
 	if (unlikely(op_state_given_up(cur_op))) {
 		spin_unlock(&cur_op->lock);
 		spin_unlock(&htable_ops_in_progress_lock);
-		op_release(cur_op);
+		complete(&cur_op->waitq);
 		goto restart;
 	}
 
@@ -242,7 +244,6 @@ restart:
 	orangefs_devreq_add_op(cur_op);
 	spin_unlock(&cur_op->lock);
 	spin_unlock(&htable_ops_in_progress_lock);
-	op_release(cur_op);
 
 	/* The client only asks to read one size buffer. */
 	return MAX_DEV_REQ_UPSIZE;
@@ -255,13 +256,16 @@ error:
 	gossip_err("orangefs: Failed to copy data to user space\n");
 	spin_lock(&orangefs_request_list_lock);
 	spin_lock(&cur_op->lock);
+	cur_op->op_state &= ~OP_VFS_STATE_COPYING;
 	if (likely(!op_state_given_up(cur_op))) {
 		set_op_state_waiting(cur_op);
 		list_add(&cur_op->list, &orangefs_request_list);
+		spin_unlock(&cur_op->lock);
+	} else {
+		spin_unlock(&cur_op->lock);
+		complete(&cur_op->waitq);
 	}
-	spin_unlock(&cur_op->lock);
 	spin_unlock(&orangefs_request_list_lock);
-	op_release(cur_op);
 	return -EFAULT;
 }
 
@@ -333,8 +337,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 	n = copy_from_iter(&op->downcall, downcall_size, iter);
 	if (n != downcall_size) {
 		gossip_err("%s: failed to copy downcall.\n", __func__);
-		ret = -EFAULT;
-		goto Broken;
+		goto Efault;
 	}
 
 	if (op->downcall.status)
@@ -354,8 +357,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 			   downcall_size,
 			   op->downcall.trailer_size,
 			   total);
-		ret = -EFAULT;
-		goto Broken;
+		goto Efault;
 	}
 
 	/* Only READDIR operations should have trailers. */
@@ -364,8 +366,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 		gossip_err("%s: %x operation with trailer.",
 			   __func__,
 			   op->downcall.type);
-		ret = -EFAULT;
-		goto Broken;
+		goto Efault;
 	}
 
 	/* READDIR operations should always have trailers. */
@@ -374,8 +375,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 		gossip_err("%s: %x operation with no trailer.",
 			   __func__,
 			   op->downcall.type);
-		ret = -EFAULT;
-		goto Broken;
+		goto Efault;
 	}
 
 	if (op->downcall.type != ORANGEFS_VFS_OP_READDIR)
@@ -386,8 +386,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 	if (op->downcall.trailer_buf == NULL) {
 		gossip_err("%s: failed trailer vmalloc.\n",
 			   __func__);
-		ret = -ENOMEM;
-		goto Broken;
+		goto Enomem;
 	}
 	memset(op->downcall.trailer_buf, 0, op->downcall.trailer_size);
 	n = copy_from_iter(op->downcall.trailer_buf,
@@ -396,8 +395,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 	if (n != op->downcall.trailer_size) {
 		gossip_err("%s: failed to copy trailer.\n", __func__);
 		vfree(op->downcall.trailer_buf);
-		ret = -EFAULT;
-		goto Broken;
+		goto Efault;
 	}
 
 wakeup:
@@ -406,38 +404,28 @@ wakeup:
 	 * that this op is done
 	 */
 	spin_lock(&op->lock);
-	if (unlikely(op_state_given_up(op))) {
+	op->op_state &= ~OP_VFS_STATE_COPYING;
+	if (unlikely(op_is_cancel(op))) {
 		spin_unlock(&op->lock);
-		goto out;
-	}
-	set_op_state_serviced(op);
-	spin_unlock(&op->lock);
-
-	/*
-	 * If this operation is an I/O operation we need to wait
-	 * for all data to be copied before we can return to avoid
-	 * buffer corruption and races that can pull the buffers
-	 * out from under us.
-	 *
-	 * Essentially we're synchronizing with other parts of the
-	 * vfs implicitly by not allowing the user space
-	 * application reading/writing this device to return until
-	 * the buffers are done being used.
-	 */
-out:
-	if (unlikely(op_is_cancel(op)))
 		put_cancel(op);
-	op_release(op);
-	return ret;
-
-Broken:
-	spin_lock(&op->lock);
-	if (!op_state_given_up(op)) {
-		op->downcall.status = ret;
+	} else if (unlikely(op_state_given_up(op))) {
+		spin_unlock(&op->lock);
+		complete(&op->waitq);
+	} else {
 		set_op_state_serviced(op);
+		spin_unlock(&op->lock);
 	}
-	spin_unlock(&op->lock);
-	goto out;
+	return ret;
+
+Efault:
+	op->downcall.status = -(ORANGEFS_ERROR_BIT | 9);
+	ret = -EFAULT;
+	goto wakeup;
+
+Enomem:
+	op->downcall.status = -(ORANGEFS_ERROR_BIT | 8);
+	ret = -ENOMEM;
+	goto wakeup;
 }
 
 /* Returns whether any FS are still pending remounted */
diff --git a/fs/orangefs/orangefs-cache.c b/fs/orangefs/orangefs-cache.c
index 817092a..900a2e3 100644
--- a/fs/orangefs/orangefs-cache.c
+++ b/fs/orangefs/orangefs-cache.c
@@ -120,8 +120,6 @@ struct orangefs_kernel_op_s *op_alloc(__s32 type)
 		spin_lock_init(&new_op->lock);
 		init_completion(&new_op->waitq);
 
-		atomic_set(&new_op->ref_count, 1);
-
 		new_op->upcall.type = ORANGEFS_VFS_OP_INVALID;
 		new_op->downcall.type = ORANGEFS_VFS_OP_INVALID;
 		new_op->downcall.status = -1;
@@ -149,7 +147,7 @@ struct orangefs_kernel_op_s *op_alloc(__s32 type)
 	return new_op;
 }
 
-void __op_release(struct orangefs_kernel_op_s *orangefs_op)
+void op_release(struct orangefs_kernel_op_s *orangefs_op)
 {
 	if (orangefs_op) {
 		gossip_debug(GOSSIP_CACHE_DEBUG,
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index 1f8310c..8688d7c 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -98,6 +98,7 @@ enum orangefs_vfs_op_states {
 	OP_VFS_STATE_SERVICED = 4,
 	OP_VFS_STATE_PURGED = 8,
 	OP_VFS_STATE_GIVEN_UP = 16,
+	OP_VFS_STATE_COPYING = 32,
 };
 
 /*
@@ -205,8 +206,6 @@ struct orangefs_kernel_op_s {
 	struct completion waitq;
 	spinlock_t lock;
 
-	atomic_t ref_count;
-
 	/* VFS aio fields */
 
 	int attempts;
@@ -230,23 +229,7 @@ static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op)
 #define op_state_given_up(op)    ((op)->op_state & OP_VFS_STATE_GIVEN_UP)
 #define op_is_cancel(op)         ((op)->downcall.type == ORANGEFS_VFS_OP_CANCEL)
 
-static inline void get_op(struct orangefs_kernel_op_s *op)
-{
-	atomic_inc(&op->ref_count);
-	gossip_debug(GOSSIP_DEV_DEBUG,
-			"(get) Alloced OP (%p:%llu)\n",	op, llu(op->tag));
-}
-
-void __op_release(struct orangefs_kernel_op_s *op);
-
-static inline void op_release(struct orangefs_kernel_op_s *op)
-{
-	if (atomic_dec_and_test(&op->ref_count)) {
-		gossip_debug(GOSSIP_DEV_DEBUG,
-			"(put) Releasing OP (%p:%llu)\n", op, llu((op)->tag));
-		__op_release(op);
-	}
-}
+void op_release(struct orangefs_kernel_op_s *op);
 
 extern void orangefs_bufmap_put(int);
 static inline void put_cancel(struct orangefs_kernel_op_s *op)
diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c
index d980240..55737b5 100644
--- a/fs/orangefs/waitqueue.c
+++ b/fs/orangefs/waitqueue.c
@@ -200,6 +200,7 @@ bool orangefs_cancel_op_in_progress(struct orangefs_kernel_op_s *op)
 
 static void orangefs_clean_up_interrupted_operation(struct orangefs_kernel_op_s *op)
 {
+	bool copying;
 	/*
 	 * handle interrupted cases depending on what state we were in when
 	 * the interruption is detected.  there is a coarse grained lock
@@ -208,6 +209,7 @@ static void orangefs_clean_up_interrupted_operation(struct orangefs_kernel_op_s
 	 * Called with op->lock held.
 	 */
 	op->op_state |= OP_VFS_STATE_GIVEN_UP;
+	copying = op->op_state & OP_VFS_STATE_COPYING;
 
 	if (op_state_waiting(op)) {
 		/*
@@ -243,6 +245,8 @@ static void orangefs_clean_up_interrupted_operation(struct orangefs_kernel_op_s
 		gossip_err("%s: can't get here.\n", __func__);
 		spin_unlock(&op->lock);
 	}
+	if (copying)
+		wait_for_completion(&op->waitq);
 	reinit_completion(&op->waitq);
 }
 

^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-18  0:04                                                                                                                   ` Al Viro
@ 2016-02-18 11:11                                                                                                                     ` Al Viro
  2016-02-18 18:58                                                                                                                       ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-18 11:11 UTC (permalink / raw)
  To: Martin Brandenburg
  Cc: Mike Marshall, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Thu, Feb 18, 2016 at 12:04:39AM +0000, Al Viro wrote:
> Looks like the right approach is to have orangefs_clean_... hitting the
> sucker being copied to/from daemon to wait until that's finished (and
> discarded).  That, BTW, would have an extra benefit of making life simpler
> for refcounting.
> 
> So...  We need to have them marked as "being copied" for the duration, instead
> of bumping the refcount.  That setting and dropping that flag should happen
> under op->lock.  Setting it should happen only if it's not given up (that would
> be interpreted as "not found").  Cleaning, OTOH, would recheck the "given up"
> and do complete(&op->waitq) in case it's been given up...
> 
> How about this (instead of the previous variant, includes a fix for
> errno bogosity spotted a bit upthread; if it works, it'll need a bit of
> splitup)

Better yet, let's use list_del_init() on op->list instead of those list_del().
Then, seeing that ..._clean_interrupted_... can't be called in case of
serviced (we hadn't dropped op->lock since the time we'd checked it), we
can use list_empty(&op->list) as a test for "given up while copying to/from
daemon", so there's no need for separate flag that way:
	* we never pick given up op from list/hash
	* daemon read/write_iter never modifies op->list after op has
been given up
	* if op is given up while copying to/from userland in daemon
read/write_iter, it will call complete(&op->waitq) once it finds that,
so giveup side can wait for completion if it finds op it's about to give
up not on any list.

Should be equivalent to the previous variant, but IMO it's cleaner that
way...

diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
index b27ed1c..f7914f5 100644
--- a/fs/orangefs/devorangefs-req.c
+++ b/fs/orangefs/devorangefs-req.c
@@ -58,9 +58,9 @@ static struct orangefs_kernel_op_s *orangefs_devreq_remove_op(__u64 tag)
 				 next,
 				 &htable_ops_in_progress[index],
 				 list) {
-		if (op->tag == tag && !op_state_purged(op)) {
+		if (op->tag == tag && !op_state_purged(op) &&
+		    !op_state_given_up(op)) {
 			list_del_init(&op->list);
-			get_op(op); /* increase ref count. */
 			spin_unlock(&htable_ops_in_progress_lock);
 			return op;
 		}
@@ -133,7 +133,7 @@ restart:
 		__s32 fsid;
 		/* This lock is held past the end of the loop when we break. */
 		spin_lock(&op->lock);
-		if (unlikely(op_state_purged(op))) {
+		if (unlikely(op_state_purged(op) || op_state_given_up(op))) {
 			spin_unlock(&op->lock);
 			continue;
 		}
@@ -199,13 +199,12 @@ restart:
 	 */
 	if (op_state_in_progress(cur_op) || op_state_serviced(cur_op)) {
 		gossip_err("orangefs: ERROR: Current op already queued.\n");
-		list_del(&cur_op->list);
+		list_del_init(&cur_op->list);
 		spin_unlock(&cur_op->lock);
 		spin_unlock(&orangefs_request_list_lock);
 		return -EAGAIN;
 	}
 	list_del_init(&cur_op->list);
-	get_op(op);
 	spin_unlock(&orangefs_request_list_lock);
 
 	spin_unlock(&cur_op->lock);
@@ -230,7 +229,7 @@ restart:
 	if (unlikely(op_state_given_up(cur_op))) {
 		spin_unlock(&cur_op->lock);
 		spin_unlock(&htable_ops_in_progress_lock);
-		op_release(cur_op);
+		complete(&cur_op->waitq);
 		goto restart;
 	}
 
@@ -242,7 +241,6 @@ restart:
 	orangefs_devreq_add_op(cur_op);
 	spin_unlock(&cur_op->lock);
 	spin_unlock(&htable_ops_in_progress_lock);
-	op_release(cur_op);
 
 	/* The client only asks to read one size buffer. */
 	return MAX_DEV_REQ_UPSIZE;
@@ -258,10 +256,12 @@ error:
 	if (likely(!op_state_given_up(cur_op))) {
 		set_op_state_waiting(cur_op);
 		list_add(&cur_op->list, &orangefs_request_list);
+		spin_unlock(&cur_op->lock);
+	} else {
+		spin_unlock(&cur_op->lock);
+		complete(&cur_op->waitq);
 	}
-	spin_unlock(&cur_op->lock);
 	spin_unlock(&orangefs_request_list_lock);
-	op_release(cur_op);
 	return -EFAULT;
 }
 
@@ -333,8 +333,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 	n = copy_from_iter(&op->downcall, downcall_size, iter);
 	if (n != downcall_size) {
 		gossip_err("%s: failed to copy downcall.\n", __func__);
-		ret = -EFAULT;
-		goto Broken;
+		goto Efault;
 	}
 
 	if (op->downcall.status)
@@ -354,8 +353,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 			   downcall_size,
 			   op->downcall.trailer_size,
 			   total);
-		ret = -EFAULT;
-		goto Broken;
+		goto Efault;
 	}
 
 	/* Only READDIR operations should have trailers. */
@@ -364,8 +362,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 		gossip_err("%s: %x operation with trailer.",
 			   __func__,
 			   op->downcall.type);
-		ret = -EFAULT;
-		goto Broken;
+		goto Efault;
 	}
 
 	/* READDIR operations should always have trailers. */
@@ -374,8 +371,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 		gossip_err("%s: %x operation with no trailer.",
 			   __func__,
 			   op->downcall.type);
-		ret = -EFAULT;
-		goto Broken;
+		goto Efault;
 	}
 
 	if (op->downcall.type != ORANGEFS_VFS_OP_READDIR)
@@ -386,8 +382,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 	if (op->downcall.trailer_buf == NULL) {
 		gossip_err("%s: failed trailer vmalloc.\n",
 			   __func__);
-		ret = -ENOMEM;
-		goto Broken;
+		goto Enomem;
 	}
 	memset(op->downcall.trailer_buf, 0, op->downcall.trailer_size);
 	n = copy_from_iter(op->downcall.trailer_buf,
@@ -396,8 +391,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 	if (n != op->downcall.trailer_size) {
 		gossip_err("%s: failed to copy trailer.\n", __func__);
 		vfree(op->downcall.trailer_buf);
-		ret = -EFAULT;
-		goto Broken;
+		goto Efault;
 	}
 
 wakeup:
@@ -406,38 +400,27 @@ wakeup:
 	 * that this op is done
 	 */
 	spin_lock(&op->lock);
-	if (unlikely(op_state_given_up(op))) {
+	if (unlikely(op_is_cancel(op))) {
 		spin_unlock(&op->lock);
-		goto out;
-	}
-	set_op_state_serviced(op);
-	spin_unlock(&op->lock);
-
-	/*
-	 * If this operation is an I/O operation we need to wait
-	 * for all data to be copied before we can return to avoid
-	 * buffer corruption and races that can pull the buffers
-	 * out from under us.
-	 *
-	 * Essentially we're synchronizing with other parts of the
-	 * vfs implicitly by not allowing the user space
-	 * application reading/writing this device to return until
-	 * the buffers are done being used.
-	 */
-out:
-	if (unlikely(op_is_cancel(op)))
 		put_cancel(op);
-	op_release(op);
-	return ret;
-
-Broken:
-	spin_lock(&op->lock);
-	if (!op_state_given_up(op)) {
-		op->downcall.status = ret;
+	} else if (unlikely(op_state_given_up(op))) {
+		spin_unlock(&op->lock);
+		complete(&op->waitq);
+	} else {
 		set_op_state_serviced(op);
+		spin_unlock(&op->lock);
 	}
-	spin_unlock(&op->lock);
-	goto out;
+	return ret;
+
+Efault:
+	op->downcall.status = -(ORANGEFS_ERROR_BIT | 9);
+	ret = -EFAULT;
+	goto wakeup;
+
+Enomem:
+	op->downcall.status = -(ORANGEFS_ERROR_BIT | 8);
+	ret = -ENOMEM;
+	goto wakeup;
 }
 
 /* Returns whether any FS are still pending remounted */
diff --git a/fs/orangefs/orangefs-cache.c b/fs/orangefs/orangefs-cache.c
index 817092a..900a2e3 100644
--- a/fs/orangefs/orangefs-cache.c
+++ b/fs/orangefs/orangefs-cache.c
@@ -120,8 +120,6 @@ struct orangefs_kernel_op_s *op_alloc(__s32 type)
 		spin_lock_init(&new_op->lock);
 		init_completion(&new_op->waitq);
 
-		atomic_set(&new_op->ref_count, 1);
-
 		new_op->upcall.type = ORANGEFS_VFS_OP_INVALID;
 		new_op->downcall.type = ORANGEFS_VFS_OP_INVALID;
 		new_op->downcall.status = -1;
@@ -149,7 +147,7 @@ struct orangefs_kernel_op_s *op_alloc(__s32 type)
 	return new_op;
 }
 
-void __op_release(struct orangefs_kernel_op_s *orangefs_op)
+void op_release(struct orangefs_kernel_op_s *orangefs_op)
 {
 	if (orangefs_op) {
 		gossip_debug(GOSSIP_CACHE_DEBUG,
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index 1f8310c..e387d3c 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -205,8 +205,6 @@ struct orangefs_kernel_op_s {
 	struct completion waitq;
 	spinlock_t lock;
 
-	atomic_t ref_count;
-
 	/* VFS aio fields */
 
 	int attempts;
@@ -230,23 +228,7 @@ static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op)
 #define op_state_given_up(op)    ((op)->op_state & OP_VFS_STATE_GIVEN_UP)
 #define op_is_cancel(op)         ((op)->downcall.type == ORANGEFS_VFS_OP_CANCEL)
 
-static inline void get_op(struct orangefs_kernel_op_s *op)
-{
-	atomic_inc(&op->ref_count);
-	gossip_debug(GOSSIP_DEV_DEBUG,
-			"(get) Alloced OP (%p:%llu)\n",	op, llu(op->tag));
-}
-
-void __op_release(struct orangefs_kernel_op_s *op);
-
-static inline void op_release(struct orangefs_kernel_op_s *op)
-{
-	if (atomic_dec_and_test(&op->ref_count)) {
-		gossip_debug(GOSSIP_DEV_DEBUG,
-			"(put) Releasing OP (%p:%llu)\n", op, llu((op)->tag));
-		__op_release(op);
-	}
-}
+void op_release(struct orangefs_kernel_op_s *op);
 
 extern void orangefs_bufmap_put(int);
 static inline void put_cancel(struct orangefs_kernel_op_s *op)
@@ -259,7 +241,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
 {
 	spin_lock(&op->lock);
 	if (unlikely(op_is_cancel(op))) {
-		list_del(&op->list);
+		list_del_init(&op->list);
 		spin_unlock(&op->lock);
 		put_cancel(op);
 	} else {
diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c
index d980240..3f9e430 100644
--- a/fs/orangefs/waitqueue.c
+++ b/fs/orangefs/waitqueue.c
@@ -208,15 +208,20 @@ static void orangefs_clean_up_interrupted_operation(struct orangefs_kernel_op_s
 	 * Called with op->lock held.
 	 */
 	op->op_state |= OP_VFS_STATE_GIVEN_UP;
-
-	if (op_state_waiting(op)) {
+	/* from that point on it can't be moved by anybody else */
+	if (list_empty(&op->list)) {
+		/* caught copying to/from daemon */
+		BUG_ON(op_state_serviced(op));
+		spin_unlock(&op->lock);
+		wait_for_completion(&op->waitq);
+	} else if (op_state_waiting(op)) {
 		/*
 		 * upcall hasn't been read; remove op from upcall request
 		 * list.
 		 */
 		spin_unlock(&op->lock);
 		spin_lock(&orangefs_request_list_lock);
-		list_del(&op->list);
+		list_del_init(&op->list);
 		spin_unlock(&orangefs_request_list_lock);
 		gossip_debug(GOSSIP_WAIT_DEBUG,
 			     "Interrupted: Removed op %p from request_list\n",
@@ -225,23 +230,16 @@ static void orangefs_clean_up_interrupted_operation(struct orangefs_kernel_op_s
 		/* op must be removed from the in progress htable */
 		spin_unlock(&op->lock);
 		spin_lock(&htable_ops_in_progress_lock);
-		list_del(&op->list);
+		list_del_init(&op->list);
 		spin_unlock(&htable_ops_in_progress_lock);
 		gossip_debug(GOSSIP_WAIT_DEBUG,
 			     "Interrupted: Removed op %p"
 			     " from htable_ops_in_progress\n",
 			     op);
-	} else if (!op_state_serviced(op)) {
+	} else {
 		spin_unlock(&op->lock);
 		gossip_err("interrupted operation is in a weird state 0x%x\n",
 			   op->op_state);
-	} else {
-		/*
-		 * It is not intended for execution to flow here,
-		 * but having this unlock here makes sparse happy.
-		 */
-		gossip_err("%s: can't get here.\n", __func__);
-		spin_unlock(&op->lock);
 	}
 	reinit_completion(&op->waitq);
 }

^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-18 11:11                                                                                                                     ` Al Viro
@ 2016-02-18 18:58                                                                                                                       ` Mike Marshall
  2016-02-18 19:20                                                                                                                         ` Al Viro
  2016-02-18 19:49                                                                                                                         ` Martin Brandenburg
  0 siblings, 2 replies; 111+ messages in thread
From: Mike Marshall @ 2016-02-18 18:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel,
	Stephen Rothwell, Mike Marshall

Still busted, exactly the same, I think. The doomed op gets a good
return code from is_daemon_in_service in service_operation but
gets EAGAIN from wait_for_matching_downcall... an edge case kind of
problem.

Here's the raw (well, slightly edited for readability) logs showing
the doomed op and subsequent failed op that uses the bogus handle
and fsid from the doomed op.



Alloced OP (ffff880012898000: 10889 OP_CREATE)
service_operation: orangefs_create op:ffff880012898000:



wait_for_matching_downcall: operation purged (tag 10889, ffff880012898000, att 0
service_operation: wait_for_matching_downcall returned -11 for ffff880012898000
Interrupted: Removed op ffff880012898000 from htable_ops_in_progress
tag 10889 (orangefs_create) -- operation to be retried (1 attempt)
service_operation: orangefs_create op:ffff880012898000:
service_operation:client core is NOT in service, ffff880012898000



service_operation: wait_for_matching_downcall returned 0 for ffff880012898000
service_operation orangefs_create returning: 0 for ffff880012898000
orangefs_create: PPTOOLS1.PPA:
handle:00000000-0000-0000-0000-000000000000: fsid:0:
new_op:ffff880012898000: ret:0:



Alloced OP (ffff880012888000: 10958 OP_GETATTR)
service_operation: orangefs_inode_getattr op:ffff880012888000:
service_operation: wait_for_matching_downcall returned 0 for ffff880012888000
service_operation orangefs_inode_getattr returning: -22 for ffff880012888000
Releasing OP (ffff880012888000: 10958
orangefs_create: Failed to allocate inode for file :PPTOOLS1.PPA:
Releasing OP (ffff880012898000: 10889




What I'm testing with differs from what is at kernel.org#for-next by
  - diffs from Al's most recent email
  - 1 souped up gossip message
  - changed 0 to OP_VFS_STATE_UNKNOWN one place in service_operation
  - reinit_completion(&op->waitq) in orangefs_clean_up_interrupted_operation



diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
index b27ed1c..f7914f5 100644
--- a/fs/orangefs/devorangefs-req.c
+++ b/fs/orangefs/devorangefs-req.c
@@ -58,9 +58,9 @@ static struct orangefs_kernel_op_s
*orangefs_devreq_remove_op(__u64 tag)
  next,
  &htable_ops_in_progress[index],
  list) {
- if (op->tag == tag && !op_state_purged(op)) {
+ if (op->tag == tag && !op_state_purged(op) &&
+    !op_state_given_up(op)) {
  list_del_init(&op->list);
- get_op(op); /* increase ref count. */
  spin_unlock(&htable_ops_in_progress_lock);
  return op;
  }
@@ -133,7 +133,7 @@ restart:
  __s32 fsid;
  /* This lock is held past the end of the loop when we break. */
  spin_lock(&op->lock);
- if (unlikely(op_state_purged(op))) {
+ if (unlikely(op_state_purged(op) || op_state_given_up(op))) {
  spin_unlock(&op->lock);
  continue;
  }
@@ -199,13 +199,12 @@ restart:
  */
  if (op_state_in_progress(cur_op) || op_state_serviced(cur_op)) {
  gossip_err("orangefs: ERROR: Current op already queued.\n");
- list_del(&cur_op->list);
+ list_del_init(&cur_op->list);
  spin_unlock(&cur_op->lock);
  spin_unlock(&orangefs_request_list_lock);
  return -EAGAIN;
  }
  list_del_init(&cur_op->list);
- get_op(op);
  spin_unlock(&orangefs_request_list_lock);

  spin_unlock(&cur_op->lock);
@@ -230,7 +229,7 @@ restart:
  if (unlikely(op_state_given_up(cur_op))) {
  spin_unlock(&cur_op->lock);
  spin_unlock(&htable_ops_in_progress_lock);
- op_release(cur_op);
+ complete(&cur_op->waitq);
  goto restart;
  }

@@ -242,7 +241,6 @@ restart:
  orangefs_devreq_add_op(cur_op);
  spin_unlock(&cur_op->lock);
  spin_unlock(&htable_ops_in_progress_lock);
- op_release(cur_op);

  /* The client only asks to read one size buffer. */
  return MAX_DEV_REQ_UPSIZE;
@@ -258,10 +256,12 @@ error:
  if (likely(!op_state_given_up(cur_op))) {
  set_op_state_waiting(cur_op);
  list_add(&cur_op->list, &orangefs_request_list);
+ spin_unlock(&cur_op->lock);
+ } else {
+ spin_unlock(&cur_op->lock);
+ complete(&cur_op->waitq);
  }
- spin_unlock(&cur_op->lock);
  spin_unlock(&orangefs_request_list_lock);
- op_release(cur_op);
  return -EFAULT;
 }

@@ -333,8 +333,7 @@ static ssize_t orangefs_devreq_write_iter(struct
kiocb *iocb,
  n = copy_from_iter(&op->downcall, downcall_size, iter);
  if (n != downcall_size) {
  gossip_err("%s: failed to copy downcall.\n", __func__);
- ret = -EFAULT;
- goto Broken;
+ goto Efault;
  }

  if (op->downcall.status)
@@ -354,8 +353,7 @@ static ssize_t orangefs_devreq_write_iter(struct
kiocb *iocb,
    downcall_size,
    op->downcall.trailer_size,
    total);
- ret = -EFAULT;
- goto Broken;
+ goto Efault;
  }

  /* Only READDIR operations should have trailers. */
@@ -364,8 +362,7 @@ static ssize_t orangefs_devreq_write_iter(struct
kiocb *iocb,
  gossip_err("%s: %x operation with trailer.",
    __func__,
    op->downcall.type);
- ret = -EFAULT;
- goto Broken;
+ goto Efault;
  }

  /* READDIR operations should always have trailers. */
@@ -374,8 +371,7 @@ static ssize_t orangefs_devreq_write_iter(struct
kiocb *iocb,
  gossip_err("%s: %x operation with no trailer.",
    __func__,
    op->downcall.type);
- ret = -EFAULT;
- goto Broken;
+ goto Efault;
  }

  if (op->downcall.type != ORANGEFS_VFS_OP_READDIR)
@@ -386,8 +382,7 @@ static ssize_t orangefs_devreq_write_iter(struct
kiocb *iocb,
  if (op->downcall.trailer_buf == NULL) {
  gossip_err("%s: failed trailer vmalloc.\n",
    __func__);
- ret = -ENOMEM;
- goto Broken;
+ goto Enomem;
  }
  memset(op->downcall.trailer_buf, 0, op->downcall.trailer_size);
  n = copy_from_iter(op->downcall.trailer_buf,
@@ -396,8 +391,7 @@ static ssize_t orangefs_devreq_write_iter(struct
kiocb *iocb,
  if (n != op->downcall.trailer_size) {
  gossip_err("%s: failed to copy trailer.\n", __func__);
  vfree(op->downcall.trailer_buf);
- ret = -EFAULT;
- goto Broken;
+ goto Efault;
  }

 wakeup:
@@ -406,38 +400,27 @@ wakeup:
  * that this op is done
  */
  spin_lock(&op->lock);
- if (unlikely(op_state_given_up(op))) {
+ if (unlikely(op_is_cancel(op))) {
  spin_unlock(&op->lock);
- goto out;
- }
- set_op_state_serviced(op);
- spin_unlock(&op->lock);
-
- /*
- * If this operation is an I/O operation we need to wait
- * for all data to be copied before we can return to avoid
- * buffer corruption and races that can pull the buffers
- * out from under us.
- *
- * Essentially we're synchronizing with other parts of the
- * vfs implicitly by not allowing the user space
- * application reading/writing this device to return until
- * the buffers are done being used.
- */
-out:
- if (unlikely(op_is_cancel(op)))
  put_cancel(op);
- op_release(op);
- return ret;
-
-Broken:
- spin_lock(&op->lock);
- if (!op_state_given_up(op)) {
- op->downcall.status = ret;
+ } else if (unlikely(op_state_given_up(op))) {
+ spin_unlock(&op->lock);
+ complete(&op->waitq);
+ } else {
  set_op_state_serviced(op);
+ spin_unlock(&op->lock);
  }
- spin_unlock(&op->lock);
- goto out;
+ return ret;
+
+Efault:
+ op->downcall.status = -(ORANGEFS_ERROR_BIT | 9);
+ ret = -EFAULT;
+ goto wakeup;
+
+Enomem:
+ op->downcall.status = -(ORANGEFS_ERROR_BIT | 8);
+ ret = -ENOMEM;
+ goto wakeup;
 }

 /* Returns whether any FS are still pending remounted */
diff --git a/fs/orangefs/orangefs-cache.c b/fs/orangefs/orangefs-cache.c
index 817092a..900a2e3 100644
--- a/fs/orangefs/orangefs-cache.c
+++ b/fs/orangefs/orangefs-cache.c
@@ -120,8 +120,6 @@ struct orangefs_kernel_op_s *op_alloc(__s32 type)
  spin_lock_init(&new_op->lock);
  init_completion(&new_op->waitq);

- atomic_set(&new_op->ref_count, 1);
-
  new_op->upcall.type = ORANGEFS_VFS_OP_INVALID;
  new_op->downcall.type = ORANGEFS_VFS_OP_INVALID;
  new_op->downcall.status = -1;
@@ -149,7 +147,7 @@ struct orangefs_kernel_op_s *op_alloc(__s32 type)
  return new_op;
 }

-void __op_release(struct orangefs_kernel_op_s *orangefs_op)
+void op_release(struct orangefs_kernel_op_s *orangefs_op)
 {
  if (orangefs_op) {
  gossip_debug(GOSSIP_CACHE_DEBUG,
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index 1f8310c..e387d3c 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -205,8 +205,6 @@ struct orangefs_kernel_op_s {
  struct completion waitq;
  spinlock_t lock;

- atomic_t ref_count;
-
  /* VFS aio fields */

  int attempts;
@@ -230,23 +228,7 @@ static inline void set_op_state_serviced(struct
orangefs_kernel_op_s *op)
 #define op_state_given_up(op)    ((op)->op_state & OP_VFS_STATE_GIVEN_UP)
 #define op_is_cancel(op)         ((op)->downcall.type ==
ORANGEFS_VFS_OP_CANCEL)

-static inline void get_op(struct orangefs_kernel_op_s *op)
-{
- atomic_inc(&op->ref_count);
- gossip_debug(GOSSIP_DEV_DEBUG,
- "(get) Alloced OP (%p:%llu)\n", op, llu(op->tag));
-}
-
-void __op_release(struct orangefs_kernel_op_s *op);
-
-static inline void op_release(struct orangefs_kernel_op_s *op)
-{
- if (atomic_dec_and_test(&op->ref_count)) {
- gossip_debug(GOSSIP_DEV_DEBUG,
- "(put) Releasing OP (%p:%llu)\n", op, llu((op)->tag));
- __op_release(op);
- }
-}
+void op_release(struct orangefs_kernel_op_s *op);

 extern void orangefs_bufmap_put(int);
 static inline void put_cancel(struct orangefs_kernel_op_s *op)
@@ -259,7 +241,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
 {
  spin_lock(&op->lock);
  if (unlikely(op_is_cancel(op))) {
- list_del(&op->list);
+ list_del_init(&op->list);
  spin_unlock(&op->lock);
  put_cancel(op);
  } else {
diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c
index 2528d58..0ea2741 100644
--- a/fs/orangefs/waitqueue.c
+++ b/fs/orangefs/waitqueue.c
@@ -65,7 +65,7 @@ int service_operation(struct orangefs_kernel_op_s *op,
  op->upcall.pid = current->pid;

 retry_servicing:
- op->downcall.status = 0;
+ op->downcall.status = OP_VFS_STATE_UNKNOWN;
  gossip_debug(GOSSIP_WAIT_DEBUG,
      "%s: %s op:%p: process:%s: pid:%d:\n",
      __func__,
@@ -103,8 +103,9 @@ retry_servicing:
  wake_up_interruptible(&orangefs_request_list_waitq);
  if (!__is_daemon_in_service()) {
  gossip_debug(GOSSIP_WAIT_DEBUG,
-     "%s:client core is NOT in service.\n",
-     __func__);
+     "%s:client core is NOT in service, %p.\n",
+     __func__,
+     op);
  timeout = op_timeout_secs * HZ;
  }
  spin_unlock(&orangefs_request_list_lock);
@@ -208,15 +209,20 @@ static void orangefs_clean_up_interrupted_operation(struct orangefs_kernel_op_s
  * Called with op->lock held.
  */
  op->op_state |= OP_VFS_STATE_GIVEN_UP;
-
- if (op_state_waiting(op)) {
+ /* from that point on it can't be moved by anybody else */
+ if (list_empty(&op->list)) {
+ /* caught copying to/from daemon */
+ BUG_ON(op_state_serviced(op));
+ spin_unlock(&op->lock);
+ wait_for_completion(&op->waitq);
+ } else if (op_state_waiting(op)) {
  /*
  * upcall hasn't been read; remove op from upcall request
  * list.
  */
  spin_unlock(&op->lock);
  spin_lock(&orangefs_request_list_lock);
- list_del(&op->list);
+ list_del_init(&op->list);
  spin_unlock(&orangefs_request_list_lock);
  gossip_debug(GOSSIP_WAIT_DEBUG,
      "Interrupted: Removed op %p from request_list\n",
@@ -225,24 +231,18 @@ static void orangefs_clean_up_interrupted_operation(struct orangefs_kernel_op_s
  /* op must be removed from the in progress htable */
  spin_unlock(&op->lock);
  spin_lock(&htable_ops_in_progress_lock);
- list_del(&op->list);
+ list_del_init(&op->list);
  spin_unlock(&htable_ops_in_progress_lock);
  gossip_debug(GOSSIP_WAIT_DEBUG,
      "Interrupted: Removed op %p"
      " from htable_ops_in_progress\n",
      op);
- } else if (!op_state_serviced(op)) {
+ } else {
  spin_unlock(&op->lock);
  gossip_err("interrupted operation is in a weird state 0x%x\n",
    op->op_state);
- } else {
- /*
- * It is not intended for execution to flow here,
- * but having this unlock here makes sparse happy.
- */
- gossip_err("%s: can't get here.\n", __func__);
- spin_unlock(&op->lock);
  }
+ reinit_completion(&op->waitq);
 }

 /*

On Thu, Feb 18, 2016 at 6:11 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Feb 18, 2016 at 12:04:39AM +0000, Al Viro wrote:
>> Looks like the right approach is to have orangefs_clean_... hitting the
>> sucker being copied to/from daemon to wait until that's finished (and
>> discarded).  That, BTW, would have an extra benefit of making life simpler
>> for refcounting.
>>
>> So...  We need to have them marked as "being copied" for the duration, instead
>> of bumping the refcount.  That setting and dropping that flag should happen
>> under op->lock.  Setting it should happen only if it's not given up (that would
>> be interpreted as "not found").  Cleaning, OTOH, would recheck the "given up"
>> and do complete(&op->waitq) in case it's been given up...
>>
>> How about this (instead of the previous variant, includes a fix for
>> errno bogosity spotted a bit upthread; if it works, it'll need a bit of
>> splitup)
>
> Better yet, let's use list_del_init() on op->list instead of those list_del().
> Then, seeing that ..._clean_interrupted_... can't be called in case of
> serviced (we hadn't dropped op->lock since the time we'd checked it), we
> can use list_empty(&op->list) as a test for "given up while copying to/from
> daemon", so there's no need for separate flag that way:
>         * we never pick given up op from list/hash
>         * daemon read/write_iter never modifies op->list after op has
> been given up
>         * if op is given up while copying to/from userland in daemon
> read/write_iter, it will call complete(&op->waitq) once it finds that,
> so giveup side can wait for completion if it finds op it's about to give
> up not on any list.
>
> Should be equivalent to the previous variant, but IMO it's cleaner that
> way...
>
> diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
> index b27ed1c..f7914f5 100644
> --- a/fs/orangefs/devorangefs-req.c
> +++ b/fs/orangefs/devorangefs-req.c
> @@ -58,9 +58,9 @@ static struct orangefs_kernel_op_s *orangefs_devreq_remove_op(__u64 tag)
>                                  next,
>                                  &htable_ops_in_progress[index],
>                                  list) {
> -               if (op->tag == tag && !op_state_purged(op)) {
> +               if (op->tag == tag && !op_state_purged(op) &&
> +                   !op_state_given_up(op)) {
>                         list_del_init(&op->list);
> -                       get_op(op); /* increase ref count. */
>                         spin_unlock(&htable_ops_in_progress_lock);
>                         return op;
>                 }
> @@ -133,7 +133,7 @@ restart:
>                 __s32 fsid;
>                 /* This lock is held past the end of the loop when we break. */
>                 spin_lock(&op->lock);
> -               if (unlikely(op_state_purged(op))) {
> +               if (unlikely(op_state_purged(op) || op_state_given_up(op))) {
>                         spin_unlock(&op->lock);
>                         continue;
>                 }
> @@ -199,13 +199,12 @@ restart:
>          */
>         if (op_state_in_progress(cur_op) || op_state_serviced(cur_op)) {
>                 gossip_err("orangefs: ERROR: Current op already queued.\n");
> -               list_del(&cur_op->list);
> +               list_del_init(&cur_op->list);
>                 spin_unlock(&cur_op->lock);
>                 spin_unlock(&orangefs_request_list_lock);
>                 return -EAGAIN;
>         }
>         list_del_init(&cur_op->list);
> -       get_op(op);
>         spin_unlock(&orangefs_request_list_lock);
>
>         spin_unlock(&cur_op->lock);
> @@ -230,7 +229,7 @@ restart:
>         if (unlikely(op_state_given_up(cur_op))) {
>                 spin_unlock(&cur_op->lock);
>                 spin_unlock(&htable_ops_in_progress_lock);
> -               op_release(cur_op);
> +               complete(&cur_op->waitq);
>                 goto restart;
>         }
>
> @@ -242,7 +241,6 @@ restart:
>         orangefs_devreq_add_op(cur_op);
>         spin_unlock(&cur_op->lock);
>         spin_unlock(&htable_ops_in_progress_lock);
> -       op_release(cur_op);
>
>         /* The client only asks to read one size buffer. */
>         return MAX_DEV_REQ_UPSIZE;
> @@ -258,10 +256,12 @@ error:
>         if (likely(!op_state_given_up(cur_op))) {
>                 set_op_state_waiting(cur_op);
>                 list_add(&cur_op->list, &orangefs_request_list);
> +               spin_unlock(&cur_op->lock);
> +       } else {
> +               spin_unlock(&cur_op->lock);
> +               complete(&cur_op->waitq);
>         }
> -       spin_unlock(&cur_op->lock);
>         spin_unlock(&orangefs_request_list_lock);
> -       op_release(cur_op);
>         return -EFAULT;
>  }
>
> @@ -333,8 +333,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
>         n = copy_from_iter(&op->downcall, downcall_size, iter);
>         if (n != downcall_size) {
>                 gossip_err("%s: failed to copy downcall.\n", __func__);
> -               ret = -EFAULT;
> -               goto Broken;
> +               goto Efault;
>         }
>
>         if (op->downcall.status)
> @@ -354,8 +353,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
>                            downcall_size,
>                            op->downcall.trailer_size,
>                            total);
> -               ret = -EFAULT;
> -               goto Broken;
> +               goto Efault;
>         }
>
>         /* Only READDIR operations should have trailers. */
> @@ -364,8 +362,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
>                 gossip_err("%s: %x operation with trailer.",
>                            __func__,
>                            op->downcall.type);
> -               ret = -EFAULT;
> -               goto Broken;
> +               goto Efault;
>         }
>
>         /* READDIR operations should always have trailers. */
> @@ -374,8 +371,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
>                 gossip_err("%s: %x operation with no trailer.",
>                            __func__,
>                            op->downcall.type);
> -               ret = -EFAULT;
> -               goto Broken;
> +               goto Efault;
>         }
>
>         if (op->downcall.type != ORANGEFS_VFS_OP_READDIR)
> @@ -386,8 +382,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
>         if (op->downcall.trailer_buf == NULL) {
>                 gossip_err("%s: failed trailer vmalloc.\n",
>                            __func__);
> -               ret = -ENOMEM;
> -               goto Broken;
> +               goto Enomem;
>         }
>         memset(op->downcall.trailer_buf, 0, op->downcall.trailer_size);
>         n = copy_from_iter(op->downcall.trailer_buf,
> @@ -396,8 +391,7 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
>         if (n != op->downcall.trailer_size) {
>                 gossip_err("%s: failed to copy trailer.\n", __func__);
>                 vfree(op->downcall.trailer_buf);
> -               ret = -EFAULT;
> -               goto Broken;
> +               goto Efault;
>         }
>
>  wakeup:
> @@ -406,38 +400,27 @@ wakeup:
>          * that this op is done
>          */
>         spin_lock(&op->lock);
> -       if (unlikely(op_state_given_up(op))) {
> +       if (unlikely(op_is_cancel(op))) {
>                 spin_unlock(&op->lock);
> -               goto out;
> -       }
> -       set_op_state_serviced(op);
> -       spin_unlock(&op->lock);
> -
> -       /*
> -        * If this operation is an I/O operation we need to wait
> -        * for all data to be copied before we can return to avoid
> -        * buffer corruption and races that can pull the buffers
> -        * out from under us.
> -        *
> -        * Essentially we're synchronizing with other parts of the
> -        * vfs implicitly by not allowing the user space
> -        * application reading/writing this device to return until
> -        * the buffers are done being used.
> -        */
> -out:
> -       if (unlikely(op_is_cancel(op)))
>                 put_cancel(op);
> -       op_release(op);
> -       return ret;
> -
> -Broken:
> -       spin_lock(&op->lock);
> -       if (!op_state_given_up(op)) {
> -               op->downcall.status = ret;
> +       } else if (unlikely(op_state_given_up(op))) {
> +               spin_unlock(&op->lock);
> +               complete(&op->waitq);
> +       } else {
>                 set_op_state_serviced(op);
> +               spin_unlock(&op->lock);
>         }
> -       spin_unlock(&op->lock);
> -       goto out;
> +       return ret;
> +
> +Efault:
> +       op->downcall.status = -(ORANGEFS_ERROR_BIT | 9);
> +       ret = -EFAULT;
> +       goto wakeup;
> +
> +Enomem:
> +       op->downcall.status = -(ORANGEFS_ERROR_BIT | 8);
> +       ret = -ENOMEM;
> +       goto wakeup;
>  }
>
>  /* Returns whether any FS are still pending remounted */
> diff --git a/fs/orangefs/orangefs-cache.c b/fs/orangefs/orangefs-cache.c
> index 817092a..900a2e3 100644
> --- a/fs/orangefs/orangefs-cache.c
> +++ b/fs/orangefs/orangefs-cache.c
> @@ -120,8 +120,6 @@ struct orangefs_kernel_op_s *op_alloc(__s32 type)
>                 spin_lock_init(&new_op->lock);
>                 init_completion(&new_op->waitq);
>
> -               atomic_set(&new_op->ref_count, 1);
> -
>                 new_op->upcall.type = ORANGEFS_VFS_OP_INVALID;
>                 new_op->downcall.type = ORANGEFS_VFS_OP_INVALID;
>                 new_op->downcall.status = -1;
> @@ -149,7 +147,7 @@ struct orangefs_kernel_op_s *op_alloc(__s32 type)
>         return new_op;
>  }
>
> -void __op_release(struct orangefs_kernel_op_s *orangefs_op)
> +void op_release(struct orangefs_kernel_op_s *orangefs_op)
>  {
>         if (orangefs_op) {
>                 gossip_debug(GOSSIP_CACHE_DEBUG,
> diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
> index 1f8310c..e387d3c 100644
> --- a/fs/orangefs/orangefs-kernel.h
> +++ b/fs/orangefs/orangefs-kernel.h
> @@ -205,8 +205,6 @@ struct orangefs_kernel_op_s {
>         struct completion waitq;
>         spinlock_t lock;
>
> -       atomic_t ref_count;
> -
>         /* VFS aio fields */
>
>         int attempts;
> @@ -230,23 +228,7 @@ static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op)
>  #define op_state_given_up(op)    ((op)->op_state & OP_VFS_STATE_GIVEN_UP)
>  #define op_is_cancel(op)         ((op)->downcall.type == ORANGEFS_VFS_OP_CANCEL)
>
> -static inline void get_op(struct orangefs_kernel_op_s *op)
> -{
> -       atomic_inc(&op->ref_count);
> -       gossip_debug(GOSSIP_DEV_DEBUG,
> -                       "(get) Alloced OP (%p:%llu)\n", op, llu(op->tag));
> -}
> -
> -void __op_release(struct orangefs_kernel_op_s *op);
> -
> -static inline void op_release(struct orangefs_kernel_op_s *op)
> -{
> -       if (atomic_dec_and_test(&op->ref_count)) {
> -               gossip_debug(GOSSIP_DEV_DEBUG,
> -                       "(put) Releasing OP (%p:%llu)\n", op, llu((op)->tag));
> -               __op_release(op);
> -       }
> -}
> +void op_release(struct orangefs_kernel_op_s *op);
>
>  extern void orangefs_bufmap_put(int);
>  static inline void put_cancel(struct orangefs_kernel_op_s *op)
> @@ -259,7 +241,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
>  {
>         spin_lock(&op->lock);
>         if (unlikely(op_is_cancel(op))) {
> -               list_del(&op->list);
> +               list_del_init(&op->list);
>                 spin_unlock(&op->lock);
>                 put_cancel(op);
>         } else {
> diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c
> index d980240..3f9e430 100644
> --- a/fs/orangefs/waitqueue.c
> +++ b/fs/orangefs/waitqueue.c
> @@ -208,15 +208,20 @@ static void orangefs_clean_up_interrupted_operation(struct orangefs_kernel_op_s
>          * Called with op->lock held.
>          */
>         op->op_state |= OP_VFS_STATE_GIVEN_UP;
> -
> -       if (op_state_waiting(op)) {
> +       /* from that point on it can't be moved by anybody else */
> +       if (list_empty(&op->list)) {
> +               /* caught copying to/from daemon */
> +               BUG_ON(op_state_serviced(op));
> +               spin_unlock(&op->lock);
> +               wait_for_completion(&op->waitq);
> +       } else if (op_state_waiting(op)) {
>                 /*
>                  * upcall hasn't been read; remove op from upcall request
>                  * list.
>                  */
>                 spin_unlock(&op->lock);
>                 spin_lock(&orangefs_request_list_lock);
> -               list_del(&op->list);
> +               list_del_init(&op->list);
>                 spin_unlock(&orangefs_request_list_lock);
>                 gossip_debug(GOSSIP_WAIT_DEBUG,
>                              "Interrupted: Removed op %p from request_list\n",
> @@ -225,23 +230,16 @@ static void orangefs_clean_up_interrupted_operation(struct orangefs_kernel_op_s
>                 /* op must be removed from the in progress htable */
>                 spin_unlock(&op->lock);
>                 spin_lock(&htable_ops_in_progress_lock);
> -               list_del(&op->list);
> +               list_del_init(&op->list);
>                 spin_unlock(&htable_ops_in_progress_lock);
>                 gossip_debug(GOSSIP_WAIT_DEBUG,
>                              "Interrupted: Removed op %p"
>                              " from htable_ops_in_progress\n",
>                              op);
> -       } else if (!op_state_serviced(op)) {
> +       } else {
>                 spin_unlock(&op->lock);
>                 gossip_err("interrupted operation is in a weird state 0x%x\n",
>                            op->op_state);
> -       } else {
> -               /*
> -                * It is not intended for execution to flow here,
> -                * but having this unlock here makes sparse happy.
> -                */
> -               gossip_err("%s: can't get here.\n", __func__);
> -               spin_unlock(&op->lock);
>         }
>         reinit_completion(&op->waitq);
>  }

^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-18 18:58                                                                                                                       ` Mike Marshall
@ 2016-02-18 19:20                                                                                                                         ` Al Viro
  2016-02-18 19:49                                                                                                                         ` Martin Brandenburg
  1 sibling, 0 replies; 111+ messages in thread
From: Al Viro @ 2016-02-18 19:20 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Thu, Feb 18, 2016 at 01:58:52PM -0500, Mike Marshall wrote:

> wait_for_matching_downcall: operation purged (tag 10889, ffff880012898000, att 0
> service_operation: wait_for_matching_downcall returned -11 for ffff880012898000
> Interrupted: Removed op ffff880012898000 from htable_ops_in_progress

state is "in progress"

> tag 10889 (orangefs_create) -- operation to be retried (1 attempt)
> service_operation: orangefs_create op:ffff880012898000:

moved to "waiting"

> service_operation:client core is NOT in service, ffff880012898000
> 
> 
> 
> service_operation: wait_for_matching_downcall returned 0 for ffff880012898000
> service_operation orangefs_create returning: 0 for ffff880012898000

... and we've got to "serviced" somehow.

IDGI...  Are you sure that it's not a daemon replying with zero fsid?

Could you slap
        gossip_debug(GOSSIP_WAIT_DEBUG,
                     "%s: %s op:%p: process:%s state -> %d\n",
                     __func__,
                     op_name,
                     op,
                     current->comm,
		     op->op_state);
after assignments to ->op_state in set_op_state_purged() and
set_op_state_serviced() as well as after the calls of set_op_state_waiting()
(in service_operation() and orangefs_devreq_read()) and
set_op_state_inprogress() (in orangefs_devreq_read()).

Another thing: in orangefs_devreq_write_iter(), just before the
set_op_state_serviced() add
	WARN_ON(op->upcall.type == ORANGEFS_OP_VFS_CREATE &&
		!op->downcall.create.refn.fs_id);
to make sure that this crap isn't coming from the daemon.

While we are at it -
#define op_is_cancel(op)         ((op)->downcall.type == ORANGEFS_VFS_OP_CANCEL)
is checking the wrong thing; should be
#define op_is_cancel(op)         ((op)->upcall.type == ORANGEFS_VFS_OP_CANCEL)

Shouldn't be worse than a leak, though, so I doubt that it could be causing
this problem...

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-18 18:58                                                                                                                       ` Mike Marshall
  2016-02-18 19:20                                                                                                                         ` Al Viro
@ 2016-02-18 19:49                                                                                                                         ` Martin Brandenburg
  2016-02-18 20:08                                                                                                                           ` Mike Marshall
  1 sibling, 1 reply; 111+ messages in thread
From: Martin Brandenburg @ 2016-02-18 19:49 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Al Viro, Martin Brandenburg, Linus Torvalds, linux-fsdevel,
	Stephen Rothwell

On Thu, 18 Feb 2016, Mike Marshall wrote:

> Still busted, exactly the same, I think. The doomed op gets a good
> return code from is_daemon_in_service in service_operation but
> gets EAGAIN from wait_for_matching_downcall... an edge case kind of
> problem.
> 
> Here's the raw (well, slightly edited for readability) logs showing
> the doomed op and subsequent failed op that uses the bogus handle
> and fsid from the doomed op.
> 
> 
> 
> Alloced OP (ffff880012898000: 10889 OP_CREATE)
> service_operation: orangefs_create op:ffff880012898000:
> 
> 
> 
> wait_for_matching_downcall: operation purged (tag 10889, ffff880012898000, att 0
> service_operation: wait_for_matching_downcall returned -11 for ffff880012898000
> Interrupted: Removed op ffff880012898000 from htable_ops_in_progress
> tag 10889 (orangefs_create) -- operation to be retried (1 attempt)
> service_operation: orangefs_create op:ffff880012898000:
> service_operation:client core is NOT in service, ffff880012898000
> 
> 
> 
> service_operation: wait_for_matching_downcall returned 0 for ffff880012898000
> service_operation orangefs_create returning: 0 for ffff880012898000
> orangefs_create: PPTOOLS1.PPA:
> handle:00000000-0000-0000-0000-000000000000: fsid:0:
> new_op:ffff880012898000: ret:0:
> 
> 
> 
> Alloced OP (ffff880012888000: 10958 OP_GETATTR)
> service_operation: orangefs_inode_getattr op:ffff880012888000:
> service_operation: wait_for_matching_downcall returned 0 for ffff880012888000
> service_operation orangefs_inode_getattr returning: -22 for ffff880012888000
> Releasing OP (ffff880012888000: 10958
> orangefs_create: Failed to allocate inode for file :PPTOOLS1.PPA:
> Releasing OP (ffff880012898000: 10889
> 
> 
> 
> 
> What I'm testing with differs from what is at kernel.org#for-next by
>   - diffs from Al's most recent email
>   - 1 souped up gossip message
>   - changed 0 to OP_VFS_STATE_UNKNOWN one place in service_operation
>   - reinit_completion(&op->waitq) in orangefs_clean_up_interrupted_operation
> 
> 
> 

Mike,

what error do you get from userspace (i.e. from dbench)?

open("./clients/client0/~dmtmp/EXCEL/5D7C0000", O_RDWR|O_CREAT, 0600) = -1 ENODEV (No such device)

An interesting note is that I can't reproduce at all
with only one dbench process. It seems there's not
enough load.

I don't see how the kernel could return ENODEV at all.
This may be coming from our client-core.

-- Martin

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-18 19:49                                                                                                                         ` Martin Brandenburg
@ 2016-02-18 20:08                                                                                                                           ` Mike Marshall
  2016-02-18 20:22                                                                                                                             ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-18 20:08 UTC (permalink / raw)
  To: Martin Brandenburg
  Cc: Al Viro, Linus Torvalds, linux-fsdevel, Stephen Rothwell

I haven't been trussing it... it reports EINVAL to stderr... I find
the ops to look
at in the debug output by looking for the -22...

(373) open ./clients/client8/~dmtmp/PARADOX/STUDENTS.DB failed for
handle 9981 (Invalid argument)

I just got the whacky code <g> from Al's last message to compile, I'll
have results from that soon...

-Mike

On Thu, Feb 18, 2016 at 2:49 PM, Martin Brandenburg <martin@omnibond.com> wrote:
> On Thu, 18 Feb 2016, Mike Marshall wrote:
>
>> Still busted, exactly the same, I think. The doomed op gets a good
>> return code from is_daemon_in_service in service_operation but
>> gets EAGAIN from wait_for_matching_downcall... an edge case kind of
>> problem.
>>
>> Here's the raw (well, slightly edited for readability) logs showing
>> the doomed op and subsequent failed op that uses the bogus handle
>> and fsid from the doomed op.
>>
>>
>>
>> Alloced OP (ffff880012898000: 10889 OP_CREATE)
>> service_operation: orangefs_create op:ffff880012898000:
>>
>>
>>
>> wait_for_matching_downcall: operation purged (tag 10889, ffff880012898000, att 0
>> service_operation: wait_for_matching_downcall returned -11 for ffff880012898000
>> Interrupted: Removed op ffff880012898000 from htable_ops_in_progress
>> tag 10889 (orangefs_create) -- operation to be retried (1 attempt)
>> service_operation: orangefs_create op:ffff880012898000:
>> service_operation:client core is NOT in service, ffff880012898000
>>
>>
>>
>> service_operation: wait_for_matching_downcall returned 0 for ffff880012898000
>> service_operation orangefs_create returning: 0 for ffff880012898000
>> orangefs_create: PPTOOLS1.PPA:
>> handle:00000000-0000-0000-0000-000000000000: fsid:0:
>> new_op:ffff880012898000: ret:0:
>>
>>
>>
>> Alloced OP (ffff880012888000: 10958 OP_GETATTR)
>> service_operation: orangefs_inode_getattr op:ffff880012888000:
>> service_operation: wait_for_matching_downcall returned 0 for ffff880012888000
>> service_operation orangefs_inode_getattr returning: -22 for ffff880012888000
>> Releasing OP (ffff880012888000: 10958
>> orangefs_create: Failed to allocate inode for file :PPTOOLS1.PPA:
>> Releasing OP (ffff880012898000: 10889
>>
>>
>>
>>
>> What I'm testing with differs from what is at kernel.org#for-next by
>>   - diffs from Al's most recent email
>>   - 1 souped up gossip message
>>   - changed 0 to OP_VFS_STATE_UNKNOWN one place in service_operation
>>   - reinit_completion(&op->waitq) in orangefs_clean_up_interrupted_operation
>>
>>
>>
>
> Mike,
>
> what error do you get from userspace (i.e. from dbench)?
>
> open("./clients/client0/~dmtmp/EXCEL/5D7C0000", O_RDWR|O_CREAT, 0600) = -1 ENODEV (No such device)
>
> An interesting note is that I can't reproduce at all
> with only one dbench process. It seems there's not
> enough load.
>
> I don't see how the kernel could return ENODEV at all.
> This may be coming from our client-core.
>
> -- Martin

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-18 20:08                                                                                                                           ` Mike Marshall
@ 2016-02-18 20:22                                                                                                                             ` Mike Marshall
  2016-02-18 20:38                                                                                                                               ` Mike Marshall
  2016-02-18 20:49                                                                                                                               ` Al Viro
  0 siblings, 2 replies; 111+ messages in thread
From: Mike Marshall @ 2016-02-18 20:22 UTC (permalink / raw)
  To: Martin Brandenburg
  Cc: Al Viro, Linus Torvalds, linux-fsdevel, Stephen Rothwell

I haven't edited up a list of how the debug output looked,
but most importantly: the WARN_ON is hit... it appears that
the client-core is sending over fsid:0:

-Mike

On Thu, Feb 18, 2016 at 3:08 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> I haven't been trussing it... it reports EINVAL to stderr... I find
> the ops to look
> at in the debug output by looking for the -22...
>
> (373) open ./clients/client8/~dmtmp/PARADOX/STUDENTS.DB failed for
> handle 9981 (Invalid argument)
>
> I just got the whacky code <g> from Al's last message to compile, I'll
> have results from that soon...
>
> -Mike
>
> On Thu, Feb 18, 2016 at 2:49 PM, Martin Brandenburg <martin@omnibond.com> wrote:
>> On Thu, 18 Feb 2016, Mike Marshall wrote:
>>
>>> Still busted, exactly the same, I think. The doomed op gets a good
>>> return code from is_daemon_in_service in service_operation but
>>> gets EAGAIN from wait_for_matching_downcall... an edge case kind of
>>> problem.
>>>
>>> Here's the raw (well, slightly edited for readability) logs showing
>>> the doomed op and subsequent failed op that uses the bogus handle
>>> and fsid from the doomed op.
>>>
>>>
>>>
>>> Alloced OP (ffff880012898000: 10889 OP_CREATE)
>>> service_operation: orangefs_create op:ffff880012898000:
>>>
>>>
>>>
>>> wait_for_matching_downcall: operation purged (tag 10889, ffff880012898000, att 0
>>> service_operation: wait_for_matching_downcall returned -11 for ffff880012898000
>>> Interrupted: Removed op ffff880012898000 from htable_ops_in_progress
>>> tag 10889 (orangefs_create) -- operation to be retried (1 attempt)
>>> service_operation: orangefs_create op:ffff880012898000:
>>> service_operation:client core is NOT in service, ffff880012898000
>>>
>>>
>>>
>>> service_operation: wait_for_matching_downcall returned 0 for ffff880012898000
>>> service_operation orangefs_create returning: 0 for ffff880012898000
>>> orangefs_create: PPTOOLS1.PPA:
>>> handle:00000000-0000-0000-0000-000000000000: fsid:0:
>>> new_op:ffff880012898000: ret:0:
>>>
>>>
>>>
>>> Alloced OP (ffff880012888000: 10958 OP_GETATTR)
>>> service_operation: orangefs_inode_getattr op:ffff880012888000:
>>> service_operation: wait_for_matching_downcall returned 0 for ffff880012888000
>>> service_operation orangefs_inode_getattr returning: -22 for ffff880012888000
>>> Releasing OP (ffff880012888000: 10958
>>> orangefs_create: Failed to allocate inode for file :PPTOOLS1.PPA:
>>> Releasing OP (ffff880012898000: 10889
>>>
>>>
>>>
>>>
>>> What I'm testing with differs from what is at kernel.org#for-next by
>>>   - diffs from Al's most recent email
>>>   - 1 souped up gossip message
>>>   - changed 0 to OP_VFS_STATE_UNKNOWN one place in service_operation
>>>   - reinit_completion(&op->waitq) in orangefs_clean_up_interrupted_operation
>>>
>>>
>>>
>>
>> Mike,
>>
>> what error do you get from userspace (i.e. from dbench)?
>>
>> open("./clients/client0/~dmtmp/EXCEL/5D7C0000", O_RDWR|O_CREAT, 0600) = -1 ENODEV (No such device)
>>
>> An interesting note is that I can't reproduce at all
>> with only one dbench process. It seems there's not
>> enough load.
>>
>> I don't see how the kernel could return ENODEV at all.
>> This may be coming from our client-core.
>>
>> -- Martin

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-18 20:22                                                                                                                             ` Mike Marshall
@ 2016-02-18 20:38                                                                                                                               ` Mike Marshall
  2016-02-18 20:52                                                                                                                                 ` Al Viro
  2016-02-18 20:49                                                                                                                               ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-18 20:38 UTC (permalink / raw)
  To: Martin Brandenburg
  Cc: Al Viro, Linus Torvalds, linux-fsdevel, Stephen Rothwell

Yeah, it looks like the fault is entirely with the client-core...

orangefs-kernel.h:      OP_VFS_STATE_UNKNOWN = 0,
orangefs-kernel.h:      OP_VFS_STATE_WAITING = 1,
orangefs-kernel.h:      OP_VFS_STATE_INPROGR = 2,
orangefs-kernel.h:      OP_VFS_STATE_SERVICED = 4,
orangefs-kernel.h:      OP_VFS_STATE_PURGED = 8,
orangefs-kernel.h:      OP_VFS_STATE_GIVEN_UP = 16,


Alloced OP (ffff880011078000: 20210 OP_CREATE)
service_operation: orangefs_create op:ffff880011078000:
service_op: orangefs_create op:ffff880011078000: process:dbench state -> 1

orangefs_devreq_read: op:ffff880011078000: process:pvfs2-client-co state -> 2

set_op_state_purged: op:ffff880011078000: process:pvfs2-client-co state -> 10

wait_for_matching_downcall: operation purged (tag 20210, ffff880011078000, att 0
service_operation: wait_for_matching_downcall returned -11 for ffff880011078000
Interrupted: Removed op ffff880011078000 from htable_ops_in_progress
tag 20210 (orangefs_create) -- operation to be retried (1 attempt)
service_operation: orangefs_create op:ffff880011078000: process:dbench: pid:1171
service_op: orangefs_create op:ffff880011078000: process:dbench state -> 1
service_operation:client core is NOT in service, ffff880011078000

orangefs_devreq_read: op:ffff880011078000: process:pvfs2-client-co state -> 2

WARNING: CPU: 0 PID: 1216 at fs/orangefs/devorangefs-req.c:423
set_op_state_serviced: op:ffff880011078000: process:pvfs2-client-co state -> 4
service_operation: wait_for_matching_downcall returned 0 for ffff880011078000
service_operation orangefs_create returning: 0 for ffff880011078000
orangefs_create: BENCHS.LWP:
handle:00000000-0000-0000-0000-000000000000: fsid:0:
new_op:ffff880011078000: ret:0:

-Mike

On Thu, Feb 18, 2016 at 3:22 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> I haven't edited up a list of how the debug output looked,
> but most importantly: the WARN_ON is hit... it appears that
> the client-core is sending over fsid:0:
>
> -Mike
>
> On Thu, Feb 18, 2016 at 3:08 PM, Mike Marshall <hubcap@omnibond.com> wrote:
>> I haven't been trussing it... it reports EINVAL to stderr... I find
>> the ops to look
>> at in the debug output by looking for the -22...
>>
>> (373) open ./clients/client8/~dmtmp/PARADOX/STUDENTS.DB failed for
>> handle 9981 (Invalid argument)
>>
>> I just got the whacky code <g> from Al's last message to compile, I'll
>> have results from that soon...
>>
>> -Mike
>>
>> On Thu, Feb 18, 2016 at 2:49 PM, Martin Brandenburg <martin@omnibond.com> wrote:
>>> On Thu, 18 Feb 2016, Mike Marshall wrote:
>>>
>>>> Still busted, exactly the same, I think. The doomed op gets a good
>>>> return code from is_daemon_in_service in service_operation but
>>>> gets EAGAIN from wait_for_matching_downcall... an edge case kind of
>>>> problem.
>>>>
>>>> Here's the raw (well, slightly edited for readability) logs showing
>>>> the doomed op and subsequent failed op that uses the bogus handle
>>>> and fsid from the doomed op.
>>>>
>>>>
>>>>
>>>> Alloced OP (ffff880012898000: 10889 OP_CREATE)
>>>> service_operation: orangefs_create op:ffff880012898000:
>>>>
>>>>
>>>>
>>>> wait_for_matching_downcall: operation purged (tag 10889, ffff880012898000, att 0
>>>> service_operation: wait_for_matching_downcall returned -11 for ffff880012898000
>>>> Interrupted: Removed op ffff880012898000 from htable_ops_in_progress
>>>> tag 10889 (orangefs_create) -- operation to be retried (1 attempt)
>>>> service_operation: orangefs_create op:ffff880012898000:
>>>> service_operation:client core is NOT in service, ffff880012898000
>>>>
>>>>
>>>>
>>>> service_operation: wait_for_matching_downcall returned 0 for ffff880012898000
>>>> service_operation orangefs_create returning: 0 for ffff880012898000
>>>> orangefs_create: PPTOOLS1.PPA:
>>>> handle:00000000-0000-0000-0000-000000000000: fsid:0:
>>>> new_op:ffff880012898000: ret:0:
>>>>
>>>>
>>>>
>>>> Alloced OP (ffff880012888000: 10958 OP_GETATTR)
>>>> service_operation: orangefs_inode_getattr op:ffff880012888000:
>>>> service_operation: wait_for_matching_downcall returned 0 for ffff880012888000
>>>> service_operation orangefs_inode_getattr returning: -22 for ffff880012888000
>>>> Releasing OP (ffff880012888000: 10958
>>>> orangefs_create: Failed to allocate inode for file :PPTOOLS1.PPA:
>>>> Releasing OP (ffff880012898000: 10889
>>>>
>>>>
>>>>
>>>>
>>>> What I'm testing with differs from what is at kernel.org#for-next by
>>>>   - diffs from Al's most recent email
>>>>   - 1 souped up gossip message
>>>>   - changed 0 to OP_VFS_STATE_UNKNOWN one place in service_operation
>>>>   - reinit_completion(&op->waitq) in orangefs_clean_up_interrupted_operation
>>>>
>>>>
>>>>
>>>
>>> Mike,
>>>
>>> what error do you get from userspace (i.e. from dbench)?
>>>
>>> open("./clients/client0/~dmtmp/EXCEL/5D7C0000", O_RDWR|O_CREAT, 0600) = -1 ENODEV (No such device)
>>>
>>> An interesting note is that I can't reproduce at all
>>> with only one dbench process. It seems there's not
>>> enough load.
>>>
>>> I don't see how the kernel could return ENODEV at all.
>>> This may be coming from our client-core.
>>>
>>> -- Martin

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-18 20:22                                                                                                                             ` Mike Marshall
  2016-02-18 20:38                                                                                                                               ` Mike Marshall
@ 2016-02-18 20:49                                                                                                                               ` Al Viro
  1 sibling, 0 replies; 111+ messages in thread
From: Al Viro @ 2016-02-18 20:49 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Thu, Feb 18, 2016 at 03:22:33PM -0500, Mike Marshall wrote:
> I haven't edited up a list of how the debug output looked,
> but most importantly: the WARN_ON is hit... it appears that
> the client-core is sending over fsid:0:

OK, that's a bit of relief...  The next question, of course, is whether it's
a genuine reply or buggered attempt to copy it from userland and/or something
stomping on that memory.

It should've come from package_downcall_members(), right?  And there you
have this:
                if (*error_code == -PVFS_EEXIST)
                {
                    PVFS_hint hints;
                    PVFS_credential *credential;

                    fill_hints(&hints, vfs_request);

                    credential = lookup_credential(
                        vfs_request->in_upcall.uid,
                        vfs_request->in_upcall.gid);

                    /* compat */
                    refn1.handle =
                     pvfs2_khandle_to_ino(
                      &(vfs_request->in_upcall.req.create.parent_refn.khandle));
                    refn1.fs_id =
                      vfs_request->in_upcall.req.create.parent_refn.fs_id;
                    refn1.__pad1 =
                      vfs_request->in_upcall.req.create.parent_refn.__pad1;


//hubcap            vfs_request->out_downcall.resp.create.refn =
                    refn2 =
                      perform_lookup_on_create_error(

And AFAICS nothing in there sets resp.create.refn.  Is it actually set
earlier?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-18 20:38                                                                                                                               ` Mike Marshall
@ 2016-02-18 20:52                                                                                                                                 ` Al Viro
  2016-02-18 21:50                                                                                                                                   ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-18 20:52 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Thu, Feb 18, 2016 at 03:38:26PM -0500, Mike Marshall wrote:
> WARNING: CPU: 0 PID: 1216 at fs/orangefs/devorangefs-req.c:423
> set_op_state_serviced: op:ffff880011078000: process:pvfs2-client-co state -> 4
> service_operation: wait_for_matching_downcall returned 0 for ffff880011078000
> service_operation orangefs_create returning: 0 for ffff880011078000
> orangefs_create: BENCHS.LWP:
> handle:00000000-0000-0000-0000-000000000000: fsid:0:
> new_op:ffff880011078000: ret:0:

Smells like retry hitting EEXIST and package_downcall_members() treatment of
that case doesn't set create.refn at all - used to, but that code is commented
out.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-18 20:52                                                                                                                                 ` Al Viro
@ 2016-02-18 21:50                                                                                                                                   ` Mike Marshall
  2016-02-19  0:25                                                                                                                                     ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-18 21:50 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

As part of the attempt to go upstream, this "hubcap" guy you see
in the comments worked on a thing that changes 64bit userspace handles
back and forth into 128bit kernel handles... we did this because
one day, when we have orangefs3, we will be using 128bit uuid-derived
handles, and we believe it is our responsibility to not break the
upstream kernel module.
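A minimal illustration of the kind of shim Mike describes: embedding a 64-bit
userspace handle in a 128-bit kernel khandle so the conversion round-trips.
The struct layout, byte order, and function names here are hypothetical
stand-ins, not the actual OrangeFS khandle format:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit handle container; the real khandle layout differs. */
struct khandle {
	unsigned char u[16];
};

/* Embed a 64-bit handle in the low 8 bytes, little-endian. */
static struct khandle ino_to_khandle(uint64_t ino)
{
	struct khandle kh;
	int i;

	memset(&kh, 0, sizeof(kh));
	for (i = 0; i < 8; i++)
		kh.u[i] = (ino >> (8 * i)) & 0xff;
	return kh;
}

/* Recover the 64-bit handle; inverse of ino_to_khandle(). */
static uint64_t khandle_to_ino(const struct khandle *kh)
{
	uint64_t ino = 0;
	int i;

	for (i = 0; i < 8; i++)
		ino |= (uint64_t)kh->u[i] << (8 * i);
	return ino;
}
```

The property that matters is that khandle_to_ino(ino_to_khandle(x)) == x for
every 64-bit x; the bug being discussed in this thread is a path where the
128-bit side never gets filled in at all, so the kernel sees an all-zero
handle and fsid 0.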

Anywho, I bet you are right Al, he messed up this part of it...
I'll look and see if that is really so, and get it fixed.

-Mike "hubcap"

On Thu, Feb 18, 2016 at 3:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Feb 18, 2016 at 03:38:26PM -0500, Mike Marshall wrote:
>> WARNING: CPU: 0 PID: 1216 at fs/orangefs/devorangefs-req.c:423
>> set_op_state_serviced: op:ffff880011078000: process:pvfs2-client-co state -> 4
>> service_operation: wait_for_matching_downcall returned 0 for ffff880011078000
>> service_operation orangefs_create returning: 0 for ffff880011078000
>> orangefs_create: BENCHS.LWP:
>> handle:00000000-0000-0000-0000-000000000000: fsid:0:
>> new_op:ffff880011078000: ret:0:
>
> Smells like retry hitting EEXIST and package_downcall_members() treatment of
> that case doesn't set create.refn at all - used to, but that code is commented
> out.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-18 21:50                                                                                                                                   ` Mike Marshall
@ 2016-02-19  0:25                                                                                                                                     ` Al Viro
  2016-02-19 22:11                                                                                                                                       ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-19  0:25 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Thu, Feb 18, 2016 at 04:50:11PM -0500, Mike Marshall wrote:
> As part of the attempt to go upstream, this "hubcap" guy you see
> in the comments worked on a thing that changes 64bit userspace handles
> back and forth into 128bit kernel handles... we did this because
> one day, when we have orangefs3, we will be using 128bit uuid-derived
> handles, and we believe it is our responsibility to not break the
> upstream kernel module.
> 
> Anywho, I bet you are right Al, he messed up this part of it...
> I'll look and see if that is really so, and get it fixed.
> 
> -Mike "hubcap"

OK...  I'll fold the trivial braino fix (op_is_cancel() checking the wrong
thing) into "orangefs: delay freeing slot until cancel completes" where it
had been introduced, but the rest of it is probably too far and will have
to be a couple of commits on top of that queue.  Had it been just my tree,
I probably would still reorder and fold, but I know that my habits in that
respect are rather extreme.

FWIW, the scenario spotted by Martin wouldn't cause any real problems, but
only because by the time we ended copying to/from daemon service_operation()
couldn't have reached resubmit - it only happens if there had been a purge
and that can't happen while somebody is inside a control device method.

So the original code had been correct, but it was more brittle than
I'd like *and* making sure that nobody else sees an op by the time
orangefs_clean_interrupted_operation() returns is a good thing.

New logics gives that, and avoids the need to play with refcounts on ops.

I've pushed that into #orangefs-untested; if that works, please switch your
for-next to it.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-19  0:25                                                                                                                                     ` Al Viro
@ 2016-02-19 22:11                                                                                                                                       ` Mike Marshall
  2016-02-19 22:22                                                                                                                                         ` Al Viro
  2016-02-19 22:32                                                                                                                                         ` Al Viro
  0 siblings, 2 replies; 111+ messages in thread
From: Mike Marshall @ 2016-02-19 22:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel,
	Stephen Rothwell, Mike Marshall

Yay! The problem is fixed.

Boo! Now a new problem is uncovered, I don't have a handle on it yet.
Now it is possible to create a broken file on the orangefs server
across a restart of the client-core.

dbench:
(808) open ./clients/client0/~dmtmp/PWRPNT/PPTC112.TMP failed for
handle 10042 (No such file or directory)

ls -l /pvfsmnt/clients/client0/~dmtmp/PWRPNT
ls: cannot access /pvfsmnt/clients/client0/~dmtmp/PWRPNT/PPTC112.TMP:
No such file or directory
total 1364
-rw-------. 1 root root  85026 Feb 19 14:53 NEWPCB.PPT
-rw-------. 1 root root 260096 Feb 19 14:52 PCBENCHM.PPT
??????????? ? ?    ?         ?            ? PPTC112.TMP
-rw-------. 1 root root 260096 Feb 19 14:51 PPTOOLS1.PPA
-rw-------. 1 root root 260096 Feb 19 14:51 TIPS.PPT
-rw-------. 1 root root 260096 Feb 19 14:51 TRIDOTS.POT
-rw-------. 1 root root 260096 Feb 19 14:51 ZD16.BMP

The filename comes back from the server in the readdir buffer.

I can reproduce this, so I'll have to work the problem some more
to find more information. First place I'll look is the khandle
code <g>...

Anywho...

The fixed version of the client-core for the other problem is in
this SVN repository:

http://www.orangefs.org/svn/orangefs/branches/trunk.kernel.update/

As far as orangefs for-next is concerned... I don't see how to
update it without destroying the top few commit messages in the
commit history.

I plan to update the kernel.org orangefs for-next tree to look exactly
like the "current" branch of my github tree, unless someone says
not to:

github.com/hubcapsc/linux/tree/current           Latest commit c1223ca

-Mike

On Thu, Feb 18, 2016 at 7:25 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Feb 18, 2016 at 04:50:11PM -0500, Mike Marshall wrote:
>> As part of the attempt to go upstream, this "hubcap" guy you see
>> in the comments worked on a thing that changes 64bit userspace handles
>> back and forth into 128bit kernel handles... we did this because
>> one day, when we have orangefs3, we will be using 128bit uuid-derived
>> handles, and we believe it is our responsibility to not break the
>> upstream kernel module.
>>
>> Anywho, I bet you are right Al, he messed up this part of it...
>> I'll look and see if that is really so, and get it fixed.
>>
>> -Mike "hubcap"
>
> OK...  I'll fold the trivial braino fix (op_is_cancel() checking the wrong
> thing) into "orangefs: delay freeing slot until cancel completes" where it
> had been introduced, but the rest of it is probably too far and will have
> to be a couple of commits on top of that queue.  Had it been just my tree,
> I probably would still reorder and fold, but I know that my habits in that
> respect are rather extreme.
>
> FWIW, the scenario spotted by Martin wouldn't cause any real problems, but
> only because by the time we ended copying to/from daemon service_operation()
> couldn't have reached resubmit - it only happens if there had been a purge
> and that can't happen while somebody is inside a control device method.
>
> So the original code had been correct, but it was more brittle than
> I'd like *and* making sure that nobody else sees an op by the time
> orangefs_clean_interrupted_operation() returns is a good thing.
>
> New logics gives that, and avoids the need to play with refcounts on ops.
>
> I've pushed that into #orangefs-untested; if that works, please switch your
> for-next to it.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-19 22:11                                                                                                                                       ` Mike Marshall
@ 2016-02-19 22:22                                                                                                                                         ` Al Viro
  2016-02-20 12:14                                                                                                                                           ` Mike Marshall
  2016-02-19 22:32                                                                                                                                         ` Al Viro
  1 sibling, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-19 22:22 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Fri, Feb 19, 2016 at 05:11:29PM -0500, Mike Marshall wrote:

> I plan to update the kernel.org orangefs for-next tree to look exactly
> like the "current" branch of my github tree, unless someone says
> not to:
> 
> github.com/hubcapsc/linux/tree/current           Latest commit c1223ca

$ git checkout current
$ git fetch git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git orangefs-untested
$ git diff FETCH_HEAD # should report no differences
$ git reset --hard FETCH_HEAD
$ git push --force

then push the same branch into your kernel.org (as for-next, again with -force).

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-19 22:11                                                                                                                                       ` Mike Marshall
  2016-02-19 22:22                                                                                                                                         ` Al Viro
@ 2016-02-19 22:32                                                                                                                                         ` Al Viro
  2016-02-19 22:45                                                                                                                                           ` Martin Brandenburg
  2016-02-19 22:50                                                                                                                                           ` Martin Brandenburg
  1 sibling, 2 replies; 111+ messages in thread
From: Al Viro @ 2016-02-19 22:32 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Fri, Feb 19, 2016 at 05:11:29PM -0500, Mike Marshall wrote:

> Boo! Now a new problem is uncovered, I don't have a handle on it yet.
> Now it is possible to create a broken file on the orangefs server
> across a restart of the client-core.

I suspect that it's your "getattr after create and leave dentry negative if
that getattr fails".  Might make sense to d_drop() the sucker in case of
such late failure - or mark it so that subsequent d_revalidate() would
*not* skip getattr, despite NULL ->d_inode.

Incidentally, why does your ->d_revalidate() bother with d_drop()?  Just
have it return 0 and let the caller DTRT...
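A toy model of the contract Al is pointing at (all names here are
illustrative, not the kernel's): when ->d_revalidate() returns 0, the caller
in fs/namei.c unhashes the dentry itself, so the method has no need to call
d_drop() on its own:

```c
#include <stdbool.h>

/* Stand-in for a dentry: just the two bits this sketch cares about. */
struct toy_dentry {
	bool hashed;   /* still reachable via lookup */
	bool valid;    /* still matches what the server has */
};

/* Filesystem's revalidate: simply report validity, no d_drop() needed. */
static int toy_revalidate(struct toy_dentry *d)
{
	return d->valid ? 1 : 0;
}

/* Caller side, modeling fs/namei.c: on 0 it drops the dentry itself. */
static void toy_lookup_revalidate(struct toy_dentry *d)
{
	if (toy_revalidate(d) == 0)
		d->hashed = false;
}
```

The point is that the invalidation policy lives in one place (the caller),
so individual filesystems can keep their ->d_revalidate() methods trivial.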

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-19 22:32                                                                                                                                         ` Al Viro
@ 2016-02-19 22:45                                                                                                                                           ` Martin Brandenburg
  2016-02-19 22:50                                                                                                                                           ` Martin Brandenburg
  1 sibling, 0 replies; 111+ messages in thread
From: Martin Brandenburg @ 2016-02-19 22:45 UTC (permalink / raw)
  To: Al Viro; +Cc: Mike Marshall, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Fri, 19 Feb 2016, Al Viro wrote:

> On Fri, Feb 19, 2016 at 05:11:29PM -0500, Mike Marshall wrote:
> 
> > Boo! Now a new problem is uncovered, I don't have a handle on it yet.
> > Now it is possible to create a broken file on the orangefs server
> > across a restart of the client-core.
> 
> I suspect that it's your "getattr after create and leave dentry negative if
> that getattr fails".  Might make sense to d_drop() the sucker in case of
> such late failure - or mark it so that subsequent d_revalidate() would
> *not* skip getattr, despite NULL ->d_inode.
> 
> Incidentally, why does your ->d_revalidate() bother with d_drop()?  Just
> have it return 0 and let the caller DTRT...
> 

Because I recently worked on it and didn't know that was
desirable. *oops*

I see what fs/namei.c does now.

I suppose you see the request I just sent out for review
of that code.

-- Martin

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-19 22:32                                                                                                                                         ` Al Viro
  2016-02-19 22:45                                                                                                                                           ` Martin Brandenburg
@ 2016-02-19 22:50                                                                                                                                           ` Martin Brandenburg
  1 sibling, 0 replies; 111+ messages in thread
From: Martin Brandenburg @ 2016-02-19 22:50 UTC (permalink / raw)
  To: Al Viro; +Cc: Mike Marshall, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Fri, 19 Feb 2016, Al Viro wrote:

> On Fri, Feb 19, 2016 at 05:11:29PM -0500, Mike Marshall wrote:
> 
> > Boo! Now a new problem is uncovered, I don't have a handle on it yet.
> > Now it is possible to create a broken file on the orangefs server
> > across a restart of the client-core.
> 
> I suspect that it's your "getattr after create and leave dentry negative if
> that getattr fails".  Might make sense to d_drop() the sucker in case of
> such late failure - or mark it so that subsequent d_revalidate() would
> *not* skip getattr, despite NULL ->d_inode.
> 
> Incidentally, why does your ->d_revalidate() bother with d_drop()?  Just
> have it return 0 and let the caller DTRT...
> 

However, I'm not so sure the kernel is at fault here. Using
a userspace tool that just opens a socket to the server, we
see that orangefs-readdir lists the file while orangefs-stat
returns ENOENT.

Looks like the server's database is corrupt.

-- Martin

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-19 22:22                                                                                                                                         ` Al Viro
@ 2016-02-20 12:14                                                                                                                                           ` Mike Marshall
  2016-02-20 13:36                                                                                                                                             ` Al Viro
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-20 12:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

Hi Al...

There's something I'll be thinking about all weekend (while my friend
Stanley the grader helps me distribute 40 tons of gravel)...

Your orangefs-untested branch has 5625087 commits. My "current" branch
has 5625087 commits. In each all of the commit signatures match, except
for the most recent 15 commits. The last 15 commits in my "current"
branch were made from your orangefs-untested branch with "git format-patch"
and applied to my "current" branch with "git am -s". "git log -p" shows that
my most recent 15 commits differ from your most recent 15 commits by
the addition of my "sign off" line.

I will absolutely update my kernel.org for-next branch with the procedure you
outlined, because you said so.

I wish I understood it better, though... I can only guess at this point that
the procedure you outlined will do some desirable thing to git metadata...?

-Mike

On Fri, Feb 19, 2016 at 5:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Feb 19, 2016 at 05:11:29PM -0500, Mike Marshall wrote:
>
>> I plan to update the kernel.org orangefs for-next tree to look exactly
>> like the "current" branch of my github tree, unless someone says
>> not to:
>>
>> github.com/hubcapsc/linux/tree/current           Latest commit c1223ca
>
> $ git checkout current
> $ git fetch git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git orangefs-untested
> $ git diff FETCH_HEAD # should report no differences
> $ git reset --hard FETCH_HEAD
> $ git push --force
>
> then push the same branch into your kernel.org (as for-next, again with -force).

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-20 12:14                                                                                                                                           ` Mike Marshall
@ 2016-02-20 13:36                                                                                                                                             ` Al Viro
  2016-02-22 16:20                                                                                                                                               ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Al Viro @ 2016-02-20 13:36 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

On Sat, Feb 20, 2016 at 07:14:26AM -0500, Mike Marshall wrote:

> Your orangefs-untested branch has 5625087 commits. My "current" branch
> has 5625087 commits. In each all of the commit signatures match, except
> for the most recent 15 commits. The last 15 commits in my "current"
> branch were made from your orangefs-untested branch with "git format-patch"
> and applied to my "current" branch with "git am -s". "git log -p" shows that
> my most recent 15 commits differ from your most recent 15 commits by
> the addition of my "sign off" line.

*blinks*
*checks*

OK, ignore what I asked, then.  Looks like I'd screwed up checking last time.

> I will absolutely update my kernel.org for-next branch with the procedure you
> outlined, because you said so.
> 
> I wish I understood it better, though... I can only guess at this point that
> the procedure you outlined will do some desirable thing to git metadata...?

None whatsoever, ignore it.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-20 13:36                                                                                                                                             ` Al Viro
@ 2016-02-22 16:20                                                                                                                                               ` Mike Marshall
  2016-02-22 21:22                                                                                                                                                 ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-22 16:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel,
	Stephen Rothwell, Mike Marshall

 > Looks like I'd screwed up checking last time.

Probably not that <g>... my branch did diverge over the course
of the few days that we were thrashing around in the kernel trying
to fix what I had broken two years ago in userspace.

I can relate to why you were motivated to remove the thrashing
around from the git history, but your git-foo is much stronger
than mine. I wanted to try and get my branch back into line using
a methodology that I understand to keep from ending up like
this fellow:

http://myweb.clemson.edu/~hubcap/harris.jpg

I'm glad it worked out... my kernel.org for-next branch is updated now.

so, I'll keep working the problem, using your d_drop idea first off...
I'll be back with more information, and hopefully even have it fixed, soon...

-Mike

On Sat, Feb 20, 2016 at 8:36 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Feb 20, 2016 at 07:14:26AM -0500, Mike Marshall wrote:
>
>> Your orangefs-untested branch has 5625087 commits. My "current" branch
>> has 5625087 commits. In each all of the commit signatures match, except
>> for the most recent 15 commits. The last 15 commits in my "current"
>> branch were made from your orangefs-untested branch with "git format-patch"
>> and applied to my "current" branch with "git am -s". "git log -p" shows that
>> my most recent 15 commits differ from your most recent 15 commits by
>> the addition of my "sign off" line.
>
> *blinks*
> *checks*
>
> OK, ignore what I asked, then.  Looks like I'd screwed up checking last time.
>
>> I will absolutely update my kernel.org for-next branch with the procedure you
>> outlined, because you said so.
>>
>> I wish I understood it better, though... I can only guess at this point that
>> the procedure you outlined will do some desirable thing to git metadata...?
>
> None whatsoever, ignore it.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Orangefs ABI documentation
  2016-02-22 16:20                                                                                                                                               ` Mike Marshall
@ 2016-02-22 21:22                                                                                                                                                 ` Mike Marshall
  2016-02-23 21:58                                                                                                                                                   ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-22 21:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel,
	Stephen Rothwell, Mike Marshall

I did this and the problem seems fixed:

# git diff
diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
index b3ae374..249bda5 100644
--- a/fs/orangefs/namei.c
+++ b/fs/orangefs/namei.c
@@ -61,6 +61,7 @@ static int orangefs_create(struct inode *dir,
                           __func__,
                           dentry->d_name.name);
                ret = PTR_ERR(inode);
+               d_drop(dentry);
                goto out;
        }

Of course, this has uncovered yet another reproducible problem:

710055 orangefs_unlink: called on PPTB1E4.TMP
710058 service_operation: orangefs_unlink ffff880014828000

                right in here I think the rm is
                being processed in the server just
                as the client-core has died.

710534 wait_for_matching_downcall: operation purged ffff880014828000
710538 service_operation: orangefs_unlink ffff880014828000
710539 service_operation:client core is NOT in service

                right in here I think stuff starts
                working again and we're going
                to unsuccessfully try to process
                the rm again.

710646 wait_for_matching_downcall returned 0 for ffff880014828000

                happy, because we got the matching downcall

710647 service_operation orangefs_unlink returning -2 for ffff880014828000
710648 orangefs_unlink: service_operation returned -2

                sad, because we got ENOENT on second rm

710649 Releasing OP ffff880014828000

                so... the userspace process (dbench in this case) thinks
                the rm failed, but it didn't.



On Mon, Feb 22, 2016 at 11:20 AM, Mike Marshall <hubcap@omnibond.com> wrote:
>  > Looks like I'd screwed up checking last time.
>
> Probably not that <g>... my branch did diverge over the course
> of the few days that we were thrashing around in the kernel trying
> to fix what I had broken two years ago in userspace.
>
> I can relate to why you were motivated to remove the thrashing
> around from the git history, but your git-foo is much stronger
> than mine. I wanted to try and get my branch back into line using
> a methodology that I understand to keep from ending up like
> this fellow:
>
> http://myweb.clemson.edu/~hubcap/harris.jpg
>
> I'm glad it worked out... my kernel.org for-next branch is updated now.
>
> so, I'll keep working the problem, using your d_drop idea first off...
> I'll be back with more information, and hopefully even have it fixed, soon...
>
> -Mike
>
> On Sat, Feb 20, 2016 at 8:36 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Sat, Feb 20, 2016 at 07:14:26AM -0500, Mike Marshall wrote:
>>
>>> Your orangefs-untested branch has 5625087 commits. My "current" branch
>>> has 5625087 commits. In each all of the commit signatures match, except
>>> for the most recent 15 commits. The last 15 commits in my "current"
>>> branch were made from your orangefs-untested branch with "git format-patch"
>>> and applied to my "current" branch with "git am -s". "git log -p" shows that
>>> my most recent 15 commits differ from your most recent 15 commits by
>>> the addition of my "sign off" line.
>>
>> *blinks*
>> *checks*
>>
>> OK, ignore what I asked, then.  Looks like I'd screwed up checking last time.
>>
>>> I will absolutely update my kernel.org for-next branch with the procedure you
>>> outlined, because you said so.
>>>
>>> I wish I understood it better, though... I can only guess at this point that
>>> the procedure you outlined will do some desirable thing to git metadata...?
>>
>> None whatsoever, ignore it.


* Re: Orangefs ABI documentation
  2016-02-22 21:22                                                                                                                                                 ` Mike Marshall
@ 2016-02-23 21:58                                                                                                                                                   ` Mike Marshall
  2016-02-26 20:21                                                                                                                                                     ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-02-23 21:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel,
	Stephen Rothwell, Mike Marshall

Ok, I understand these last couple of problems better.

If the client-core crashes (kill -9 in my test cases) in
the middle of a rename or an unlink (and maybe some other
operations, these are the ones I have captured and studied) a
couple of things can happen.

In both cases, you get:

  service_operation
    queue the operation
    wait_for_matching_downcall = -EAGAIN
    queue the operation
    wait_for_matching_downcall = 0
    out:

Sometimes when the operation is first queued, the client-core
will be in the middle of the state machine code and the operation
will be half done when the client-core dies, and the object that
was being operated on will be broken. In other words, it is possible
for a userspace program using the Orangefs native API to corrupt
the filesystem if it crashes in a critical area. Do other userspace
filesystems have this same problem?

Other times, when the operation is first queued, the client-core
gets the operation fully launched, and then dies. Then, when
the operation is queued up again, the operation fails on -ENOENT.
You can't rename a to b if a has already been renamed to b. You can't
unlink a if there is no a.

For the first case I don't see how there's anything that can be done.
The filesystem is corrupted. It's not toast or anything, but there's
a directory somewhere with a broken file in it.

I have made a patch that appears to actually work and cause no bad
side effects for the second case. Al has some colorful phrases,
like "this is too ugly to live" and some others. What do you think about
this patch, Al <g>...

The d_drop is how I implemented the idea you had at first, Al, I'm
not sure now if it helps or hurts or is a no-op.

# git --no-pager diff
diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
index b3ae374..6d953b1 100644
--- a/fs/orangefs/namei.c
+++ b/fs/orangefs/namei.c
@@ -61,6 +61,7 @@ static int orangefs_create(struct inode *dir,
 			   __func__,
 			   dentry->d_name.name);
 		ret = PTR_ERR(inode);
+		d_drop(dentry);
 		goto out;
 	}
 
@@ -246,12 +247,22 @@ static int orangefs_unlink(struct inode *dir, struct dentry *dentry)
 
 	op_release(new_op);
 
-	if (!ret) {
+	/*
+	 * We would never have gotten here if the object didn't
+	 * exist when we started down this path. There's a race
+	 * condition where if a restart of the client-core
+	 * coincides just right with an in-progress unlink a
+	 * file can get deleted on the server and be gone
+	 * when service-operation does the retry...
+	 */
+	if ((!ret) || (ret == -ENOENT)) {
 		drop_nlink(inode);
 
 		SetMtimeFlag(parent);
 		dir->i_mtime = dir->i_ctime = current_fs_time(dir->i_sb);
 		mark_inode_dirty_sync(dir);
+
+		ret = 0;
 	}
 	return ret;
 }
@@ -433,6 +444,17 @@ static int orangefs_rename(struct inode *old_dir,
 		     "orangefs_rename: got downcall status %d\n",
 		     ret);
 
+	/*
+	 * We would never have gotten here if the object didn't
+	 * exist when we started down this path. There's a race
+	 * condition where if a restart of the client-core
+	 * coincides just right with an in-progress rename a
+	 * file can get renamed on the server and be gone
+	 * when service-operation does the retry...
+	 */
+	if (ret == -ENOENT)
+		ret = 0;
+
 	if (new_dentry->d_inode)
 		new_dentry->d_inode->i_ctime = CURRENT_TIME;


On Mon, Feb 22, 2016 at 4:22 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> I did this and the problem seems fixed:
>
> # git diff
> diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
> index b3ae374..249bda5 100644
> --- a/fs/orangefs/namei.c
> +++ b/fs/orangefs/namei.c
> @@ -61,6 +61,7 @@ static int orangefs_create(struct inode *dir,
>                            __func__,
>                            dentry->d_name.name);
>                 ret = PTR_ERR(inode);
> +               d_drop(dentry);
>                 goto out;
>         }
>
> Of course, this has uncovered yet another reproducible problem:
>
> 710055 orangefs_unlink: called on PPTB1E4.TMP
> 710058 service_operation: orangefs_unlink ffff880014828000
>
>                 right in here I think the rm is
>                 being processed in the server just
>                 as the client-core has died.
>
> 710534 wait_for_matching_downcall: operation purged ffff880014828000
> 710538 service_operation: orangefs_unlink ffff880014828000
> 710539 service_operation:client core is NOT in service
>
>                 right in here I think stuff starts
>                 working again and we're going
>                 to unsuccessfully try to process
>                 the rm again.
>
> 710646 wait_for_matching_downcall returned 0 for ffff880014828000
>
>                 happy, because we got the matching downcall
>
> 710647 service_operation orangefs_unlink returning -2 for ffff880014828000
> 710648 orangefs_unlink: service_operation returned -2
>
>                 sad, because we got ENOENT on second rm
>
> 710649 Releasing OP ffff880014828000
>
>                 so... the userspace process (dbench in this case) thinks
>                 the rm failed, but it didn't.
>
>
>
> On Mon, Feb 22, 2016 at 11:20 AM, Mike Marshall <hubcap@omnibond.com> wrote:
>>  > Looks like I'd screwed up checking last time.
>>
>> Probably not that <g>... my branch did diverge over the course
>> of the few days that we were thrashing around in the kernel trying
>> to fix what I had broken two years ago in userspace.
>>
>> I can relate to why you were motivated to remove the thrashing
>> around from the git history, but your git-foo is much stronger
>> than mine. I wanted to try and get my branch back into line using
>> a methodology that I understand to keep from ending up like
>> this fellow:
>>
>> http://myweb.clemson.edu/~hubcap/harris.jpg
>>
>> I'm glad it worked out... my kernel.org for-next branch is updated now.
>>
>> so, I'll keep working the problem, using your d_drop idea first off...
>> I'll be back with more information, and hopefully even have it fixed, soon...
>>
>> -Mike
>>
>> On Sat, Feb 20, 2016 at 8:36 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>> On Sat, Feb 20, 2016 at 07:14:26AM -0500, Mike Marshall wrote:
>>>
>>>> Your orangefs-untested branch has 5625087 commits. My "current" branch
>>>> has 5625087 commits. In each all of the commit signatures match, except
>>>> for the most recent 15 commits. The last 15 commits in my "current"
>>>> branch were made from your orangefs-untested branch with "git format-patch"
>>>> and applied to my "current" branch with "git am -s". "git log -p" shows that
>>>> my most recent 15 commits differ from your most recent 15 commits by
>>>> the addition of my "sign off" line.
>>>
>>> *blinks*
>>> *checks*
>>>
>>> OK, ignore what I asked, then.  Looks like I'd screwed up checking last time.
>>>
>>>> I will absolutely update my kernel.org for-next branch with the procedure you
>>>> outlined, because you said so.
>>>>
>>>> I wish I understood it better, though... I can only guess at this point that
>>>> the procedure you outlined will do some desirable thing to git metadata...?
>>>
>>> None whatsoever, ignore it.


* Re: Orangefs ABI documentation
  2016-02-23 21:58                                                                                                                                                   ` Mike Marshall
@ 2016-02-26 20:21                                                                                                                                                     ` Mike Marshall
  0 siblings, 0 replies; 111+ messages in thread
From: Mike Marshall @ 2016-02-26 20:21 UTC (permalink / raw)
  To: Al Viro
  Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel,
	Stephen Rothwell, Mike Marshall

I have updated orangefs.txt to reflect all the work Al has done
recently. There are also several other cosmetic and comment
updates in the code, along with the time patches that Arnd
Bergmann sent today.

I would appreciate any corrections or comments on the new parts
in orangefs.txt,

thanks!

git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
for-next

-Mike

On Tue, Feb 23, 2016 at 4:58 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Ok, I understand these last couple of problems better.
>
> If the client-core crashes (kill -9 in my test cases) in
> the middle of a rename or an unlink (and maybe some other
> operations, these are the ones I have captured and studied) a
> couple of things can happen.
>
> In both cases, you get:
>
>   service_operation
>     queue the operation
>     wait_for_matching_downcall = -EAGAIN
>     queue the operation
>     wait_for_matching_downcall = 0
>     out:
>
> Sometimes when the operation is first queued, the client-core
> will be in the middle of the state machine code and the operation
> will be half done when the client-core dies, and the object that
> was being operated on will be broken. In other words, it is possible
> for a userspace program using the Orangefs native API to corrupt
> the filesystem if it crashes in a critical area. Do other userspace
> filesystems have this same problem?
>
> Other times, when the operation is first queued, the client-core
> gets the operation fully launched, and then dies. Then, when
> the operation is queued up again, the operation fails on -ENOENT.
> You can't rename a to b if a has already been renamed to b. You can't
> unlink a if there is no a.
>
> For the first case I don't see how there's anything that can be done.
> The filesystem is corrupted. It's not toast or anything, but there's
> a directory somewhere with a broken file in it.
>
> I have made a patch that appears to actually work and cause no bad
> side effects for the second case. Al has some colorful phrases,
> like "this is too ugly to live" and some others. What do you think about
> this patch, Al <g>...
>
> The d_drop is how I implemented the idea you had at first, Al, I'm
> not sure now if it helps or hurts or is a no-op.
>
> # git --no-pager diff
> diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
> index b3ae374..6d953b1 100644
> --- a/fs/orangefs/namei.c
> +++ b/fs/orangefs/namei.c
> @@ -61,6 +61,7 @@ static int orangefs_create(struct inode *dir,
>     __func__,
>     dentry->d_name.name);
>   ret = PTR_ERR(inode);
> + d_drop(dentry);
>   goto out;
>   }
>
> @@ -246,12 +247,22 @@ static int orangefs_unlink(struct inode *dir,
> struct dentry *dentry)
>
>   op_release(new_op);
>
> - if (!ret) {
> + /*
> + * We would never have gotten here if the object didn't
> + * exist when we started down this path. There's a race
> + * condition where if a restart of the client-core
> + * coincides just right with an in-progress unlink a
> + * file can get deleted on the server and be gone
> + * when service-operation does the retry...
> + */
> + if ((!ret) || (ret == -ENOENT)) {
>   drop_nlink(inode);
>
>   SetMtimeFlag(parent);
>   dir->i_mtime = dir->i_ctime = current_fs_time(dir->i_sb);
>   mark_inode_dirty_sync(dir);
> +
> + ret = 0;
>   }
>   return ret;
>  }
> @@ -433,6 +444,17 @@ static int orangefs_rename(struct inode *old_dir,
>       "orangefs_rename: got downcall status %d\n",
>       ret);
>
> + /*
> + * We would never have gotten here if the object didn't
> + * exist when we started down this path. There's a race
> + * condition where if a restart of the client-core
> + * coincides just right with an in-progress rename a
> + * file can get renamed on the server and be gone
> + * when service-operation does the retry...
> + */
> + if (ret == -ENOENT)
> + ret = 0;
> +
>   if (new_dentry->d_inode)
>   new_dentry->d_inode->i_ctime = CURRENT_TIME;
>
>
> On Mon, Feb 22, 2016 at 4:22 PM, Mike Marshall <hubcap@omnibond.com> wrote:
>> I did this and the problem seems fixed:
>>
>> # git diff
>> diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
>> index b3ae374..249bda5 100644
>> --- a/fs/orangefs/namei.c
>> +++ b/fs/orangefs/namei.c
>> @@ -61,6 +61,7 @@ static int orangefs_create(struct inode *dir,
>>                            __func__,
>>                            dentry->d_name.name);
>>                 ret = PTR_ERR(inode);
>> +               d_drop(dentry);
>>                 goto out;
>>         }
>>
>> Of course, this has uncovered yet another reproducible problem:
>>
>> 710055 orangefs_unlink: called on PPTB1E4.TMP
>> 710058 service_operation: orangefs_unlink ffff880014828000
>>
>>                 right in here I think the rm is
>>                 being processed in the server just
>>                 as the client-core has died.
>>
>> 710534 wait_for_matching_downcall: operation purged ffff880014828000
>> 710538 service_operation: orangefs_unlink ffff880014828000
>> 710539 service_operation:client core is NOT in service
>>
>>                 right in here I think stuff starts
>>                 working again and we're going
>>                 to unsuccessfully try to process
>>                 the rm again.
>>
>> 710646 wait_for_matching_downcall returned 0 for ffff880014828000
>>
>>                 happy, because we got the matching downcall
>>
>> 710647 service_operation orangefs_unlink returning -2 for ffff880014828000
>> 710648 orangefs_unlink: service_operation returned -2
>>
>>                 sad, because we got ENOENT on second rm
>>
>> 710649 Releasing OP ffff880014828000
>>
>>                 so... the userspace process (dbench in this case) thinks
>>                 the rm failed, but it didn't.
>>
>>
>>
>> On Mon, Feb 22, 2016 at 11:20 AM, Mike Marshall <hubcap@omnibond.com> wrote:
>>>  > Looks like I'd screwed up checking last time.
>>>
>>> Probably not that <g>... my branch did diverge over the course
>>> of the few days that we were thrashing around in the kernel trying
>>> to fix what I had broken two years ago in userspace.
>>>
>>> I can relate to why you were motivated to remove the thrashing
>>> around from the git history, but your git-foo is much stronger
>>> than mine. I wanted to try and get my branch back into line using
>>> a methodology that I understand to keep from ending up like
>>> this fellow:
>>>
>>> http://myweb.clemson.edu/~hubcap/harris.jpg
>>>
>>> I'm glad it worked out... my kernel.org for-next branch is updated now.
>>>
>>> so, I'll keep working the problem, using your d_drop idea first off...
>>> I'll be back with more information, and hopefully even have it fixed, soon...
>>>
>>> -Mike
>>>
>>> On Sat, Feb 20, 2016 at 8:36 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>> On Sat, Feb 20, 2016 at 07:14:26AM -0500, Mike Marshall wrote:
>>>>
>>>>> Your orangefs-untested branch has 5625087 commits. My "current" branch
>>>>> has 5625087 commits. In each all of the commit signatures match, except
>>>>> for the most recent 15 commits. The last 15 commits in my "current"
>>>>> branch were made from your orangefs-untested branch with "git format-patch"
>>>>> and applied to my "current" branch with "git am -s". "git log -p" shows that
>>>>> my most recent 15 commits differ from your most recent 15 commits by
>>>>> the addition of my "sign off" line.
>>>>
>>>> *blinks*
>>>> *checks*
>>>>
>>>> OK, ignore what I asked, then.  Looks like I'd screwed up checking last time.
>>>>
>>>>> I will absolutely update my kernel.org for-next branch with the procedure you
>>>>> outlined, because you said so.
>>>>>
>>>>> I wish I understood it better, though... I can only guess at this point that
>>>>> the procedure you outlined will do some desirable thing to git metadata...?
>>>>
>>>> None whatsoever, ignore it.


* Re: write() semantics (Re: Orangefs ABI documentation)
  2016-01-23 23:35                                     ` Linus Torvalds
@ 2016-03-03 22:25                                       ` Mike Marshall
  2016-03-04 20:55                                         ` Mike Marshall
  0 siblings, 1 reply; 111+ messages in thread
From: Mike Marshall @ 2016-03-03 22:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Mike Marshall

Here is what I have come up with to try and make what we return
for interrupted writes more acceptable... in my tests it seems to
work. My test involved running the client-core with a tiny IO
buffer size (4k) and a C program with a large write buffer (32M).
That way there was plenty of time for me to fire off the C program
and hit Ctrl-C while the write was chugging along.

The return value from do_readv_writev always matches the size
of the file written by the aborted C program in my tests.

I changed the C program around so that sometimes I ran it with
(O_CREAT | O_RDWR) on open and sometimes with (O_CREAT | O_RDWR | O_APPEND)
and it seemed to do the right thing. I didn't try to set up a
signal handler so that the signal wasn't fatal to the process,
I guess that would be a test to actually see and verify the correct
short return code to write...

Do you all think this looks like it should work in principle?

BTW: in the distant past someone else attempted to solve this problem
the "nfs intr" way - we have an intr mount option, and that's why there's
all that sweating over whether or not stuff is "interruptible" in
waitqueue.c... I'm not sure if our intr mount option is relevant anymore
given the way the op handling code has evolved...

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index 6f2e0f7..4349c9d 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -180,21 +180,54 @@ populate_shared_memory:
 	}
 
 	if (ret < 0) {
-		/*
-		 * don't write an error to syslog on signaled operation
-		 * termination unless we've got debugging turned on, as
-		 * this can happen regularly (i.e. ctrl-c)
-		 */
-		if (ret == -EINTR)
+		if (ret == -EINTR) {
+			/*
+			 * We can't return EINTR if any data was written,
+			 * it's not POSIX. It is minimally acceptable
+			 * to give a partial write, the way NFS does.
+			 *
+			 * It would be optimal to return all or nothing,
+			 * but if a userspace write is bigger than
+			 * an IO buffer, and the interrupt occurs
+			 * between buffer writes, that would not be
+			 * possible.
+			 */
+			switch (new_op->op_state - OP_VFS_STATE_GIVEN_UP) {
+			/*
+			 * If the op was waiting when the interrupt
+			 * occurred, then the client-core did not
+			 * trigger the write.
+			 */
+			case OP_VFS_STATE_WAITING:
+				ret = 0;
+				break;
+			/*
+			 * If the op was in progress when the interrupt
+			 * occurred, then the client-core was able to
+			 * trigger the write.
+			 */
+			case OP_VFS_STATE_INPROGR:
+				ret = total_size;
+				break;
+			default:
+				gossip_err("%s: unexpected op state :%d:.\n",
+					   __func__,
+					   new_op->op_state);
+				ret = 0;
+				break;
+			}
 			gossip_debug(GOSSIP_FILE_DEBUG,
-				     "%s: returning error %ld\n", __func__,
-				     (long)ret);
-		else
+				     "%s: got EINTR, state:%d: %p\n",
+				     __func__,
+				     new_op->op_state,
+				     new_op);
+		} else {
 			gossip_err("%s: error in %s handle %pU, returning %zd\n",
 				   __func__,
 				   type == ORANGEFS_IO_READ ?
 				   "read from" : "write to",
 				   handle, ret);
+		}
 	if (orangefs_cancel_op_in_progress(new_op))
 		return ret;


On Sat, Jan 23, 2016 at 6:35 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sat, Jan 23, 2016 at 2:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> What should we get?  -EINTR, despite having written some data?
>
> No, that's not acceptable.
>
> Either all or nothing (which is POSIX) or the NFS 'intr' mount
> behavior (partial write return, -EINTR only when nothing was written
> at all). And, like NFS, a mount option might be a good thing.
>
> And of course, for the usual reasons, fatal signals are special in
> that for them we generally say "screw posix, nobody sees the return
> value anyway", but even there the filesystem might as well still
> return the partial return value (just to not introduce yet another
> special case).
>
> In fact, I think that with our "fatal signals interrupt" behavior,
> nobody should likely use the "intr" mount option on NFS. Even if the
> semantics may be "better", there are likely simply just too many
> programs that don't check the return value of "write()" at all, much
> less handle partial writes correctly.
>
> (And yes, our "screw posix" behavior wrt fatal signals is strictly
> wrong even _despite_ the fact that nobody sees the return value -
> other processes can still obviously see that the whole write wasn't
> done. But blocking on a fatal signal is _so_ annoying that it's one of
> those things where we just say "posix was wrong on this one, and if we
> squint a bit we look _almost_ like we're compliant").
>
>               Linus

^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: write() semantics (Re: Orangefs ABI documentation)
  2016-03-03 22:25                                       ` Mike Marshall
@ 2016-03-04 20:55                                         ` Mike Marshall
  0 siblings, 0 replies; 111+ messages in thread
From: Mike Marshall @ 2016-03-04 20:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Mike Marshall

I added a signal handler to my test program today, and now I see the
giant difference between wait_for_completion_interruptible_timeout,
which is what you get when you use the -intr mount option, and
wait_for_completion_killable_timeout, which is what you get when
you don't, so I retract what I said about the -intr mount option
not being relevant <g>...

It also seems like it would be easy (and correct) to modify the
patch a little so that -EINTR would still be returned on the off-chance
that the interrupt was caught before any data was written...

-Mike

On Thu, Mar 3, 2016 at 5:25 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Here is what I have come up with to try and make our return to
> interrupted writes more acceptable... in my tests it seems to
> work. My test involved running the client-core with a tiny IO
> buffer size (4k) and a C program with a large write buffer(32M).
> That way there was plenty of time for me to fire off the C program
> and hit Ctrl-C while the write was chugging along.
>
> The return value from do_readv_writev always matches the size
> of the file written by the aborted C program in my tests.
>
> I changed the C program around so that sometimes I ran it with
> (O_CREAT | O_RDWR) on open and sometimes with (O_CREAT | O_RDWR | O_APPEND)
> and it seemed to do the right thing. I didn't try to set up a
> signal handler so that the signal wasn't fatal to the process,
> I guess that would be a test to actually see and verify the correct
> short return code to write...
>
> Do you all think this looks like it should work in principle?
>
> BTW: in the distant past someone else attempted to solve this problem
> the "nfs intr" way - we have an intr mount option, and that's why there's
> all that sweating over whether or not stuff is "interruptible" in
> waitqueue.c... I'm not sure if our intr mount option is relevant anymore
> given the way the op handling code has evolved...
>
> diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
> index 6f2e0f7..4349c9d 100644
> --- a/fs/orangefs/file.c
> +++ b/fs/orangefs/file.c
> @@ -180,21 +180,54 @@ populate_shared_memory:
>   }
>
>   if (ret < 0) {
> - /*
> - * don't write an error to syslog on signaled operation
> - * termination unless we've got debugging turned on, as
> - * this can happen regularly (i.e. ctrl-c)
> - */
> - if (ret == -EINTR)
> + if (ret == -EINTR) {
> + /*
> + * We can't return EINTR if any data was written,
> + * it's not POSIX. It is minimally acceptable
> + * to give a partial write, the way NFS does.
> + *
> + * It would be optimal to return all or nothing,
> + * but if a userspace write is bigger than
> + * an IO buffer, and the interrupt occurs
> + * between buffer writes, that would not be
> + * possible.
> + */
> + switch (new_op->op_state - OP_VFS_STATE_GIVEN_UP) {
> + /*
> + * If the op was waiting when the interrupt
> + * occurred, then the client-core did not
> + * trigger the write.
> + */
> + case OP_VFS_STATE_WAITING:
> + ret = 0;
> + break;
> + /*
> + * If the op was in progress when the interrupt
> + * occurred, then the client-core was able to
> + * trigger the write.
> + */
> + case OP_VFS_STATE_INPROGR:
> + ret = total_size;
> + break;
> + default:
> + gossip_err("%s: unexpected op state :%d:.\n",
> +   __func__,
> +   new_op->op_state);
> + ret = 0;
> + break;
> + }
>   gossip_debug(GOSSIP_FILE_DEBUG,
> -     "%s: returning error %ld\n", __func__,
> -     (long)ret);
> - else
> +     "%s: got EINTR, state:%d: %p\n",
> +     __func__,
> +     new_op->op_state,
> +     new_op);
> + } else {
>   gossip_err("%s: error in %s handle %pU, returning %zd\n",
>   __func__,
>   type == ORANGEFS_IO_READ ?
>   "read from" : "write to",
>   handle, ret);
> + }
>   if (orangefs_cancel_op_in_progress(new_op))
>   return ret;
>
>
> On Sat, Jan 23, 2016 at 6:35 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Sat, Jan 23, 2016 at 2:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>
>>> What should we get?  -EINTR, despite having written some data?
>>
>> No, that's not acceptable.
>>
>> Either all or nothing (which is POSIX) or the NFS 'intr' mount
>> behavior (partial write return, -EINTR only when nothing was written
>> at all). And, like NFS, a mount option might be a good thing.
>>
>> And of course, for the usual reasons, fatal signals are special in
>> that for them we generally say "screw posix, nobody sees the return
>> value anyway", but even there the filesystem might as well still
>> return the partial return value (just to not introduce yet another
>> special case).
>>
>> In fact, I think that with our "fatal signals interrupt" behavior,
>> nobody should likely use the "intr" mount option on NFS. Even if the
>> semantics may be "better", there are likely simply just too many
>> programs that don't check the return value of "write()" at all, much
>> less handle partial writes correctly.
>>
>> (And yes, our "screw posix" behavior wrt fatal signals is strictly
>> wrong even _despite_ the fact that nobody sees the return value -
>> other processes can still obviously see that the whole write wasn't
>> done. But blocking on a fatal signal is _so_ annoying that it's one of
>> those things where we just say "posix was wrong on this one, and if we
>> squint a bit we look _almost_ like we're compliant").
>>
>>               Linus


end of thread, other threads:[~2016-03-04 20:55 UTC | newest]

Thread overview: 111+ messages
2016-01-15 21:46 Orangefs ABI documentation Mike Marshall
2016-01-22  7:11 ` Al Viro
2016-01-22 11:09   ` Mike Marshall
2016-01-22 16:59     ` Mike Marshall
2016-01-22 17:08       ` Al Viro
2016-01-22 17:40         ` Mike Marshall
2016-01-22 17:43         ` Al Viro
2016-01-22 18:17           ` Mike Marshall
2016-01-22 18:37             ` Al Viro
2016-01-22 19:07               ` Mike Marshall
2016-01-22 19:21                 ` Mike Marshall
2016-01-22 20:04                   ` Al Viro
2016-01-22 20:30                     ` Mike Marshall
2016-01-23  0:12                       ` Al Viro
2016-01-23  1:28                         ` Al Viro
2016-01-23  2:54                           ` Mike Marshall
2016-01-23 19:10                             ` Al Viro
2016-01-23 19:24                               ` Mike Marshall
2016-01-23 21:35                                 ` Mike Marshall
2016-01-23 22:05                                   ` Al Viro
2016-01-23 21:40                                 ` Al Viro
2016-01-23 22:36                                   ` Mike Marshall
2016-01-24  0:16                                     ` Al Viro
2016-01-24  4:05                                       ` Al Viro
2016-01-24 22:12                                         ` Mike Marshall
2016-01-30 17:22                                           ` Al Viro
2016-01-26 19:52                                         ` Martin Brandenburg
2016-01-30 17:34                                           ` Al Viro
2016-01-30 18:27                                             ` Al Viro
2016-02-04 23:30                                               ` Mike Marshall
2016-02-06 19:42                                                 ` Al Viro
2016-02-07  1:38                                                   ` Al Viro
2016-02-07  3:53                                                     ` Al Viro
2016-02-07 20:01                                                       ` [RFC] bufmap-related wait logics (Re: Orangefs ABI documentation) Al Viro
2016-02-08 22:26                                                       ` Orangefs ABI documentation Mike Marshall
2016-02-08 23:35                                                         ` Al Viro
2016-02-09  3:32                                                           ` Al Viro
2016-02-09 14:34                                                             ` Mike Marshall
2016-02-09 17:40                                                               ` Al Viro
2016-02-09 21:06                                                                 ` Al Viro
2016-02-09 22:25                                                                   ` Mike Marshall
2016-02-11 23:36                                                                   ` Mike Marshall
2016-02-09 22:02                                                                 ` Mike Marshall
2016-02-09 22:16                                                                   ` Al Viro
2016-02-09 22:40                                                                     ` Al Viro
2016-02-09 23:13                                                                       ` Al Viro
2016-02-10 16:44                                                                         ` Al Viro
2016-02-10 21:26                                                                           ` Al Viro
2016-02-11 23:54                                                                           ` Mike Marshall
2016-02-12  0:55                                                                             ` Al Viro
2016-02-12 12:13                                                                               ` Mike Marshall
2016-02-11  0:44                                                                         ` Al Viro
2016-02-11  3:22                                                                           ` Mike Marshall
2016-02-12  4:27                                                                             ` Al Viro
2016-02-12 12:26                                                                               ` Mike Marshall
2016-02-12 18:00                                                                                 ` Martin Brandenburg
2016-02-13 17:18                                                                                   ` Mike Marshall
2016-02-13 17:47                                                                                     ` Al Viro
2016-02-14  2:56                                                                                       ` Al Viro
2016-02-14  3:46                                                                                         ` [RFC] slot allocator - waitqueue use review needed (Re: Orangefs ABI documentation) Al Viro
2016-02-14  4:06                                                                                           ` Al Viro
2016-02-16  2:12                                                                                           ` Al Viro
2016-02-16 19:28                                                                                             ` Al Viro
2016-02-14 22:31                                                                                         ` Orangefs ABI documentation Mike Marshall
2016-02-14 23:43                                                                                           ` Al Viro
2016-02-15 17:46                                                                                             ` Mike Marshall
2016-02-15 18:45                                                                                               ` Al Viro
2016-02-15 22:32                                                                                                 ` Martin Brandenburg
2016-02-15 23:04                                                                                                   ` Al Viro
2016-02-16 23:15                                                                                                     ` Mike Marshall
2016-02-16 23:36                                                                                                       ` Al Viro
2016-02-16 23:54                                                                                                         ` Al Viro
2016-02-17 19:24                                                                                                           ` Mike Marshall
2016-02-17 20:11                                                                                                             ` Al Viro
2016-02-17 21:17                                                                                                               ` Al Viro
2016-02-17 22:24                                                                                                                 ` Mike Marshall
2016-02-17 22:40                                                                                                             ` Martin Brandenburg
2016-02-17 23:09                                                                                                               ` Al Viro
2016-02-17 23:15                                                                                                                 ` Al Viro
2016-02-18  0:04                                                                                                                   ` Al Viro
2016-02-18 11:11                                                                                                                     ` Al Viro
2016-02-18 18:58                                                                                                                       ` Mike Marshall
2016-02-18 19:20                                                                                                                         ` Al Viro
2016-02-18 19:49                                                                                                                         ` Martin Brandenburg
2016-02-18 20:08                                                                                                                           ` Mike Marshall
2016-02-18 20:22                                                                                                                             ` Mike Marshall
2016-02-18 20:38                                                                                                                               ` Mike Marshall
2016-02-18 20:52                                                                                                                                 ` Al Viro
2016-02-18 21:50                                                                                                                                   ` Mike Marshall
2016-02-19  0:25                                                                                                                                     ` Al Viro
2016-02-19 22:11                                                                                                                                       ` Mike Marshall
2016-02-19 22:22                                                                                                                                         ` Al Viro
2016-02-20 12:14                                                                                                                                           ` Mike Marshall
2016-02-20 13:36                                                                                                                                             ` Al Viro
2016-02-22 16:20                                                                                                                                               ` Mike Marshall
2016-02-22 21:22                                                                                                                                                 ` Mike Marshall
2016-02-23 21:58                                                                                                                                                   ` Mike Marshall
2016-02-26 20:21                                                                                                                                                     ` Mike Marshall
2016-02-19 22:32                                                                                                                                         ` Al Viro
2016-02-19 22:45                                                                                                                                           ` Martin Brandenburg
2016-02-19 22:50                                                                                                                                           ` Martin Brandenburg
2016-02-18 20:49                                                                                                                               ` Al Viro
2016-02-15 22:47                                                                                                 ` Mike Marshall
2016-01-23 22:46                                   ` write() semantics (Re: Orangefs ABI documentation) Al Viro
2016-01-23 23:35                                     ` Linus Torvalds
2016-03-03 22:25                                       ` Mike Marshall
2016-03-04 20:55                                         ` Mike Marshall
2016-01-22 20:51                     ` Orangefs ABI documentation Mike Marshall
2016-01-22 23:53                       ` Mike Marshall
2016-01-22 19:54                 ` Al Viro
2016-01-22 19:50             ` Al Viro
