linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PULL REQUEST] : ima-appraisal patches
@ 2012-04-18 13:04 Mimi Zohar
  2012-04-18 15:02 ` James Morris
  0 siblings, 1 reply; 91+ messages in thread
From: Mimi Zohar @ 2012-04-18 13:04 UTC (permalink / raw)
  To: James Morris
  Cc: linux-security-module, linux-kernel, linux-fsdevel, Al Viro,
	David Safford, Dmitry Kasatkin

Hi James,

As the last IMA-appraisal posting on 3/29 addressed Al's
performance/maintenance concerns of deferring the __fput() and there
hasn't been any additional comments, please consider pulling the
IMA-appraisal patches.

The linux-integrity.git also contains the two prereqs:
   vfs: fix IMA lockdep circular locking dependency  (Acked by Eric)
   vfs: iversion truncate bug fix (currently in linux-next, via Andrew)

The following changes since commit eadc10b3e17f00681f7bfb2ed6e4aee39ad93f03:

  vfs: extend vfs_removexattr locking (2012-04-18 07:06:55 -0400)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git next-ima-appraisal

thanks,

Mimi

Dmitry Kasatkin (3):
      ima: free securityfs violations file
      ima: allocating iint improvements
      ima: digital signature verification support

Mimi Zohar (8):
      vfs: move ima_file_free before releasing the file
      ima: integrity appraisal extension
      ima: add appraise action keywords and default rules
      ima: replace iint spinlock with rwlock/read_lock
      ima: add inode_post_setattr call
      ima: add ima_inode_setxattr/removexattr function and calls
      ima: defer calling __fput()
      ima: add support for different security.ima data types

 Documentation/ABI/testing/ima_policy  |   25 ++-
 Documentation/kernel-parameters.txt   |    8 +
 fs/attr.c                             |    2 +
 fs/file_table.c                       |    7 +-
 include/linux/ima.h                   |   32 +++
 include/linux/integrity.h             |    7 +-
 include/linux/xattr.h                 |    3 +
 security/integrity/evm/evm_main.c     |    3 +
 security/integrity/iint.c             |   64 +++----
 security/integrity/ima/Kconfig        |   15 ++
 security/integrity/ima/Makefile       |    2 +
 security/integrity/ima/ima.h          |   37 ++++-
 security/integrity/ima/ima_api.c      |   56 ++++--
 security/integrity/ima/ima_appraise.c |  344 +++++++++++++++++++++++++++++++++
 security/integrity/ima/ima_crypto.c   |    8 +-
 security/integrity/ima/ima_fs.c       |    1 +
 security/integrity/ima/ima_main.c     |   89 ++++++---
 security/integrity/ima/ima_policy.c   |  181 +++++++++++++-----
 security/integrity/integrity.h        |   11 +-
 security/security.c                   |    6 +
 20 files changed, 754 insertions(+), 147 deletions(-)
 create mode 100644 security/integrity/ima/ima_appraise.c



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PULL REQUEST] : ima-appraisal patches
  2012-04-18 13:04 [PULL REQUEST] : ima-appraisal patches Mimi Zohar
@ 2012-04-18 15:02 ` James Morris
  2012-04-18 18:07   ` Mimi Zohar
  0 siblings, 1 reply; 91+ messages in thread
From: James Morris @ 2012-04-18 15:02 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: linux-security-module, linux-kernel, linux-fsdevel, Al Viro,
	David Safford, Dmitry Kasatkin

On Wed, 18 Apr 2012, Mimi Zohar wrote:

> Hi James,
> 
> As the last IMA-appraisal posting on 3/29 addressed Al's
> performance/maintenance concerns of deferring the __fput() and there
> hasn't been any additional comments, please consider pulling the
> IMA-appraisal patches.

I thought Dmitry was still waiting on comments from Al for this.



- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PULL REQUEST] : ima-appraisal patches
  2012-04-18 15:02 ` James Morris
@ 2012-04-18 18:07   ` Mimi Zohar
  2012-04-18 18:39     ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Mimi Zohar @ 2012-04-18 18:07 UTC (permalink / raw)
  To: James Morris
  Cc: linux-security-module, linux-kernel, linux-fsdevel, Al Viro,
	David Safford, Dmitry Kasatkin

On Thu, 2012-04-19 at 01:02 +1000, James Morris wrote:
> On Wed, 18 Apr 2012, Mimi Zohar wrote:
> 
> > Hi James,
> > 
> > As the last IMA-appraisal posting on 3/29 addressed Al's
> > performance/maintenance concerns of deferring the __fput() and there
> > hasn't been any additional comments, please consider pulling the
> > IMA-appraisal patches.
> 
> I thought Dmitry was still waiting on comments from Al for this.

Al NAK'ed the original 'ima: defer calling __fput()' patch from 3/22,
but we addressed both of Al's performance and maintaince concerns in the
last posting on 3/29, which was three weeks ago.  We haven't heard
back ...

>From the 'ima: defer calling __fput()' patch description:

ima_file_free(), which is called on __fput(), updates the file data
hash stored as an extended attribute to reflect file changes.  If a
file is closed before it is munmapped, __fput() is called with the
mmap_sem taken.  With IMA-appraisal enabled, this results in an
mmap_sem/i_mutex lockdep.  ima_defer_fput() increments the f_count to
defer the __fput() being called until after the mmap_sem is released.

The number of __fput() calls needing to be deferred is minimal.  Only
those files in policy, that were closed prior to the munmap and were
mmapped write, need to defer the __fput().

With this patch, on a clean F16 install, from boot to login, only
5 out of ~100,000 mmap_sem held fput() calls were deferred.

Changelog v4:
- ima_defer_fput() performance improvements
- move ima_defer_fput() call from remove_vma() to __fput()
- only defer mmapped files opened for write
- remove unnecessary ima_fput_mempool_destroy()

Mimi


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PULL REQUEST] : ima-appraisal patches
  2012-04-18 18:07   ` Mimi Zohar
@ 2012-04-18 18:39     ` Al Viro
  2012-04-18 20:56       ` Mimi Zohar
  2012-04-19 19:57       ` Mimi Zohar
  0 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-18 18:39 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: James Morris, linux-security-module, linux-kernel, linux-fsdevel,
	David Safford, Dmitry Kasatkin

On Wed, Apr 18, 2012 at 02:07:52PM -0400, Mimi Zohar wrote:

> >From the 'ima: defer calling __fput()' patch description:
> 
> ima_file_free(), which is called on __fput(), updates the file data
> hash stored as an extended attribute to reflect file changes.  If a
> file is closed before it is munmapped, __fput() is called with the
> mmap_sem taken.  With IMA-appraisal enabled, this results in an
> mmap_sem/i_mutex lockdep.  ima_defer_fput() increments the f_count to
> defer the __fput() being called until after the mmap_sem is released.
> 
> The number of __fput() calls needing to be deferred is minimal.  Only
> those files in policy, that were closed prior to the munmap and were
> mmapped write, need to defer the __fput().
> 
> With this patch, on a clean F16 install, from boot to login, only
> 5 out of ~100,000 mmap_sem held fput() calls were deferred.

Assuming that it's commit 3cee52ffe8ca925bb1e96f804daa87f7e2e34e46
Author: Mimi Zohar <zohar@linux.vnet.ibm.com>
Date:   Fri Feb 24 06:23:12 2012 -0500

    ima: defer calling __fput()
in your tree, the NAK still stands.  For starters, but you are creating a
different locking rules for IMA-enabled builds and for everything else.
Moreover, this deferral is done only for files opened for write; the
rules are convoluted as hell *and* inviting abuses.  

NAKed at least until you come up with formal proof that there's no other
lock where fput() would be possible and ->i_mutex was not allowed.  This
is not a way to go; that kind of kludges leads to locking code that is
impossible to reason about.

PS: BTW, what the hell is "fput already scheduled" codepath about?
Why is it pr_info() and not an outright BUG_ON()?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PULL REQUEST] : ima-appraisal patches
  2012-04-18 18:39     ` Al Viro
@ 2012-04-18 20:56       ` Mimi Zohar
  2012-04-19 19:57       ` Mimi Zohar
  1 sibling, 0 replies; 91+ messages in thread
From: Mimi Zohar @ 2012-04-18 20:56 UTC (permalink / raw)
  To: Al Viro
  Cc: James Morris, linux-security-module, linux-kernel, linux-fsdevel,
	David Safford, Dmitry Kasatkin

On Wed, 2012-04-18 at 19:39 +0100, Al Viro wrote:
> On Wed, Apr 18, 2012 at 02:07:52PM -0400, Mimi Zohar wrote:
> 
> > >From the 'ima: defer calling __fput()' patch description:
> > 
> > ima_file_free(), which is called on __fput(), updates the file data
> > hash stored as an extended attribute to reflect file changes.  If a
> > file is closed before it is munmapped, __fput() is called with the
> > mmap_sem taken.  With IMA-appraisal enabled, this results in an
> > mmap_sem/i_mutex lockdep.  ima_defer_fput() increments the f_count to
> > defer the __fput() being called until after the mmap_sem is released.
> > 
> > The number of __fput() calls needing to be deferred is minimal.  Only
> > those files in policy, that were closed prior to the munmap and were
> > mmapped write, need to defer the __fput().
> > 
> > With this patch, on a clean F16 install, from boot to login, only
> > 5 out of ~100,000 mmap_sem held fput() calls were deferred.
> 
> Assuming that it's commit 3cee52ffe8ca925bb1e96f804daa87f7e2e34e46
> Author: Mimi Zohar <zohar@linux.vnet.ibm.com>
> Date:   Fri Feb 24 06:23:12 2012 -0500
> 
>     ima: defer calling __fput()
> in your tree, the NAK still stands.  For starters, but you are creating a
> different locking rules for IMA-enabled builds and for everything else.
> Moreover, this deferral is done only for files opened for write; the
> rules are convoluted as hell *and* inviting abuses.  

Yes, that is the updated version.  For performance, we limited deferring
the __fput() to only those files that could possibly change - open for
write, were closed before being munmapped, and that IMA-appraisal
maintains a file data hash as an xattr.  If the main concern is
different locking rules when IMA is enabled, then we could remove the
IMA criteria and rename ima_defer_fput() to something more generic.

As for "*and* inviting abuses", I'm not sure what you mean.

> NAKed at least until you come up with formal proof that there's no other
> lock where fput() would be possible and ->i_mutex was not allowed.  This
> is not a way to go; that kind of kludges leads to locking code that is
> impossible to reason about.

On __fput(), we need to update the security.ima xattr with a hash of the
file data.  The original thread discussion suggested changing the xattr
locking.  The filesystems seem to do their own xattr locking, but in
fs/xattr.c the i_mutex is taken before accessing the inode
setxattr/removexattr ops. 

hm, lockdep isn't complaining about anything else.  Not sure if that
qualifies as formal proof. 

> PS: BTW, what the hell is "fput already scheduled" codepath about?
> Why is it pr_info() and not an outright BUG_ON()?

I'll fix this.

thanks,

Mimi


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PULL REQUEST] : ima-appraisal patches
  2012-04-18 18:39     ` Al Viro
  2012-04-18 20:56       ` Mimi Zohar
@ 2012-04-19 19:57       ` Mimi Zohar
  2012-04-20  0:43         ` [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches) Al Viro
  1 sibling, 1 reply; 91+ messages in thread
From: Mimi Zohar @ 2012-04-19 19:57 UTC (permalink / raw)
  To: Al Viro
  Cc: James Morris, linux-security-module, linux-kernel, linux-fsdevel,
	David Safford, Dmitry Kasatkin

On Wed, 2012-04-18 at 19:39 +0100, Al Viro wrote:
> On Wed, Apr 18, 2012 at 02:07:52PM -0400, Mimi Zohar wrote:

> NAKed at least until you come up with formal proof that there's no other
> lock where fput() would be possible and ->i_mutex was not allowed.  

Has the discussion here moved from deferring the __fput() for the
mmap_sem/i_mutex lockdep side case, to taking the i_mutex in __fput() in
general?  Lockdep has not reported any problems, other than for the
mmap_sem/i_mutex scenario.

> This
> is not a way to go; that kind of kludges leads to locking code that is
> impossible to reason about.

Are you referring to defering the __fput() or taking the i_mutex in
__fput() in general?

The i_mutex is currently used to protect file data and metadata (eg.
chown, chmod, xattrs).  After the last file data change, the file
metadata needs to be updated to reflect the file data changes.  As
i_mutex is used for protecting both the file data and file metadata, why
would taking the i_mutex in __fput() be kludgie.

I'd really appreciate any help, suggestions.

thanks,

Mimi


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-19 19:57       ` Mimi Zohar
@ 2012-04-20  0:43         ` Al Viro
  2012-04-20  2:31           ` Linus Torvalds
  2012-04-20 18:54           ` Hugh Dickins
  0 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-20  0:43 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: James Morris, linux-security-module, linux-kernel, linux-fsdevel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, Linus Torvalds,
	David Miller

On Thu, Apr 19, 2012 at 03:57:28PM -0400, Mimi Zohar wrote:

> Has the discussion here moved from deferring the __fput() for the
> mmap_sem/i_mutex lockdep side case, to taking the i_mutex in __fput() in
> general?  Lockdep has not reported any problems, other than for the
> mmap_sem/i_mutex scenario.
> 
> > This
> > is not a way to go; that kind of kludges leads to locking code that is
> > impossible to reason about.
> 
> Are you referring to defering the __fput() or taking the i_mutex in
> __fput() in general?

I'm referring to having "are we holding ->mmap_sem" logics anywhere in fput().
Note that your "lockdep has not reported..." is a symptom of the same
problem - it's *NOT* enough to show the lack of deadlocks from the change
of locking rules.  And after that change we'll get even worse situation,
since proving the safety will become harder (and even more so if we end
up adding other cases when we need to defer).

Summary for those who'd been spared the earlier fun: IMA stuff wants to grab
->i_mutex on the final fput() in some cases.  That violates the locking
order in at least one way - final fput() may come under ->mmap_sem, in
which case grabbing ->i_mutex would be a Bloody Bad Idea(tm).  "Solution"
proposed: a bit of spectaculary ugly logics checking in final fput() whether
we have ->mmap_sem locked.  If we do, said final fput() becomes non-final
and is done asynchronously.  This is fundamentally flawed, of course,
since "is ->mmap_sem unlocked" is used as "is it safe to have ->i_mutex
grabbed", and it's both unproven and brittle as hell even if it happens
to be true right now (and I wouldn't bet on that, TBH).

If it had been IMA alone, I would've cheerfully told them where to stuff
the whole thing.  Unfortunately, fput() *is* lock-heavy even without them.
Note that it can easily end up triggering such things as final
deactivate_super().  Now, ->mmap_sem is irrelevant here - after all,
any inodes involved won't be mmapped, or deactivate_super() wouldn't
be final.  However, fput() is called e.g. under rtnl_lock() and I'm
not at all sure that something like NFS won't try to grab it from its
->kill_sb().

So the question is, do we have any reasonable way to get to the situation
where the __fput() would only be done without any locks held?  It might
be worth trying. What we CAN'T do:
	* defer all __fput() (i.e. always do final fput() async).  Obviously
no-go, for performance reasons alone.
	* check some predicate about the set of locks held and defer if it's
false.  That's what IMA folks have tried to do; no-go for the reasons described
above.
	* add a new helper (fput_carefully() or something like that) that would
defer final __fput(), leaving fput() with the current behaviour.  And convert
the potentially unsafe callers to it (starting with everything that holds
->mmap_sem).  No-go since it's not maintainable - a change pretty far away
from a function that does (currently safe) fput() can end up requiring the
conversion to fput_carefully().  Too easy to miss, so this will decay and it
won't be easy to verify correctness several years down the road.

However, there's an approach that might be feasible.  Most of the time
the final fput() *is* done without any locks held and there's a very
large subclass of those call sites - those that come via fput_light().  
What we could do, and what might be maintainable is:
	* prohibit fput_light() with locks held.  Right now we are very
close to that (or already there - I haven't finished checking).
	* convert low-hanging fget/fput in syscalls to fget_light/fput_light.
Makes sense anyway.
	* split off fput_nodefer() (identical to what fput() is right now),
make fput_light() call it instead of fput().  Switch path_lookupat() and
path_openat() to fput_nodefer() as well.  Ditto for sys_socketpair() and
sys_accept4().
	* make fput() (now almost never leading to __fput()) defer the sucker
in very rare cases when it ends up dropping the last reference.
At that point we would have __fput() always done without any locks held,
which would remove all restrictions on the locks that can be taken from it.

Comments?

Disclaimer: I have not finished going through the call sites (note that
there are wrappers - e.g. sockfd_put() *and* sockfd_lookup()), so there might
be obstacles.  In particular, I'm still not sure about SCM_RIGHTS datagrams
handling - IIRC, we had an interesting kludge in there, with a kinda-sorta
deferral/batching.  BTW, I wonder about deadlocks around that one -
drivers/vhost/net.c is doing ->recvmsg() while holding vq->mutex and if
an SCM_RIGHTS datagram comes in, we'll get a bunch of fput() done.  Which
can lead to final umount of a network filesystem, etc.  If that thing can
lead to handle_rx() on the same sucker, we have a deadlock...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20  0:43         ` [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches) Al Viro
@ 2012-04-20  2:31           ` Linus Torvalds
  2012-04-20  2:54             ` Al Viro
  2012-04-20 18:54           ` Hugh Dickins
  1 sibling, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2012-04-20  2:31 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, David Miller

On Thu, Apr 19, 2012 at 5:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> However, there's an approach that might be feasible.  Most of the time
> the final fput() *is* done without any locks held and there's a very
> large subclass of those call sites - those that come via fput_light().
> What we could do, and what might be maintainable is:
>        * prohibit fput_light() with locks held.  Right now we are very
> close to that (or already there - I haven't finished checking).
>        * convert low-hanging fget/fput in syscalls to fget_light/fput_light.
> Makes sense anyway.

Many of them would make sense, yes (looking at vfs_fstatat() etc.

But a lot of fput() calls come from close() -> filp_close -> fput().

And the "fput_light()" model *only* works together with fget_light()
as it is now.

So I do think you need some other model. Of course, we can just do
"fput_light(file, 1)" instead - that seems pretty ugly, though. But
just making "fput()" do a defer on the last count sounds actively
*wrong* for things like close(), which may actually have serious
consistency guarantees (ie the process doing the close() may "know"
that it is the last user, and depend on the fact that the close() did
actually delete the inode etc.

                 Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20  2:31           ` Linus Torvalds
@ 2012-04-20  2:54             ` Al Viro
  2012-04-20  2:58               ` Linus Torvalds
  2012-04-20  3:15               ` [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches) Al Viro
  0 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-20  2:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, David Miller

On Thu, Apr 19, 2012 at 07:31:01PM -0700, Linus Torvalds wrote:
> On Thu, Apr 19, 2012 at 5:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > However, there's an approach that might be feasible. ?Most of the time
> > the final fput() *is* done without any locks held and there's a very
> > large subclass of those call sites - those that come via fput_light().
> > What we could do, and what might be maintainable is:
> > ? ? ? ?* prohibit fput_light() with locks held. ?Right now we are very
> > close to that (or already there - I haven't finished checking).
> > ? ? ? ?* convert low-hanging fget/fput in syscalls to fget_light/fput_light.
> > Makes sense anyway.
> 
> Many of them would make sense, yes (looking at vfs_fstatat() etc.
> 
> But a lot of fput() calls come from close() -> filp_close -> fput().
> 
> And the "fput_light()" model *only* works together with fget_light()
> as it is now.
> 
> So I do think you need some other model. Of course, we can just do
> "fput_light(file, 1)" instead - that seems pretty ugly, though. But
> just making "fput()" do a defer on the last count sounds actively
> *wrong* for things like close(), which may actually have serious
> consistency guarantees (ie the process doing the close() may "know"
> that it is the last user, and depend on the fact that the close() did
> actually delete the inode etc.

Umm...  I really wonder if we *want* filp_close() under any kind of
locks.  You are right - it should not be deferred.  I haven't finished
checking the callers of that puppy, but if we really do it while holding
any kind of lock, we are asking for trouble.  So I'd rather switch
filp_close() to use of fput_nodefer() if that turns out to be possible.

FWIW, the set of primitives I'm thinking of right now is

__fput(file) - same as now
schedule_fput(file) - takes the only reference to file and schedules __fput()
fput_nodefer(file)
{
        if (atomic_long_dec_and_test(&file->f_count))
                __fput(file);
}
fput(file)
{
	if (unlikely(!fput_atomic(file))
		schedule_fput(file);
}
fput_light(file, need_fput)
{
	if (need_fput)
		fput_nodefer(file);
}
fput_light_defer(file, need_fput) // for callers in some weird ioctls, might
				  // not be needed at all
{
	if (need_fput)
		fput(file);
}

and filp_close() would, if that turns out to be possible, call fput_nodefer()
instead of fput().  If we *do* have places where we need deferral in
filp_close() (and I'm fairly sure that any such place is a deadlock right
now), well, we'll need a variant of filp_close() sans the call of fput...()
and those places would call that, followed by full (deferring) fput().

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20  2:54             ` Al Viro
@ 2012-04-20  2:58               ` Linus Torvalds
  2012-04-20  8:09                 ` Al Viro
  2012-04-20  3:15               ` [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches) Al Viro
  1 sibling, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2012-04-20  2:58 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, David Miller

On Thu, Apr 19, 2012 at 7:54 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  I really wonder if we *want* filp_close() under any kind of
> locks.  You are right - it should not be deferred.  I haven't finished
> checking the callers of that puppy, but if we really do it while holding
> any kind of lock, we are asking for trouble.  So I'd rather switch
> filp_close() to use of fput_nodefer() if that turns out to be possible.

Ok, fair enough, looks like a reasonable plan to me.

                  Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20  2:54             ` Al Viro
  2012-04-20  2:58               ` Linus Torvalds
@ 2012-04-20  3:15               ` Al Viro
  1 sibling, 0 replies; 91+ messages in thread
From: Al Viro @ 2012-04-20  3:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, David Miller

On Fri, Apr 20, 2012 at 03:54:38AM +0100, Al Viro wrote:

> and filp_close() would, if that turns out to be possible, call fput_nodefer()
> instead of fput().  If we *do* have places where we need deferral in
> filp_close() (and I'm fairly sure that any such place is a deadlock right
> now), well, we'll need a variant of filp_close() sans the call of fput...()
> and those places would call that, followed by full (deferring) fput().

BTW, some of those filp_close() are simply wrong - e.g. a big pile in
drivers/media/video/cx25821/*.c should be fput().

And yes, we have at least some under mutex - binder_lock held by
binder_ioctl(), which ends up doing binder_transaction() with its
failure cleanup hitting close_filp().  On an arbitrary struct file.
It *probably* doesn't deadlock on the spot, since binder_release()
itself contains a deferral mechanism (perhaps they'd spotted the
deadlock, perhaps there are other reasons to have it).

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20  2:58               ` Linus Torvalds
@ 2012-04-20  8:09                 ` Al Viro
  2012-04-20 15:56                   ` Linus Torvalds
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-20  8:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, David Miller

On Thu, Apr 19, 2012 at 07:58:57PM -0700, Linus Torvalds wrote:
> On Thu, Apr 19, 2012 at 7:54 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Umm... ?I really wonder if we *want* filp_close() under any kind of
> > locks. ?You are right - it should not be deferred. ?I haven't finished
> > checking the callers of that puppy, but if we really do it while holding
> > any kind of lock, we are asking for trouble. ?So I'd rather switch
> > filp_close() to use of fput_nodefer() if that turns out to be possible.
> 
> Ok, fair enough, looks like a reasonable plan to me.

Actually, scratch that; I think I have a better variant
	* switch to fget_light/fput_light where possible; it's not needed
for the rest, but is useful anyway
	* move the guts of filp_close() (everything prior to fput() it does
in the end) into a new helper, turning filp_close() into a couple of calls
(inlined, at that).  Equivalent transformation.
	* add fput_nodefer(); does the same thing fput() does now.  Make
fput() defer the call of __fput().  Add a filp_close_nodefer() - same as
filp_close(), but the second call in it is fput_nodefer(), not fput().  And
switch the callers that are known to be done outside of any locks to
filp_close_nodefer().  Switch accept4()/socketpair() to fput_nodefer()
(for freshly created struct file in case of cleanup on failure exit).

Note that we do *not* need to bother with fput_light() - even if it does
fput(), that fput() won't usually be the final one.  We'd need it to
race with close() or dup2() from another thread for that to happen.  So
we only need to review the callers of filp_close() and there are much
fewer of those.

We also get something else out of that - AFAICS, the kludge in __scm_destroy()
can be killed after that.  We did it to prevent recursion on fput(), right?
Now that recursion will be gone...

I'd probably not bother with trying to be clever in doing the deferral itself -
the point is to make it rare, so it's not a hot path anyway.  We can play
with per-CPU lists, etc., but in this case dumber is better.

Comments?  I'm half-asleep right now, so I might be missing something
obvious...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20  8:09                 ` Al Viro
@ 2012-04-20 15:56                   ` Linus Torvalds
  2012-04-20 16:08                     ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2012-04-20 15:56 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, David Miller

On Fri, Apr 20, 2012 at 1:09 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Note that we do *not* need to bother with fput_light() - even if it does
> fput(), that fput() won't usually be the final one.

Ack. Most of the time the fput_light()->fput will just decrement the use count.

> We also get something else out of that - AFAICS, the kludge in __scm_destroy()
> can be killed after that.  We did it to prevent recursion on fput(), right?
> Now that recursion will be gone...

Hmm.. That points out that we may have a *lot* of these pending final
fput's, though. So the deferral queueing should be fairly light. What
were your particular plans for it?

This actually sounds like a fairly good usage-case for Oleg's new
task_work_add() thing. That would defer the final fput, but at the
same time guarantee that it gets done before returning to user space -
in case there are any issues with synchronous actions. Have you looked
at Oleg's series? You weren't cc'd because it didn't affect you, but
look at lkml for "task_work_add()" to find it.

NOTE! If pure kernel threads do fput() deferral (and maybe they do -
I'm thinking nfsd etc), then the task-work thing might need some extra
thought.

                     Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 15:56                   ` Linus Torvalds
@ 2012-04-20 16:08                     ` Al Viro
  2012-04-20 16:42                       ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-20 16:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, David Miller

On Fri, Apr 20, 2012 at 08:56:57AM -0700, Linus Torvalds wrote:
> > We also get something else out of that - AFAICS, the kludge in __scm_destroy()
> > can be killed after that. ?We did it to prevent recursion on fput(), right?
> > Now that recursion will be gone...
> 
> Hmm.. That points out that we may have a *lot* of these pending final
> fput's, though. So the deferral queueing should be fairly light. What
> were your particular plans for it?

Doing removal from per-sb list immediately (i.e. before possible
deferral; we skip ones with zero ->f_count when we walk the list
anyway), then in case we decide to defer just move them to per-CPU
list and schedule work on that CPU, with handler that will pull the
corresponding list out and do the rest of __fput() for everything
in that list.  No extra locking, just preempt_disable() around the
"move to per-CPU list" bit.  Or a per-CPU spinlock with worker not
being tied to specific CPU and told which CPU's list to work with.
How does CPU hotplug interact with work scheduled on CPU about to
be taken down, BTW?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 16:08                     ` Al Viro
@ 2012-04-20 16:42                       ` Al Viro
  2012-04-20 17:21                         ` Linus Torvalds
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-20 16:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, David Miller

On Fri, Apr 20, 2012 at 05:08:48PM +0100, Al Viro wrote:

> Doing removal from per-sb list immediately (i.e. before possible
> deferral; we skip ones with zero ->f_count when we walk the list
> anyway), then in case we decide to defer just move them to per-CPU
> list and schedule work on that CPU, with handler that will pull the
> corresponding list out and do the rest of __fput() for everything
> in that list.  No extra locking, just preempt_disable() around the
> "move to per-CPU list" bit.  Or a per-CPU spinlock with worker not
> being tied to specific CPU and told which CPU's list to work with.
> How does CPU hotplug interact with work scheduled on CPU about to
> be taken down, BTW?

Actually, I like the per-CPU spinlock variant better; the thing is,
with that scheme we get normal fput() (i.e. non-nodefer variant)
non-blocking.  How about this:

__fput() loses file_sb_list_del() call

fput(file)
{
	if (atomic_long_dec_and_test(...)) {
		unsigned long flags;
		struct foo *p;
		file_sb_list_del(file);
		p = get_cpu_var(deferral_lists);
		spin_lock_irqsave(&p->lock, flags);
		list_move(&file->f_u.fu_list, &p->list);
		spin_unlock_irqrestore(&p->lock, flags);
		schedule_work(&p->work);
		put_cpu_var(p);
	}
}

fput_nodefer(file)
{
	if (atomic_long_dec_and_test(...)) {
		file_sb_list_del(file);
		__fput(file);
	}
}

do_deferred_fput_work(work)
{
	struct foo *p = container_of(work, struct foo, work);
	LIST_HEAD(list);
	spin_lock_irq(&p->lock);
	list_splice_init(&p->list, list);
	spin_unlock_irq(&p->lock);
	while (!list_empty(list)) {
		struct file *file = list_entry(list, struct file, f_u.fu_list);
		list_del_init(&file->f_u.fu_list);
		__fput(file);
	}
}

Voila - now only fput_nodefer() is blocking!  fput() can be used from
any context that way, which should kill e.g. a kludge in fs/aio.c.

Comments?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 16:42                       ` Al Viro
@ 2012-04-20 17:21                         ` Linus Torvalds
  2012-04-20 18:07                           ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2012-04-20 17:21 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, David Miller

On Fri, Apr 20, 2012 at 9:42 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Actually, I like the per-CPU spinlock variant better; the thing is,
> with that scheme we get normal fput() (i.e. non-nodefer variant)
> non-blocking.  How about this:

What's the advantage of a per-cpu lock?

If you make the work be per-cpu, then you're better with no locking at
all: just disable interrupts (which you do anyway).

And if you want to use a spinlock, don't bother with the percpu side.

The thing I do not like about the schedule_work approach is that it
(a) totally hides the real cost  - which is the scheduling - and (b)
it is so asynchronous that it will happen potentially long after the
task dropped the reference.

And seriously - that is user-visible behavior.

For example, think about this *common* pattern:

  open+mmap+close+unlink+munmap

which would trigger the whole deferred fput, but also triggers the
actual real unlink() at fput time.

Right now, you can have that kind of thing in a program and
immediately unmount the filesystem afterwards (replace "unmount" with
"cannot see silly-renamed files" etc).

The "totally asynchronous deferral" literally *breaks*semantics*.

Sure, it won't be noticeable in 99.99% of all cases, and I doubt you
can trigger much of a test for it. But it's potential real breakage,
and it's going to be hard to ever see. And then when it *does* happen,
it's going to be totally impossible to debug.

It's not just the "last unlink" thing that gets delayed. It things
like file locking. It's "drop_file_write_access()". It's whatever
random thing that file does at "release()". It's a ton of things like
that. Delaying them has user-visible actions.

That's a whole can of complexities and worries outside of the kernel
interface that you are completely ignoring - just because you are
trying to solve the *simple* complexity of locking interaction
entirely within the kernel.

I think that's a bit myopic. We don't even *know* what the problems
with the async approach might be. Your "simple" solution is simple
only inside the kernel.

This is why I suggested you look at Oleg's patches. If we guarantee
that things won't be delayed past re-entering user mode, all those
issues go away.

                     Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 17:21                         ` Linus Torvalds
@ 2012-04-20 18:07                           ` Al Viro
  2012-04-23 18:01                             ` [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-20 18:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, David Miller

On Fri, Apr 20, 2012 at 10:21:35AM -0700, Linus Torvalds wrote:
> On Fri, Apr 20, 2012 at 9:42 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Actually, I like the per-CPU spinlock variant better; the thing is,
> > with that scheme we get normal fput() (i.e. non-nodefer variant)
> > non-blocking. ?How about this:
> 
> What's the advantage of a per-cpu lock?
> 
> If you make the work be per-cpu, then you're better with no locking at
> all: just disable interrupts (which you do anyway).

Point taken.

> The thing I do not like about the schedule_work approach is that it
> (a) totally hides the real cost  - which is the scheduling - and (b)
> it is so asynchronous that it will happen potentially long after the
> task dropped the reference.

[snip]

> This is why I suggested you look at Oleg's patches. If we guarantee
> that things won't be delayed past re-entering user mode, all those
> issues go away.

I've looked at them.  One obvious problem is that it tracehook_notify_resume()
is not universally called.  AFAICS, hexagon, m68k, microblaze, score, um
and xtensa never call tracehook_notify_resume().  Out of those, hexagon is
manually checking TIF_NOTIFY_RESUME and does key_replace_session_keyring(),
so the call could be easily added into the same place; the rest of those
guys don't even look at TIF_NOTIFY_RESUME anywhere near their signal*.c
and m68k/um/xtensa don't even have it defined, let alone handled.  So this
stuff depends on some amount of asm glue hacking on several architectures ;-/

Another is that final fput() can, indeed, happen in kernel threads, so
pure switch to task_work_add() won't be enough.  I agree that having this
stuff flushed before we return to userland would be a good thing; the
question is, can we easily tell if there will be such a return to userland?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20  0:43         ` [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches) Al Viro
  2012-04-20  2:31           ` Linus Torvalds
@ 2012-04-20 18:54           ` Hugh Dickins
  2012-04-20 19:04             ` Al Viro
  1 sibling, 1 reply; 91+ messages in thread
From: Hugh Dickins @ 2012-04-20 18:54 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, Linus Torvalds,
	David Miller, Andrew Morton

On Fri, 20 Apr 2012, Al Viro wrote:
> On Thu, Apr 19, 2012 at 03:57:28PM -0400, Mimi Zohar wrote:
> 
> > Has the discussion here moved from deferring the __fput() for the
> > mmap_sem/i_mutex lockdep side case, to taking the i_mutex in __fput() in
> > general?  Lockdep has not reported any problems, other than for the
> > mmap_sem/i_mutex scenario.
> > 
> > > This
> > > is not a way to go; that kind of kludges leads to locking code that is
> > > impossible to reason about.
> > 
> > Are you referring to defering the __fput() or taking the i_mutex in
> > __fput() in general?
> 
> I'm referring to having "are we holding ->mmap_sem" logics anywhere in fput().
> Note that your "lockdep has not reported..." is a symptom of the same
> problem - it's *NOT* enough to show the lack of deadlocks from the change
> of locking rules.  And after that change we'll get even worse situation,
> since proving the safety will become harder (and even more so if we end
> up adding other cases when we need to defer).
> 
> Summary for those who'd been spared the earlier fun: IMA stuff wants to grab
> ->i_mutex on the final fput() in some cases.  That violates the locking
> order in at least one way - final fput() may come under ->mmap_sem, in
> which case grabbing ->i_mutex would be a Bloody Bad Idea(tm).  "Solution"
> proposed: a bit of spectaculary ugly logics checking in final fput() whether
> we have ->mmap_sem locked.  If we do, said final fput() becomes non-final
> and is done asynchronously.  This is fundamentally flawed, of course,
> since "is ->mmap_sem unlocked" is used as "is it safe to have ->i_mutex
> grabbed", and it's both unproven and brittle as hell even if it happens
> to be true right now (and I wouldn't bet on that, TBH).

I can see that the discussion has since moved on quite a way from here.

But it looks to me fairly easy for mm to stop doing fput() under mmap_sem.

That's already the case when exiting (no mmap_sem held), and shouldn't add
observable cost when unmapping (we already work on a chain of vmas to be
freed, and when unmapping that chain will usually just be of one: shouldn't
matter to defer a final pass until after mmap_sem is dropped).  Unless I'm
mistaken, the fput() buried in vma_adjust() can never be a final fput.

Is it worth my trying to implement that?  Or do you see an immediate
gotcha that I'm missing?  Or are you happy enough with your deferred
fput() ideas, that it would be a waste of time to rearrange the mm end?

Hugh

> 
> If it had been IMA alone, I would've cheerfully told them where to stuff
> the whole thing.  Unfortunately, fput() *is* lock-heavy even without them.
> Note that it can easily end up triggering such things as final
> deactivate_super().  Now, ->mmap_sem is irrelevant here - after all,
> any inodes involved won't be mmapped, or deactivate_super() wouldn't
> be final.  However, fput() is called e.g. under rtnl_lock() and I'm
> not at all sure that something like NFS won't try to grab it from its
> ->kill_sb().
> 
> So the question is, do we have any reasonable way to get to the situation
> where the __fput() would only be done without any locks held?  It might
> be worth trying. What we CAN'T do:
> 	* defer all __fput() (i.e. always do final fput() async).  Obviously
> no-go, for performance reasons alone.
> 	* check some predicate about the set of locks held and defer if it's
> false.  That's what IMA folks have tried to do; no-go for the reasons described
> above.
> 	* add a new helper (fput_carefully() or something like that) that would
> defer final __fput(), leaving fput() with the current behaviour.  And convert
> the potentially unsafe callers to it (starting with everything that holds
> ->mmap_sem).  No-go since it's not maintainable - a change pretty far away
> from a function that does (currently safe) fput() can end up requiring the
> conversion to fput_carefully().  Too easy to miss, so this will decay and it
> won't be easy to verify correctness several years down the road.
> 
> However, there's an approach that might be feasible.  Most of the time
> the final fput() *is* done without any locks held and there's a very
> large subclass of those call sites - those that come via fput_light().  
> What we could do, and what might be maintainable is:
> 	* prohibit fput_light() with locks held.  Right now we are very
> close to that (or already there - I haven't finished checking).
> 	* convert low-hanging fget/fput in syscalls to fget_light/fput_light.
> Makes sense anyway.
> 	* split off fput_nodefer() (identical to what fput() is right now),
> make fput_light() call it instead of fput().  Switch path_lookupat() and
> path_openat() to fput_nodefer() as well.  Ditto for sys_socketpair() and
> sys_accept4().
> 	* make fput() (now almost never leading to __fput()) defer the sucker
> in very rare cases when it ends up dropping the last reference.
> At that point we would have __fput() always done without any locks held,
> which would remove all restrictions on the locks that can be taken from it.
> 
> Comments?
> 
> Disclaimer: I have not finished going through the call sites (note that
> there are wrappers - e.g. sockfd_put() *and* sockfd_lookup()), so there might
> be obstacles.  In particular, I'm still not sure about SCM_RIGHTS datagrams
> handling - IIRC, we had an interesting kludge in there, with a kinda-sorta
> deferral/batching.  BTW, I wonder about deadlocks around that one -
> drivers/vhost/net.c is doing ->recvmsg() while holding vq->mutex and if
> an SCM_RIGHTS datagram comes in, we'll get a bunch of fput() done.  Which
> can lead to final umount of a network filesystem, etc.  If that thing can
> lead to handle_rx() on the same sucker, we have a deadlock...
> --

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 18:54           ` Hugh Dickins
@ 2012-04-20 19:04             ` Al Viro
  2012-04-20 19:18               ` Linus Torvalds
  2012-04-20 19:37               ` Al Viro
  0 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-20 19:04 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, Linus Torvalds,
	David Miller, Andrew Morton

On Fri, Apr 20, 2012 at 11:54:01AM -0700, Hugh Dickins wrote:
> I can see that the discussion has since moved on quite a way from here.
> 
> But it looks to me fairly easy for mm to stop doing fput() under mmap_sem.
> 
> That's already the case when exiting (no mmap_sem held), and shouldn't add
> observable cost when unmapping (we already work on a chain of vmas to be
> freed, and when unmapping that chain will usually just be of one: shouldn't
> matter to defer a final pass until after mmap_sem is dropped).  Unless I'm
> mistaken, the fput() buried in vma_adjust() can never be a final fput.
> 
> Is it worth my trying to implement that?  Or do you see an immediate
> gotcha that I'm missing?  Or are you happy enough with your deferred
> fput() ideas, that it would be a waste of time to rearrange the mm end?

Deferring the final pass after dropping ->mmap_sem is going to be
interesting; what would protect ->vm_next on those suckers?  Locking
rules are already bloody complicated in mm/*; this will only add more
subtle fun.  Note that right now both the dissolution of ->vm_next
list and freeing of VMAs happen under ->mmap_sem; changing that might
cost us a lot of headache...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 19:04             ` Al Viro
@ 2012-04-20 19:18               ` Linus Torvalds
  2012-04-20 19:32                 ` Hugh Dickins
  2012-04-20 19:58                 ` Al Viro
  2012-04-20 19:37               ` Al Viro
  1 sibling, 2 replies; 91+ messages in thread
From: Linus Torvalds @ 2012-04-20 19:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Hugh Dickins, linux-fsdevel, James Morris, linux-security-module,
	linux-kernel, David Safford, Dmitry Kasatkin, Mimi Zohar,
	David Miller, Andrew Morton

On Fri, Apr 20, 2012 at 12:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Deferring the final pass after dropping ->mmap_sem is going to be
> interesting; what would protect ->vm_next on those suckers?

Just remove them from the active list, and keep them linked to each
other using vm_next.

After all, the only case we care about is the case where the vma gets
removed entirely, so we just put them on their own list.

In fact, that's what we already do for other reasons. See
detach_vmas_to_be_unmapped().

So vm_next is actually entirely *private* by this time, and needs no
locking at all.

As far as I can tell, we could just make do_munmap() return that
private list, and then do the fput's and freeing of the list outside
the mmap_sem lock.

That actually looks pretty simple. There are a fair number of callers,
which looks to be the main annoyance. But fixing it up with some
pattern of "do finish_munmap after drooping the mmap_sem" doesn't look
*complicated*, just slightly annoying.

The *bigger* annoyance is actually "do_mmap()", which does a
do_munmap() as part of it, so it needs the same cleanup too.

There might be other cases that do munmap as part of their operation
(and that have the mmap_sem held outside of the caller), but
do_munmap() and do_mmap() seem to be the two obvious ones.

Many of the callers seem to do the mmap_sem() right around the call,
though (see binfmt_elf.c for example), so it really would be a rather
local fixup.

                   Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 19:18               ` Linus Torvalds
@ 2012-04-20 19:32                 ` Hugh Dickins
  2012-04-20 19:58                 ` Al Viro
  1 sibling, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2012-04-20 19:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, linux-fsdevel, James Morris, linux-security-module,
	linux-kernel, David Safford, Dmitry Kasatkin, Mimi Zohar,
	David Miller, Andrew Morton

On Fri, 20 Apr 2012, Linus Torvalds wrote:
> On Fri, Apr 20, 2012 at 12:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Deferring the final pass after dropping ->mmap_sem is going to be
> > interesting; what would protect ->vm_next on those suckers?
> 
> Just remove them from the active list, and keep them linked to each
> other using vm_next.
> 
> After all, the only case we care about is the case where the vma gets
> removed entirely, so we just put them on their own list.
> 
> In fact, that's what we already do for other reasons. See
> detach_vmas_to_be_unmapped().
> 
> So vm_next is actually entirely *private* by this time, and needs no
> locking at all.
> 
> As far as I can tell, we could just make do_munmap() return that
> private list, and then do the fput's and freeing of the list outside
> the mmap_sem lock.
> 
> That actually looks pretty simple. There are a fair number of callers,
> which looks to be the main annoyance. But fixing it up with some
> pattern of "do finish_munmap after drooping the mmap_sem" doesn't look
> *complicated*, just slightly annoying.
> 
> The *bigger* annoyance is actually "do_mmap()", which does a
> do_munmap() as part of it, so it needs the same cleanup too.
> 
> There might be other cases that do munmap as part of their operation
> (and that have the mmap_sem held outside of the caller), but
> do_munmap() and do_mmap() seem to be the two obvious ones.
> 
> Many of the callers seem to do the mmap_sem() right around the call,
> though (see binfmt_elf.c for example), so it really would be a rather
> local fixup.

Yes, that's exactly how I was thinking of it.

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 19:04             ` Al Viro
  2012-04-20 19:18               ` Linus Torvalds
@ 2012-04-20 19:37               ` Al Viro
  1 sibling, 0 replies; 91+ messages in thread
From: Al Viro @ 2012-04-20 19:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-fsdevel, James Morris, linux-security-module, linux-kernel,
	David Safford, Dmitry Kasatkin, Mimi Zohar, Linus Torvalds,
	David Miller, Andrew Morton

On Fri, Apr 20, 2012 at 08:04:18PM +0100, Al Viro wrote:

> Deferring the final pass after dropping ->mmap_sem is going to be
> interesting; what would protect ->vm_next on those suckers?  Locking
> rules are already bloody complicated in mm/*; this will only add more
> subtle fun.  Note that right now both the dissolution of ->vm_next
> list and freeing of VMAs happen under ->mmap_sem; changing that might
> cost us a lot of headache...

BTW, even leaving that aside, we have at least one definite deadlock
(mainline, not on ->mmap_sem - see the mess in drivers/video/msm/mdp.c)
and a possibility of other fun (e.g. any network filesystem that
ends up triggering rtnl_lock() during umount is going to deadlock if
you combine it with sch_atm.c mess where we do fget()/check/fput()
in netlink message handling, close() from another thread racing with
it and descriptor being the only thing that keeps that network fs
alive past lazy umount or namespace dissolution).  And then there's
SCM_RIGHTS vs. drivers/vhost/net.c, which might or might not be
deadlockable - I'm not familiar enough with that driver to tell
right now.

_If_ it had been ->mmap_sem alone, the things would be much simpler.
There wouldn't be a problem, to start with - IMA stuff is not in the
mainline and that's the only thing that really wants ->i_mutex in
fput(); the few existing places that grab it in ->release() are racy
anyway and need to be fixed with proper locking.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 19:18               ` Linus Torvalds
  2012-04-20 19:32                 ` Hugh Dickins
@ 2012-04-20 19:58                 ` Al Viro
  2012-04-20 21:12                   ` Linus Torvalds
  1 sibling, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-20 19:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, linux-fsdevel, James Morris, linux-security-module,
	linux-kernel, David Safford, Dmitry Kasatkin, Mimi Zohar,
	David Miller, Andrew Morton

On Fri, Apr 20, 2012 at 12:18:43PM -0700, Linus Torvalds wrote:

> The *bigger* annoyance is actually "do_mmap()", which does a
> do_munmap() as part of it, so it needs the same cleanup too.

So does a bunch of other places.  Let me dig out the call graph circa
3.3.0...  Here is the relevant part:
do_munmap() -
        <- pfm_do_munmap() <- pfm_remove_smpl_mapping() which grabs mmap_sem excl
        <- 64_munmap(2) which grabs mmap_sem excl
        <- kvm_arch_commit_memory_region() which grabs mmap_sem excl
        <- i810_unmap_buffer() which grabs mmap_sem excl
        <- aio_free_ring() which grabs mmap_sem excl
        <- elf_map() which grabs mmap_sem excl
        <- [flat] load_flat_file() --- BUG HERE
        <- shmdt(2) which grabs mmap_sem excl
        <- brk(2) which grabs mmap_sem excl
        <- mmap_region() [see below]
        <- munmap(2) which grabs mmap_sem excl
        <- do_brk() [see below]
        <- move_vma()
                <- mremap_to() <- do_mremap() [see below]
                <- do_mremap() [see below]
        <- mremap_to() <- do_mremap() [see below]
        <- do_mremap() [see below]
do_brk() -
        <- brk(2) which grabs mmap_sem excl
        <- [ia32_aout] set_brk() which grabs mmap_sem excl
        <- [ia32_aout] load_aout_binary() which grabs mmap_sem excl
        <- [ia32_aout] load_aout_library() which grabs mmap_sem excl
        <- [aout] set_brk() which grabs mmap_sem excl
        <- [aout] load_aout_binary() which grabs mmap_sem excl
        <- [aout] load_aout_library() which grabs mmap_sem excl
        <- [elf] set_brk() which grabs mmap_sem excl
        <- [elf] load_elf_interp() which grabs mmap_sem excl
        <- [elf] load_elf_library() which grabs mmap_sem excl
mmap_region() -
        <- remap_file_pages(2) which grabs mmap_sem excl
        <- do_mmap_pgoff() [see below]
        <- [tile] arch_setup_additional_pages() which grabs mmap_sem excl
                (a bit too late, BTW, but not for this one)
do_mmap_pgoff() -
        <- do_mmap() [see below]
        <- mmap_pgoff(2) which grabs mmap_sem excl
do_mmap() -
        <- shmat(2) which grabs mmap_sem excl
        <- aio_setup_ring() which grabs mmap_sem excl [NB: only because ctx->mm == current->mm]
        <- kvm_arch_prepare_memory_region() which grabs mmap_sem excl
        <- drm_mapbufs() which grabs mmap_sem excl
        <- exynos_drm_gem_mmap_ioctl() which grabs mmap_sem excl
        <- i810_map_buffer() which grabs mmap_sem excl [NB: racy changes of ->f_op]
        <- i915_gem_mmap_ioctl() which grabs mmap_sem excl
        <- [tile] single_step_once() which grabs mmap_sem excl
        <- [elf] elf_map() which grabs mmap_sem excl
        <- [elf] load_elf_binary() which grabs mmap_sem excl
        <- [elf_fdpic] load_elf_fdpic_binary() which grabs mmap_sem excl
        <- [elf_fdpic] elf_fdpic_map_file_constdisp_on_uclinux() which grabs mmap_sem excl
        <- [elf_fdpic] elf_fdpic_map_file_by_direct_mmap() which grabs mmap_sem excl
        <- [aout] load_aout_binary() which grabs mmap_sem excl
        <- [aout] load_aout_library() which grabs mmap_sem excl
        <- [ia32_aout] load_aout_binary() which grabs mmap_sem excl
        <- [ia32_aout] load_aout_library() which grabs mmap_sem excl
        <- [flat] load_flat_file() which grabs mmap_sem excl
        <- [som] map_som_binary() which grabs mmap_sem excl
do_mremap() -
        <- mremap(2) which grabs mmap_sem excl

(bug mentioned re load_flat_file() is still there, but it's irrelevant
for our purposes - no ->mmap_sem held by caller of do_munmap()).  That's
a metric arseload of sites to propagate that thing to...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 19:58                 ` Al Viro
@ 2012-04-20 21:12                   ` Linus Torvalds
  2012-04-20 22:13                     ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2012-04-20 21:12 UTC (permalink / raw)
  To: Al Viro
  Cc: Hugh Dickins, linux-fsdevel, James Morris, linux-security-module,
	linux-kernel, David Safford, Dmitry Kasatkin, Mimi Zohar,
	David Miller, Andrew Morton

On Fri, Apr 20, 2012 at 12:58 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> So does a bunch of other places.  Let me dig out the call graph circa
> 3.3.0...  Here is the relevant part:

Yes, but a lot of those would actually be helped by a helper function
that does all of:
 - grab mmap_sem
 - call do_m[un]map()
 - release mmap_sem

and that would actually clean them up even in the current case.

And then we could do the cleanup in just the helper function.

Not all, no. But a preparatory patch that just creates the helper
functions for doing brk/mmap/munmap would get rid of a fairly big
chunk of them.

You can visualize how many of them do that by just doing

    git grep -5 do_m[un]*map

and then high-lghting '_write(' (to visually show the
down_write/up_write pairs that surround most of them) by searching for
it.

Are they all like that? No. But most of the ones outside of mm/ do fit
that simple pattern and should probably be fixed up just to have them
not contain VM locking details in them *anyway*.

                    Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 21:12                   ` Linus Torvalds
@ 2012-04-20 22:13                     ` Al Viro
  2012-04-20 22:35                       ` Linus Torvalds
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-20 22:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, linux-fsdevel, James Morris, linux-security-module,
	linux-kernel, David Safford, Dmitry Kasatkin, Mimi Zohar,
	David Miller, Andrew Morton

On Fri, Apr 20, 2012 at 02:12:55PM -0700, Linus Torvalds wrote:

> Are they all like that? No. But most of the ones outside of mm/ do fit
> that simple pattern and should probably be fixed up just to have them
> not contain VM locking details in them *anyway*.

Kinda-sorta.  I agree that such helpers make sense, but you are too
optimistic about the number of such places.  And clusterfuck around
mremap() is fairly deep, so propagating all way back to up_write()
wont' be fun.

sys_shmdt() is misusing do_munmap(), AFAICS.  And there we call it many
times in a loop, unmapping only a subset, so it's not like we could
blindly pick a string of vmas and handle them all separately afterwards.
It could be handled if we passed an initially empty list, collecting
vmas in it as we go, but... ouch

BTW, looks like I've just found a bug in oprofile - take a look at
sys_munmap() and you'll see that it's almost exactly your proposed
helper.  The only difference is that it starts with calling
	profile_munmap(addr);
(mind you, *before* grabbing ->mmap_sem), which is to say
        blocking_notifier_call_chain(&munmap_notifier, 0, (void *)addr);
i.e. a really fancy way to arrange a call of munmap_notify() from
drivers/oprofile/buffer_sync.c.  Which does the following:
{
        unsigned long addr = (unsigned long)data;
        struct mm_struct *mm = current->mm;
        struct vm_area_struct *mpnt;

        down_read(&mm->mmap_sem);

        mpnt = find_vma(mm, addr);
        if (mpnt && mpnt->vm_file && (mpnt->vm_flags & VM_EXEC)) {
                up_read(&mm->mmap_sem);
                /* To avoid latency problems, we only process the current CPU,
                 * hoping that most samples for the task are on this CPU
                 */
                sync_buffer(raw_smp_processor_id());
                return 0;
        }

        up_read(&mm->mmap_sem);
        return 0;
}

Leaving aside the convoluted way it's done, it's obviously both racy and
incomplete.  The worst part is, sync_buffer() *can't* be called with
any ->mmap_sem held; it goes through a bunch of task_struct, grabbing
->mmap_sem shared.  Fun...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 22:13                     ` Al Viro
@ 2012-04-20 22:35                       ` Linus Torvalds
  2012-04-27  7:35                         ` Kasatkin, Dmitry
  0 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2012-04-20 22:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Hugh Dickins, linux-fsdevel, James Morris, linux-security-module,
	linux-kernel, David Safford, Dmitry Kasatkin, Mimi Zohar,
	David Miller, Andrew Morton

On Fri, Apr 20, 2012 at 3:13 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Kinda-sorta.  I agree that such helpers make sense, but you are too
> optimistic about the number of such places.  And clusterfuck around
> mremap() is fairly deep, so propagating all way back to up_write()
> wont' be fun.

Fair enough.

I'll do the helpers and see how much they get rid of, just because
looking at all the callers, those helpers seem to be obviously the
right thing anyway. So even if we don't do anything else, we can
improve things regardless.

For do_brk(), for example, it looks like do_brk() itself should
actually be entirely static to mm/mmap.c, because every single caller
from the outside actually wants the self-locking version.

So plan right now: do "vm_xyzzy()" helper functions that do
"do_xyzzy()" and take the lock (and do not take the "mm" argument,
because it had better always be the current one - keep the calling
convention as simple as possible).

                   Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-20 18:07                           ` Al Viro
@ 2012-04-23 18:01                             ` Al Viro
  2012-04-23 18:37                               ` Oleg Nesterov
  2012-04-24  7:26                               ` Al Viro
  0 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-23 18:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Oleg Nesterov, linux-arch, linux-kernel

On Fri, Apr 20, 2012 at 07:07:48PM +0100, Al Viro wrote:
> On Fri, Apr 20, 2012 at 10:21:35AM -0700, Linus Torvalds wrote:

> > This is why I suggested you look at Oleg's patches. If we guarantee
> > that things won't be delayed past re-entering user mode, all those
> > issues go away.
> 
> I've looked at them.  One obvious problem is that it tracehook_notify_resume()
> is not universally called.  AFAICS, hexagon, m68k, microblaze, score, um
> and xtensa never call tracehook_notify_resume().  Out of those, hexagon is
> manually checking TIF_NOTIFY_RESUME and does key_replace_session_keyring(),
> so the call could be easily added into the same place; the rest of those
> guys don't even look at TIF_NOTIFY_RESUME anywhere near their signal*.c
> and m68k/um/xtensa don't even have it defined, let alone handled.  So this
> stuff depends on some amount of asm glue hacking on several architectures ;-/

BTW, I've looked into dealing with that; I think I have a tentative solution
for all these architectures.
	* hexagon: just needs tracehook_notify_resume() added, everything
else is already in place
	* score: TIF_NOTIFY_RESUME is defined *and* included into the
"we need to call do_notify_resume()" logics in assembler glue; just need
to add the usual boilerplate into said do_notify_resume()
	* um: glue in question is in C; easily dealt with, I can do that
(and test the results) tonight
	* m68k: that'll need some glue changes; AFAICS, the easiest solution
is to put TIF_NOTIFY_RESUME into bit 5 - then nommu glue needs no changes
at all, and entry_mm.S needs two "jmi do_signal_return" replaced with
"jne do_signal_return"; the code before those shifts bit 6 to MSBit and
currently bits 0--5 are unused.  Replacing "most significant bit is set" with
"some bits are set" would do the right thing, AFAICT - make the sucker
go into do_signal() handling if either TIF_SIGPENDING (bit 6) or
TIF_NOTIFY_RESUME (bit 5) is set (at that point it has already checked that
TIF_NEED_RESCHEDULE is not set).  On top of that it will need the obvious
changes in do_signal() itself - boilerplate added and current contents
made conditional on TIF_SIGPENDING being set.  I can only test that
on aranym, though - all real m68k hardware I have is pining for fjords right
now.
	* microblaze: TIF_NOTIFY_RESUME is defined, but not hooked anywhere.
Fortunately, the glue is easy enough there - all relevant spots have the
same form
        lwi     r11, r11, TI_FLAGS;     /* get flags in thread info */
        andi    r11, r11, _TIF_SIGPENDING;
        beqi    r11, 1f;                /* Signals to handle, handle them */
and replacing that _TIF_SIGPENDING with _TIF_SIGPENDING | _TIF_NOTIFY_RESUME
will do the right thing; of course, do_signal() itself will need to be
taught about TIF_NOTIFY_RESUME - same as in case of m68k.  No hardware, no
emulators set up, but then it's less intrusive in the glue part than m68k
counterpart.
	* xtensa: TIF_NOTIFY_RESUME needs to be defined (bit 7 would do,
AFAICS) and there the glue does need some change:
        l32i    a4, a2, TI_FLAGS

        _bbsi.l a4, TIF_NEED_RESCHED, 3f
        _bbci.l a4, TIF_SIGPENDING, 4f
should be replaced with (if I'm not misreading their ISA documentation)
        l32i    a4, a2, TI_FLAGS

        _bbsi.l a4, TIF_NEED_RESCHED, 3f
        _bbsi.l a4, TIF_SIGPENDING, 2f
        _bbci.l a4, TIF_NOTIFY_RESUME, 4f
2:
(and do_signal() changes, of course).  That's the most intrusive one and
again, I've neither hw nor emulators for that sucker.

I'll post the patches for all of those tonight; if everything ends up working,
at least we can get rid of the ifdefs on TIF_NOTIFY_RESUME.

Oleg, where does your tree live?  I've walked through the signal handling
on all targets over this weekend (and it's still not complete - there are
fun bugs re multiple sigframes and restart handling on many of those) and
my current queue is at git.kernel.org/pub/scm/linux/kernel/git/viro/signal;
I don't promise that it even builds on everything, so it's *not* a pull
request.  Besides, it's still growing and will be reordered...  The issues
dealt with by now:
[done] don't open-code force_sigsegv()
[done] don't open-code block_sigmask()
[done] If we have failed to set sigframe up, we should send SIGSEGV with
force_sigsegv() (or force_sigsegv_info()) and leave the sigmask alone;
otherwise we need to call block_sigmask() and clear RESTORE_SIGMASK
[done] looking at SA_ONESHOT is pointless (it's a rudiment of very old things;
these days kernel/signal.c does it properly and arch/* code doesn't need to
and actually can't - it only gets a local copy filled, so clearing sa_handler
in it is bloody pointless)
[done] all sigsuspend variants should use sigsuspend() helper (i.e. be based
on use of ->saved_sigmask; see the first patch in queue introducing the helper
in question)
[done] restart_block.fn should be reset on sigreturn, not on signal delivery
[done] sigreturn variants should use set_current_blocked(); make sure to remove KILL/STOP from the set first
[mostly done; parisc and blackfin left] check __get_user()/__put_user() results

I really want to get arch/*/*/*signal* more or less in sync wrt fixes; having
TIF_NOTIFY_RESUME working fits nicely into that.  I'd appreciate a look through
that stuff...

PS: I don't think I'll be posting any pull requests on that tree; it's just
a staging ground for future linux-arch patchbomb(s).

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-23 18:01                             ` [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such Al Viro
@ 2012-04-23 18:37                               ` Oleg Nesterov
  2012-04-24  7:26                               ` Al Viro
  1 sibling, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-23 18:37 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-arch, linux-kernel

On 04/23, Al Viro wrote:
>
> On Fri, Apr 20, 2012 at 07:07:48PM +0100, Al Viro wrote:
> > On Fri, Apr 20, 2012 at 10:21:35AM -0700, Linus Torvalds wrote:
>
> > > This is why I suggested you look at Oleg's patches. If we guarantee
> > > that things won't be delayed past re-entering user mode, all those
> > > issues go away.
> >
> > I've looked at them.  One obvious problem is that it tracehook_notify_resume()
> > is not universally called.  AFAICS, hexagon, m68k, microblaze, score, um
> > and xtensa never call tracehook_notify_resume().

Yes,

> > Out of those, hexagon is
> > manually checking TIF_NOTIFY_RESUME and does key_replace_session_keyring(),
> > so the call could be easily added into the same place;

Yes, I sent the trivial patch.

> > the rest of those
> > guys don't even look at TIF_NOTIFY_RESUME anywhere near their signal*.c
> > and m68k/um/xtensa don't even have it defined, let alone handled.  So this
> > stuff depends on some amount of asm glue hacking on several architectures ;-/

which I don't understand even remotely :/ But I'll try to understand at
least something.

> Oleg, where does your tree live?

Cough. Nowhere. I still can't restore my korg account.

> I've walked through the signal handling
> on all targets over this weekend (and it's still not complete - there are
> fun bugs re multiple sigframes and restart handling on many of those) and
> my current queue is at git.kernel.org/pub/scm/linux/kernel/git/viro/signal;

Thanks. I don't think you need my help, but I'll certainly try to look this
week anyway.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-23 18:01                             ` [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such Al Viro
  2012-04-23 18:37                               ` Oleg Nesterov
@ 2012-04-24  7:26                               ` Al Viro
  2012-04-25  3:06                                 ` Al Viro
  2012-04-26 18:37                                 ` Oleg Nesterov
  1 sibling, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-24  7:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Oleg Nesterov, linux-arch, linux-kernel

On Mon, Apr 23, 2012 at 07:01:50PM +0100, Al Viro wrote:
> BTW, I've looked into dealing with that; I think I have a tentative solution
> for all these architectures.
> 	* hexagon: just needs tracehook_notify_resume() added, everything
> else is already in place
> 	* score: TIF_NOTIFY_RESUME is defined *and* included into the
> "we need to call do_notify_resume()" logics in assembler glue; just need
> to add the usual boilerplate into said do_notify_resume()
> 	* um: glue in question is in C; easily dealt with, I can do that
> (and test the results) tonight
> 	* m68k: that'll need some glue changes; AFAICS, the easiest solution
> is to put TIF_NOTIFY_RESUME into bit 5 - then nommu glue needs no changes
> at all, and entry_mm.S needs two "jmi do_signal_return" replaced with
> "jne do_signal_return"; the code before those shifts bit 6 to MSBit and
> currently bits 0--5 are unused.  Replacing "most significant bit is set" with
> "some bits are set" would do the right thing, AFAICT - make the sucker
> go into do_signal() handling if either TIF_SIGPENDING (bit 6) or
> TIF_NOTIFY_RESUME (bit 5) is set (at that point it has already checked that
> TIF_NEED_RESCHEDULE is not set).  On top of that it will need the obvious
> changes in do_signal() itself - boilerplate added and current contents
> made conditional on TIF_SIGPENDING being set.  I can only test that
> on aranym, though - all real m68k hardware I have is pining for fjords right
> now.
> 	* microblaze: TIF_NOTIFY_RESUME is defined, but not hooked anywhere.
> Fortunately, the glue is easy enough there - all relevant spots have the
> same form
>         lwi     r11, r11, TI_FLAGS;     /* get flags in thread info */
>         andi    r11, r11, _TIF_SIGPENDING;
>         beqi    r11, 1f;                /* Signals to handle, handle them */
> and replacing that _TIF_SIGPENDING with _TIF_SIGPENDING | _TIF_NOTIFY_RESUME
> will do the right thing; of course, do_signal() itself will need to be
> taught about TIF_NOTIFY_RESUME - same as in case of m68k.  No hardware, no
> emulators set up, but then it's less intrusive in the glue part than m68k
> counterpart.
> 	* xtensa: TIF_NOTIFY_RESUME needs to be defined (bit 7 would do,
> AFAICS) and there the glue does need some change:
>         l32i    a4, a2, TI_FLAGS
> 
>         _bbsi.l a4, TIF_NEED_RESCHED, 3f
>         _bbci.l a4, TIF_SIGPENDING, 4f
> should be replaced with (if I'm not misreading their ISA documentation)
>         l32i    a4, a2, TI_FLAGS
> 
>         _bbsi.l a4, TIF_NEED_RESCHED, 3f
>         _bbsi.l a4, TIF_SIGPENDING, 2f
>         _bbci.l a4, TIF_NOTIFY_RESUME, 4f
> 2:
> (and do_signal() changes, of course).  That's the most intrusive one and
> again, I've neither hw nor emulators for that sucker.
> 
> I'll post the patches for all of those tonight; if everything ends up working,
> at least we can get rid of the ifdefs on TIF_NOTIFY_RESUME.

Untested variants pushed into signal.git#master; will test tomorrow.  In
the meanwhile, any code review (and testing of the entire thing on as many
targets as possible) would be very welcome.

Next target in signal fixes: handling of multiple signal arrivals.  We really
need to handle all of them (i.e. build a stack frame for each, with sigcontext
pointing to the entry into the previous one); otherwise we are risking to get
things like e.g. SIGSEGV on failure to build a sigframe being handled at
random point - we handle one arriving signal, send ourselves SIGSEGV in
process and return to userland (without actually going into the signal
handler stack frame we'd failed to build).  Broken architectures: blackfin,
cris, h8300, hexagon, microblaze and probably ia64...

And then there's a lovely issue of what syscall restarts - both on multiple
arriving signals (we want the restart to apply on the _first_ signal being
processed, TYVM, since the rest of those signals are not interrupting
a syscall - conceptually, they are interrupting the previous signal handlers
at the point of entry) and on {rt_,}sigreturn(2) (where we might get a value
in the register normally used to return sys_whatever() value that would look
like one of restart-related special errors, except that it's simply what that
register used to have when e.g. a timer interrupt had hit while we had a signal
pending; hell to debug, since it looks e.g. like one register in userland
process getting its value randomly replaced with -EINTR if it happened to
contain -ERESTARTSYS when an interrupt had happened).  I'd fixed one like
that on arm a couple of years ago, but AFAICS we still have several on
other architectures ;-/

BTW, another fun issue is FUBAR assembler variant of rt_sigprocmask() on
itanic; it's missing the fixes done to set_current_blocked() in commit
e6fa16ab "signal: sigprocmask() should do retarget_shared_pending()".
I'm _not_ about to transpose those fixes into ia64 assembler, thank you
very much - Itanic maintainers are whole-heartedly welcome to deal with
that one.  The horror in question is fsys_rt_sigprocmask()...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-24  7:26                               ` Al Viro
@ 2012-04-25  3:06                                 ` Al Viro
  2012-04-25 12:37                                   ` Oleg Nesterov
  2012-04-26 18:37                                 ` Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-25  3:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, linux-arch, linux-kernel, Russell King, Tejun Heo,
	Arnd Bergmann


> And then there's a lovely issue of what syscall restarts - both on multiple
> arriving signals (we want the restart to apply on the _first_ signal being
> processed, TYVM, since the rest of those signals are not interrupting
> a syscall - conceptually, they are interrupting the previous signal handlers
> at the point of entry) and on {rt_,}sigreturn(2) (where we might get a value
> in the register normally used to return sys_whatever() value that would look
> like one of restart-related special errors, except that it's simply what that
> register used to have when e.g. a timer interrupt had hit while we had a signal
> pending; hell to debug, since it looks e.g. like one register in userland
> process getting its value randomly replaced with -EINTR if it happened to
> contain -ERESTARTSYS when an interrupt had happened).  I'd fixed one like
> that on arm a couple of years ago, but AFAICS we still have several on
> other architectures ;-/

FWIW, there's an interesting question rmk has brought up.  Consider the
following scenario (on any architecture):
	sigsuspend(2) sees a signal and returns ERESTARTNOHAND.
	do_signal() is called and calls get_signal_to_deliver() and gets 0,
for whatever reason.
	We decide to restart, return address adjusted, syscall number
returned to the right register in pt_regs.  In the meanwhile, no matter what
state interrupts used to have before, get_signal_to_deliver() has enabled
them when returning, so we'll need to reload thread flags.  And we find that
another signal has arrived in the meanwhile.
	OK, do_signal() is called again, and this time we have a handler for
the arrived signal.  We form a stack frame and return to userland, into the
beginning of the handler.  We don't even look at the restart-related logics
this time around, due to the usual logics protecting us from double restarts.
	Handler is executed, up to rt_sigreturn(2).
	We decode the sigcontext, restore pt_regs and return to userland.
	Right into the beginning of interrupted sigsuspend()

So we have sigsuspend() hit by a signal we have a handler for.  Handler is
executed and we are stuck is sigsuspend() again, all because a signal without
a handler has arrived just before that one - close enough for our signal to
come right after get_signal_to_deliver() has returned zero to do_signal().

AFAICS, that's a clear bug.  Arrival of a signal that has userland handler
and that isn't blocked by the mask given to sigsuspend() should terminate
sigsuspend().  Sure, we can weasel around the wording in POSIX (it says
"sigsuspend() will return after the signal-catching function returns" without
explicitly saying that it shouldn't sit around waiting for another signals
to arrive), but it's obviously against the intent of standard (as well as
expectations of any programmer).  Note that if the handler-less signal had
arrived a bit earlier, we would've returned to userland and reenter the
kernel before the arrival of our signal.  Then it *would* terminate
sigsuspend() - handler would be executed and we would've returned to caller
of sigsuspend() with EINTR as return value.  If they came simultaneously,
the same thing would've happened - get_signal_to_deliver() would have
returned non-zero the first time around and that would be it.  It's just the
right timing for the second signal that triggers that race.

Solution proposed last summer when that had been noticed by arm folks was
more or less along the lines of
	* new thread flag, checked after we'd seen that no SIGPENDING et.al.
is there.  If it's set, we clear it, do syscall restart work as we would for
handlerless signal and recheck the flags if we had to do something like
__put_user() in process (arm might have to do that for ERESTART_RESTARTBLOCK)[1]
	* do_signal() would set that flag if
		+ anti double-restart logics would not have prevented
		  restarts
		+ error value was ERESTART_...
	* no restart work on "no signal" path in do_signal()
	* if we have a handler and the flag is set, clear it and do what
we normally do for restarts (including the "has ptrace mangled registers
in a way that would prevent restarts in the current code" logics for
architectures that have such logics - arm and sparc, at least).

I would've probably made that TS_... instead of TIF_... at least for something
like x86 - it's obviously thread-synchronous.

[1] it wasn't covered in that thread, but if a signal arrives during that
__put_user() we really won't care - the usual logics in sigreturn() will
make sure that when we return from handler into the resulting restart_syscall(2)
we'll have it immediately fail with -EINTR.

Comments?  AFAICS, the bug is arch-independent; it's not just arm and it's
worse than the example mentioned last year - sigsuspend() is a lot more
common than ppoll() and behaviour clearly goes against what sigsuspend()
exists for.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25  3:06                                 ` Al Viro
@ 2012-04-25 12:37                                   ` Oleg Nesterov
  2012-04-25 12:50                                     ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-25 12:37 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On 04/25, Al Viro wrote:
>
> FWIW, there's an interesting question rmk has brought up.  Consider the
> following scenario (on any architecture):
> 	sigsuspend(2) sees a signal and returns ERESTARTNOHAND.
> 	do_signal() is called and calls get_signal_to_deliver() and gets 0,
> for whatever reason.
> 	We decide to restart, return address adjusted, syscall number
> returned to the right register in pt_regs.  In the meanwhile, no matter what
> state interrupts used to have before, get_signal_to_deliver() has enabled
> them when returning

Afaics this doesn't really matter, TIF_SIGPENDING can be set by another CPU
once get_signal_to_deliver() drops ->siglock.

> , so we'll need to reload thread flags.  And we find that
> another signal has arrived in the meanwhile.
> 	OK, do_signal() is called again, and this time we have a handler for
> the arrived signal.  We form a stack frame and return to userland, into the
> beginning of the handler.  We don't even look at the restart-related logics
> this time around, due to the usual logics protecting us from double restarts.
> 	Handler is executed, up to rt_sigreturn(2).
> 	We decode the sigcontext, restore pt_regs and return to userland.
> 	Right into the beginning of interrupted sigsuspend()
>
> So we have sigsuspend() hit by a signal we have a handler for.  Handler is
> executed and we are stuck is sigsuspend() again, all because a signal without
> a handler has arrived just before that one - close enough for our signal to
> come right after get_signal_to_deliver() has returned zero to do_signal().

Yes, this (and the similar races) were already discussed a couple of times.
In short, regs->ax = -ERESTART* and ->ip doesn't survive after do_signal().
In this case the syscall was already restarted after the first do_signal()
even if we do not return to user-mode yet.

> AFAICS, that's a clear bug.

I do not know. So far it was decided that we do not really care, but
I won't argue if we decide to change the current behaviour.

As for sys_sigsuspend() and this race in particular:

> Arrival of a signal that has userland handler
> and that isn't blocked by the mask given to sigsuspend() should terminate
> sigsuspend().

Yes. But note that do_signal() restores the old sigmask. This means that
the signal we get after the first do_signal() was not blocked before
sigsuspend() was called. So, to some extent, we can pretend that the
handler was executed before sigsuspend() and it was never restarted.

IOW, I tend to agree with the comments from Roland, see for example

	HR timers prevent an itimer from generating EINTR?
	http://marc.info/?t=125210012600005

	[RESEND] [RFC][PATCH X86_32 1/2]: Call do_notify_resume() with interrupts enabled
	http://marc.info/?t=131955450100004

But let me repeat that I never really understood if this is "by design"
or not.

> Solution proposed last summer when that had been noticed by arm folks was
> more or less along the lines of
> 	* new thread flag, checked after we'd seen that no SIGPENDING et.al.
> is there.  If it's set, we clear it, do syscall restart work as we would for
> handlerless signal and recheck the flags if we had to do something like
> __put_user() in process (arm might have to do that for ERESTART_RESTARTBLOCK)[1]
> 	* do_signal() would set that flag if
> 		+ anti double-restart logics would not have prevented
> 		  restarts
> 		+ error value was ERESTART_...
> 	* no restart work on "no signal" path in do_signal()
> 	* if we have a handler and the flag is set, clear it and do what
> we normally do for restarts (including the "has ptrace mangled registers
> in a way that would prevent restarts in the current code" logics for
> architectures that have such logics - arm and sparc, at least).

Hmm. Not sure I understand this in details. But at first glance,
"do_signal() would set that flag" is not enough. We have the similar problem
if we dequeue a SA_RESTART signal first, then another signal without SA_RESTART.
Or I simply misunderstood.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25 12:37                                   ` Oleg Nesterov
@ 2012-04-25 12:50                                     ` Al Viro
  2012-04-25 13:03                                       ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-25 12:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On Wed, Apr 25, 2012 at 02:37:46PM +0200, Oleg Nesterov wrote:
> As for sys_sigsuspend() and this race in particular:
> 
> > Arrival of a signal that has userland handler
> > and that isn't blocked by the mask given to sigsuspend() should terminate
> > sigsuspend().
> 
> Yes. But note that do_signal() restores the old sigmask. This means that
> the signal we get after the first do_signal() was not blocked before
> sigsuspend() was called. So, to some extent, we can pretend that the
> handler was executed before sigsuspend() and it was never restarted.

Signal might have already arrived by the time we restore sigmask.  So no,
it might have been blocked prior to sigsuspend().

I agree that relying on disabled interrupts doesn't work - objection would
have worked if it was just "what if we get NEED_RESCHED and while we are
scheduled away a signal arrives", but this scenario doesn't depend on
that.  We definitely want interrupts enabled before we start playing with
do_notify_resume(), especially if things like deferred fput, etc. end up
there as well.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25 12:50                                     ` Al Viro
@ 2012-04-25 13:03                                       ` Oleg Nesterov
  2012-04-25 13:32                                         ` Oleg Nesterov
  2012-04-25 13:32                                         ` Al Viro
  0 siblings, 2 replies; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-25 13:03 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On 04/25, Al Viro wrote:
>
> On Wed, Apr 25, 2012 at 02:37:46PM +0200, Oleg Nesterov wrote:
> > As for sys_sigsuspend() and this race in particular:
> >
> > > Arrival of a signal that has userland handler
> > > and that isn't blocked by the mask given to sigsuspend() should terminate
> > > sigsuspend().
> >
> > Yes. But note that do_signal() restores the old sigmask. This means that
> > the signal we get after the first do_signal() was not blocked before
> > sigsuspend() was called. So, to some extent, we can pretend that the
> > handler was executed before sigsuspend() and it was never restarted.
>
> Signal might have already arrived by the time we restore sigmask.

Yes, and it sets TIF_SIGPENDING, but unless I missed something this doesn't
matter.

> So no,
> it might have been blocked prior to sigsuspend().

If it was not blocked, then the next do_signal()->get_signal_to_deliver()
returns 0 and clears TIF_SIGPENDING. After that we finally re-enter
sys_sigsuspend() and (assuming it unblocks this sig) notice this pending
signal again and return -EINTR eventually.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25 13:03                                       ` Oleg Nesterov
@ 2012-04-25 13:32                                         ` Oleg Nesterov
  2012-04-25 13:32                                         ` Al Viro
  1 sibling, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-25 13:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On 04/25, Oleg Nesterov wrote:
>
> On 04/25, Al Viro wrote:
> >
> > On Wed, Apr 25, 2012 at 02:37:46PM +0200, Oleg Nesterov wrote:
> > > As for sys_sigsuspend() and this race in particular:
> > >
> > > > Arrival of a signal that has userland handler
> > > > and that isn't blocked by the mask given to sigsuspend() should terminate
> > > > sigsuspend().
> > >
> > > Yes. But note that do_signal() restores the old sigmask. This means that
> > > the signal we get after the first do_signal() was not blocked before
> > > sigsuspend() was called. So, to some extent, we can pretend that the
> > > handler was executed before sigsuspend() and it was never restarted.
> >
> > Signal might have already arrived by the time we restore sigmask.
>
> Yes, and it sets TIF_SIGPENDING, but unless I missed something this doesn't
> matter.
>
> > So no,
> > it might have been blocked prior to sigsuspend().
>
> If it was not blocked,
        ^^^^^^^
I meant, If it WAS blocked.

> then the next do_signal()->get_signal_to_deliver()
> returns 0 and clears TIF_SIGPENDING. After that we finally re-enter
> sys_sigsuspend() and (assuming it unblocks this sig) notice this pending
> signal again and return -EINTR eventually.
>
> Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25 13:03                                       ` Oleg Nesterov
  2012-04-25 13:32                                         ` Oleg Nesterov
@ 2012-04-25 13:32                                         ` Al Viro
  2012-04-25 14:52                                           ` Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-25 13:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On Wed, Apr 25, 2012 at 03:03:29PM +0200, Oleg Nesterov wrote:
> Yes, and it sets TIF_SIGPENDING, but unless I missed something this doesn't
> matter.
> 
> > So no,
> > it might have been blocked prior to sigsuspend().
> 
> If it was not blocked, then the next do_signal()->get_signal_to_deliver()
> returns 0 and clears TIF_SIGPENDING. After that we finally re-enter
> sys_sigsuspend() and (assuming it unblocks this sig) notice this pending
> signal again and return -EINTR eventually.

Point...  Still, since we are talking about an arbitrary wide window (the
damn thing is waiting for signals to arrive, after all) this doesn't
sound good; sure, we might have been hit by SIGSTOP just before entering
that sigsuspend() with SIGCONT arriving only when the signal in question
had, ending up with handler running before sigsuspend(), but...  IMO it's
a QoI problem at the very least.

As for SA_RESTART/!SA_RESTART mixes, if SA_RESTART comes first we should
just take that restart and pretend that the second signal has arrived at
the very beginning of handler, I think.  Assuming I understood you correctly,
that is.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25 13:32                                         ` Al Viro
@ 2012-04-25 14:52                                           ` Oleg Nesterov
  2012-04-25 15:46                                             ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-25 14:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On 04/25, Al Viro wrote:
>
> Point...  Still, since we are talking about an arbitrary wide window (the
> damn thing is waiting for signals to arrive, after all) this doesn't
> sound good;
> ...
> IMO it's
> a QoI problem at the very least.

and looks confusing, agreed.

> As for SA_RESTART/!SA_RESTART mixes, if SA_RESTART comes first we should
> just take that restart and pretend that the second signal has arrived at
> the very beginning of handler, I think.

Yes. My point was, this confuses the user-space developers too. And this
case is equally unclear to me wrt should we (at least try to) change this
or not.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25 14:52                                           ` Oleg Nesterov
@ 2012-04-25 15:46                                             ` Oleg Nesterov
  2012-04-25 16:10                                               ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-25 15:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On 04/25, Oleg Nesterov wrote:
>
> On 04/25, Al Viro wrote:
> >
> > Point...  Still, since we are talking about an arbitrary wide window (the
> > damn thing is waiting for signals to arrive, after all) this doesn't
> > sound good;
> > ...
> > IMO it's
> > a QoI problem at the very least.
>
> and looks confusing, agreed.

OK, I didn't really try to think, and somehow I simply can't wake up today.
But perhaps we can do something the following? We add the new syscall

	sys_eintr(void)
	{
		return -EINTR;
	}

(perhaps not strictly needed, perhaps we can reuse sys_restart_syscal)

Now,

	--- x/arch/x86/kernel/signal.c
	+++ x/arch/x86/kernel/signal.c
	@@ -711,6 +711,13 @@ handle_signal(unsigned long sig, siginfo
				regs->ax = regs->orig_ax;
				regs->ip -= 2;
				break;
	+
	+		case -EINTR:
	+			break;
	+
	+		default:
	+			if (regs->orig_ax == NR_eintr)
	+				regs->ax = NR_eintr;
			}
		}
	 
	@@ -791,6 +798,7 @@ static void do_signal(struct pt_regs *re
			case -ERESTARTSYS:
			case -ERESTARTNOINTR:
				regs->ax = regs->orig_ax;
	+			regs->orig_ax = NR_eintr;
				regs->ip -= 2;
				break;
	 

this ignores ERESTART_RESTARTBLOCK for simplicity.

And I am not sure this can't confuse the tools like strace...

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25 15:46                                             ` Oleg Nesterov
@ 2012-04-25 16:10                                               ` Al Viro
  2012-04-25 17:02                                                 ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-25 16:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On Wed, Apr 25, 2012 at 05:46:11PM +0200, Oleg Nesterov wrote:
> OK, I didn't really try to think, and somehow I simply can't wake up today.
> But perhaps we can do something the following? We add the new syscall
> 
> 	sys_eintr(void)
> 	{
> 		return -EINTR;
> 	}
> 
> (perhaps not strictly needed, perhaps we can reuse sys_restart_syscal)
> 
> Now,
> 
> 	--- x/arch/x86/kernel/signal.c
> 	+++ x/arch/x86/kernel/signal.c
> 	@@ -711,6 +711,13 @@ handle_signal(unsigned long sig, siginfo
> 				regs->ax = regs->orig_ax;
> 				regs->ip -= 2;
> 				break;
> 	+
> 	+		case -EINTR:
> 	+			break;
> 	+
> 	+		default:
> 	+			if (regs->orig_ax == NR_eintr)
> 	+				regs->ax = NR_eintr;
> 			}
> 		}
> 	 
> 	@@ -791,6 +798,7 @@ static void do_signal(struct pt_regs *re
> 			case -ERESTARTSYS:
> 			case -ERESTARTNOINTR:
> 				regs->ax = regs->orig_ax;
> 	+			regs->orig_ax = NR_eintr;
> 				regs->ip -= 2;
> 				break;
> 	 
> 
> this ignores ERESTART_RESTARTBLOCK for simplicity.

Ehh...  You do realize that it's that simple only on architectures that
have syscall number in a register?  arm/parisc/unicore32 have really
painful code dealing with restart_syscall(2); building trampoline on
stack and all such...  Take a look at e.g. insert_restart_trampoline()
in arch/parisc/kernel/signal.c (BTW, it doesn't check for put_user()
failures while building that thing).

FWIW, I wonder if e.g. arm would be better off with the following
trick: new flag added to _TIF_SYSCALL_WORK, with __sys_trace checking
it and if it's set, clearing it and simulating sys_restart_syscall().
ERESTART_RESTARTBLOCK would become practically identical to ERESTARTNOHAND,
except that if we decide to restart it we would set the flag.  All the
trampoline crap goes away that way and restart logics gets simpler on
any platform doing that kind of thing.

I'm not sure that SA_RESTART case is actually worth bothering with -
AFAICS, your mechanism won't cover SA_RESTART/SA_RESTART/!SA_RESTART
combinations anyway, and I'm not at all sure that we _want_ to change
user-visible behaviour.  Basically, you have
	syscall interrupted
		SA_RESTART signal arrives, handler1 entered
			!SA_RESTART signal arrives, handler2 entered
			handler2 returns
		hander1 returns
	syscall restarts
and to get behaviour independent of relative timing of those two signals
you'd have to replace the last one with "syscall fails with EINTR".
It's a user-visible change and I'm not at all sure that we won't break
userland code with it.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25 16:10                                               ` Al Viro
@ 2012-04-25 17:02                                                 ` Oleg Nesterov
  2012-04-25 17:51                                                   ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-25 17:02 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On 04/25, Al Viro wrote:
>
> On Wed, Apr 25, 2012 at 05:46:11PM +0200, Oleg Nesterov wrote:
> >
> > 	--- x/arch/x86/kernel/signal.c
> > 	+++ x/arch/x86/kernel/signal.c
> > 	@@ -711,6 +711,13 @@ handle_signal(unsigned long sig, siginfo
> > 				regs->ax = regs->orig_ax;
> > 				regs->ip -= 2;
> > 				break;
> > 	+
> > 	+		case -EINTR:
> > 	+			break;
> > 	+
> > 	+		default:
> > 	+			if (regs->orig_ax == NR_eintr)
> > 	+				regs->ax = NR_eintr;
> > 			}
> > 		}
> > 	
> > 	@@ -791,6 +798,7 @@ static void do_signal(struct pt_regs *re
> > 			case -ERESTARTSYS:
> > 			case -ERESTARTNOINTR:
> > 				regs->ax = regs->orig_ax;
> > 	+			regs->orig_ax = NR_eintr;
> > 				regs->ip -= 2;
> > 				break;
> > 	
> >
> > this ignores ERESTART_RESTARTBLOCK for simplicity.
>
> Ehh...  You do realize that it's that simple only on architectures that
> have syscall number in a register?

No, I do not ;) I simply know nothing about other architectures.

> Take a look at e.g. insert_restart_trampoline()
> in arch/parisc/kernel/signal.c

Will do. Not sure I will be able to undestand it though. And btw I still
didn't look at the signal patches in your tree, I've just returned from
vacation today.

> I'm not sure that SA_RESTART case is actually worth bothering with -

Neither me. But,

> AFAICS, your mechanism won't cover SA_RESTART/SA_RESTART/!SA_RESTART
> combinations anyway,

Confused. The hack above doesn't even try to cover this case (and it is
obviously wrong in ERESTARTSYS/NOINTR case), I only tried to illustrate
the idea.

> 	syscall interrupted
> 		SA_RESTART signal arrives, handler1 entered
> 			!SA_RESTART signal arrives, handler2 entered
> 			handler2 returns
> 		hander1 returns
> 	syscall restarts

And there is another case:

	syscall interrupted, it returns ERESTARTSYS.

	do_signal() sees no signal, restarts the syscall

	!SA_RESTART comes before we return to user-mode

This doesn't really differ from sigsuspend/ERESTARTNOHAND we discussed
before.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25 17:02                                                 ` Oleg Nesterov
@ 2012-04-25 17:51                                                   ` Al Viro
  2012-04-26  7:15                                                     ` Martin Schwidefsky
  2012-04-26 13:22                                                     ` Oleg Nesterov
  0 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-25 17:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On Wed, Apr 25, 2012 at 07:02:30PM +0200, Oleg Nesterov wrote:

> And there is another case:
> 
> 	syscall interrupted, it returns ERESTARTSYS.
> 
> 	do_signal() sees no signal, restarts the syscall
> 
> 	!SA_RESTART comes before we return to user-mode
> 
> This doesn't really differ from sigsuspend/ERESTARTNOHAND we discussed
> before.

... and dealt with the same way:
	NEED_RESTART flag is set the first time around
	no handler for the first signal, so we do nothing else
	handler is found for the second signal
	NEED_RESTART flag is present, so
		we clear NEED_RESTART
		we look at errno and see ERESTARTSYS
		we check sa_flags and see !SA_RESTART
		we set return value to -EINTR and proceed to set sigframe up
	on the exit to userland we see NEED_RESTART not set, so off we go
I think I hadn't been clear enough; here's pseudocode for what I meant

do_signal()
{
	if (we have any business doing restarts)
		// note: we won't get here on subsequent calls of do_signal()
		// due to the checks above; same logics that currently prevents
		// double restarts
		set NEED_RESTART flag
	sig = get_signal_to_deliver(...)
	if (sig) {
		if (NEED_RESTART set) {
			clear NEED_RESTART
			same thing we do at that spot now - restart or EINTR
			handle_signal(...)
			...
			return;
		}
	}
	/* no handler */
	if (test_and_clear_...(RESTORE_SIGMASK))
		set_current_blocked(&current->saved_sigmask);
}
and in asm glue, *after* checking for SIGPENDING/NOTIFY_RESUME, check
NEED_RESTART and if it's set do what we currently do for restarts on
handlerless signal.

BTW, if we combine that with ERESTART_RESTARTBLOCK trick on arm/parisc/unicore,
there won't be much work to do for restarts in that case; in x86ese it
would be simply
	if (regs->ax == -ERESTART_RESTARTBLOCK)
		set_thread_flag(TIF_EMULATE_RESTART_SYSCALL)
	regs->ax = regs->orig_ax;
	regs->ip -= 2;
all of that conditional on TS_NEED_RESTART...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25 17:51                                                   ` Al Viro
@ 2012-04-26  7:15                                                     ` Martin Schwidefsky
  2012-04-26  7:25                                                       ` David Miller
  2012-04-26 13:52                                                       ` Oleg Nesterov
  2012-04-26 13:22                                                     ` Oleg Nesterov
  1 sibling, 2 replies; 91+ messages in thread
From: Martin Schwidefsky @ 2012-04-26  7:15 UTC (permalink / raw)
  To: Al Viro
  Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel,
	Russell King, Tejun Heo, Arnd Bergmann, Roland McGrath

On Wed, 25 Apr 2012 18:51:13 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> do_signal()
> {
> 	if (we have any business doing restarts)
> 		// note: we won't get here on subsequent calls of do_signal()
> 		// due to the checks above; same logics that currently prevents
> 		// double restarts
> 		set NEED_RESTART flag
> 	sig = get_signal_to_deliver(...)
> 	if (sig) {
> 		if (NEED_RESTART set) {
> 			clear NEED_RESTART
> 			same thing we do at that spot now - restart or EINTR
> 			handle_signal(...)
> 			...
> 			return;
> 		}
> 	}
> 	/* no handler */
> 	if (test_and_clear_...(RESTORE_SIGMASK))
> 		set_current_blocked(&current->saved_sigmask);
> }
> and in asm glue, *after* checking for SIGPENDING/NOTIFY_RESUME, check
> NEED_RESTART and if it's set do what we currently do for restarts on
> handlerless signal.

You need to be careful with inferior calls there. gdb likes to play games
with the registers inside the get_signal_to_deliver call, it wants to be
able to jump out of an interrupted system call, do its inferior call in
the debugee and then return to the interrupted system call.
You would have to to read, modify & restore the NEED_RESTART flag in gdb
over an inferior call.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-26  7:15                                                     ` Martin Schwidefsky
@ 2012-04-26  7:25                                                       ` David Miller
  2012-04-26 13:52                                                       ` Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: David Miller @ 2012-04-26  7:25 UTC (permalink / raw)
  To: schwidefsky
  Cc: viro, oleg, torvalds, linux-arch, linux-kernel, linux, tj, arnd, roland

From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Date: Thu, 26 Apr 2012 09:15:19 +0200

> You need to be careful with inferior calls there. gdb likes to play games
> with the registers inside the get_signal_to_deliver call, it wants to be
> able to jump out of an interrupted system call, do its inferior call in
> the debugee and then return to the interrupted system call.
> You would have to to read, modify & restore the NEED_RESTART flag in gdb
> over an inferior call.

Right and this is what orig_ax and friends are used for on x86.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-25 17:51                                                   ` Al Viro
  2012-04-26  7:15                                                     ` Martin Schwidefsky
@ 2012-04-26 13:22                                                     ` Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-26 13:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On 04/25, Al Viro wrote:
>
> do_signal()
> {
> 	if (we have any business doing restarts)
> 		// note: we won't get here on subsequent calls of do_signal()
> 		// due to the checks above; same logics that currently prevents
> 		// double restarts
> 		set NEED_RESTART flag
> 	sig = get_signal_to_deliver(...)
> 	if (sig) {
> 		if (NEED_RESTART set) {
> 			clear NEED_RESTART
> 			same thing we do at that spot now - restart or EINTR
> 			handle_signal(...)
> 			...
> 			return;
> 		}
> 	}
> 	/* no handler */
> 	if (test_and_clear_...(RESTORE_SIGMASK))
> 		set_current_blocked(&current->saved_sigmask);
> }

OK,

> and in asm glue,

Ah. So we are going to change the ret_from_sys_call-like code.

> check
> NEED_RESTART and if it's set do what we currently do for restarts on
> handlerless signal.

and probably we should clear NEED_RESTART?

OK, thanks, I am starting to understand.

However. Perhaps I missed something, but this doesn't look 100% correct.
Although even _if_ I am right I guess this is pure theoretical problem.

Suppose that we restart the syscall which returned ERESTARTNOHAND. I mean,
the task has already returned to the user-mode and CPU is going to execute
the syscall insn.

What if, say, reschedule interrupt comes before the task enters the kernel
again? The new signal can come before this task returns to the user-mode,
ret_from_intr: will notice it and it is still possible to run a signal
handler before re-entering the syscall.

No?

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-26  7:15                                                     ` Martin Schwidefsky
  2012-04-26  7:25                                                       ` David Miller
@ 2012-04-26 13:52                                                       ` Oleg Nesterov
  2012-04-26 14:31                                                         ` Martin Schwidefsky
  1 sibling, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-26 13:52 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Al Viro, Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On 04/26, Martin Schwidefsky wrote:
>
> On Wed, 25 Apr 2012 18:51:13 +0100
> Al Viro <viro@ZenIV.linux.org.uk> wrote:
>
> > do_signal()
> > {
> > 	if (we have any business doing restarts)
> > 		// note: we won't get here on subsequent calls of do_signal()
> > 		// due to the checks above; same logics that currently prevents
> > 		// double restarts
> > 		set NEED_RESTART flag
> > 	sig = get_signal_to_deliver(...)
> > 	if (sig) {
> > 		if (NEED_RESTART set) {
> > 			clear NEED_RESTART
> > 			same thing we do at that spot now - restart or EINTR
> > 			handle_signal(...)
> > 			...
> > 			return;
> > 		}
> > 	}
> > 	/* no handler */
> > 	if (test_and_clear_...(RESTORE_SIGMASK))
> > 		set_current_blocked(&current->saved_sigmask);
> > }
> > and in asm glue, *after* checking for SIGPENDING/NOTIFY_RESUME, check
> > NEED_RESTART and if it's set do what we currently do for restarts on
> > handlerless signal.
>
> You need to be careful with inferior calls there. gdb likes to play games
> with the registers inside the get_signal_to_deliver call, it wants to be
> able to jump out of an interrupted system call, do its inferior call in
> the debugee and then return to the interrupted system call.

Ah.

> You would have to to read, modify & restore the NEED_RESTART flag in gdb
> over an inferior call.

I am not sure, but perhaps this is not really needed...

But at least this means that "if (we have any business doing restarts)"
above is meaningless before get_signal_to_deliver().


And I am confused, off-topic question... How it is possible to
"then return to the interrupted system call" if that system call
returned -ERESTART_RESTARTBLOCK but the inferior call in turn
does the system call which changes restart_block->fn/etc ?

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-26 13:52                                                       ` Oleg Nesterov
@ 2012-04-26 14:31                                                         ` Martin Schwidefsky
  0 siblings, 0 replies; 91+ messages in thread
From: Martin Schwidefsky @ 2012-04-26 14:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Linus Torvalds, linux-arch, linux-kernel, Russell King,
	Tejun Heo, Arnd Bergmann, Roland McGrath

On Thu, 26 Apr 2012 15:52:55 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 04/26, Martin Schwidefsky wrote:
> > You need to be careful with inferior calls there. gdb likes to play games
> > with the registers inside the get_signal_to_deliver call, it wants to be
> > able to jump out of an interrupted system call, do its inferior call in
> > the debugee and then return to the interrupted system call.
> 
> Ah.
> 
> > You would have to to read, modify & restore the NEED_RESTART flag in gdb
> > over an inferior call.
> 
> I am not sure, but perhaps this is not really needed...
> 
> But at least this means that "if (we have any business doing restarts)"
> above is meaningless before get_signal_to_deliver().
> 
> 
> And I am confused, off-topic question... How it is possible to
> "then return to the interrupted system call" if that system call
> returned -ERESTART_RESTARTBLOCK but the inferior call in turn
> does the system call which changes restart_block->fn/etc ?

Returning from an inferior gdb call to a system call that needs to be
restarted with -ERESTART_RESTARTBLOCK is broken on all architectures
as far as I can tell. And it would be hard to fix it, basically you
have to save and restore the restart block.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-24  7:26                               ` Al Viro
  2012-04-25  3:06                                 ` Al Viro
@ 2012-04-26 18:37                                 ` Oleg Nesterov
  2012-04-26 23:19                                   ` Al Viro
  1 sibling, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-26 18:37 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-arch, linux-kernel

On 04/24, Al Viro wrote:
>
> Untested variants pushed into signal.git#master; will test tomorrow.  In
> the meanwhile, any code review (and testing of the entire thing on as many
> targets as possible) would be very welcome.

I started to read these patches today, will continue tomorrow. Somehow
I got stuck at f1fcb14721b4f1e65387d4563311f15f0bd33684, please see the
question below. And a couple of minor nits.




b4b620b87fd2f388cf4c13fea21f31bed7c9a1b0 new helper: sigsuspend()

Looks obviously correct but I do not understand this chunk in kernel.c,

	+ #ifndef __ARCH_HAS_SYS_RT_SIGSUSPEND
	+ /**
	+  *  sys_rt_sigsuspend - replace the signal mask for a value with the
	+
	 #ifdef __ARCH_WANT_SYS_RT_SIGSUSPEND

So this checks the (never used/defined?) __ARCH_HAS_SYS_RT_SIGSUSPEND
but comments out __ARCH_WANT_SYS_RT_SIGSUSPEND. Looks like a typo.





6b78370886e4f61187404b7737a831281bde35e8 xtensa: switch to generic rt_sigsuspend(2)
and
d978bf9dd41728dd60fe2269493fe8f21d28eef3 h8300: switch to saved_sigmask-based sigsuspend/rt_sigsuspend

(off-topic, but do_signal()->try_to_freeze() looks unneeded and wrong)

	+       /* If there's no signal to deliver, we just restore the saved mask.  */
	+       if (test_thread_flag(TIF_RESTORE_SIGMASK)) {
	+               clear_thread_flag(TIF_RESTORE_SIGMASK);
	+               sigprocmask(SIG_SETMASK, &current->saved_sigmask, NULL);
			^^^^^^^^^^^

set_current_blocked(&current->saved_sigmask) looks better.






f1fcb14721b4f1e65387d4563311f15f0bd33684 alpha: tidy signal delivery up

Everything looks fine, but I have the off-topic question. The changelog
says:

	* checking for TIF_SIGPENDING is enough; set_restart_sigmask() sets this
	one as well.

Agreed, but why set_restore_sigmask() sets TIF_SIGPENDING? It should be
never used without signal_pending() == T.

IOW, do you know a reason why this patch

	--- x/arch/x86/include/asm/thread_info.h
	+++ x/arch/x86/include/asm/thread_info.h
	@@ -264,7 +264,7 @@ static inline void set_restore_sigmask(v
	 {
		struct thread_info *ti = current_thread_info();
		ti->status |= TS_RESTORE_SIGMASK;
	-	set_bit(TIF_SIGPENDING, (unsigned long *)&ti->flags);
	+	WARN_ON(!test_bit(TIF_SIGPENDING, (unsigned long *)&ti->flags));
	 }
	 
	 static inline bool is_ia32_task(void)

is not correct?

OK, say, sys_sigsuspend() does

	current->state = TASK_INTERRUPTIBLE;
	schedule();
	set_restore_sigmask();
	return -ERESTARTNOHAND;

so set_bit(TIF_SIGPENDING) saves us from the "spurious wakeup". But is
it really possible?

We had the bugs in ptrace some time ago (and iirc this is why sys_pause
checks signal_pending), but is there any reason today why the
TASK_INTERRUPTIBLE task can return from schedule() without SIGPENDING?
(of course, ignoring the case when this task was added to some
 wait_queue_head_t).


I am just curious. Perhaps set_restore_sigmask() sets SIGPENDING just
to be safer, but otoh this can hide the problem.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-26 18:37                                 ` Oleg Nesterov
@ 2012-04-26 23:19                                   ` Al Viro
  2012-04-27 17:24                                     ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-26 23:19 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linus Torvalds, linux-arch, linux-kernel

On Thu, Apr 26, 2012 at 08:37:42PM +0200, Oleg Nesterov wrote:
> b4b620b87fd2f388cf4c13fea21f31bed7c9a1b0 new helper: sigsuspend()
> 
> Looks obviously correct but I do not understand this chunk in kernel.c,
> 
> 	+ #ifndef __ARCH_HAS_SYS_RT_SIGSUSPEND
> 	+ /**
> 	+  *  sys_rt_sigsuspend - replace the signal mask for a value with the
> 	+
> 	 #ifdef __ARCH_WANT_SYS_RT_SIGSUSPEND
> 
> So this checks the (never used/defined?) __ARCH_HAS_SYS_RT_SIGSUSPEND
> but comments out __ARCH_WANT_SYS_RT_SIGSUSPEND. Looks like a typo.

	Buggered manual cherry-pick - earlier there was a patch inverting
__ARCH_WANT_SYS_RT_SIGSUSPEND (only mips did _not_ have it after the
whole series, and only because it has sys_rt_sigsuspend() of its own
with unusual prototype).  I ended up dropping it and mishandled conflict
resolution.  Fixed.

> 6b78370886e4f61187404b7737a831281bde35e8 xtensa: switch to generic rt_sigsuspend(2)
> and
> d978bf9dd41728dd60fe2269493fe8f21d28eef3 h8300: switch to saved_sigmask-based sigsuspend/rt_sigsuspend
> 
> (off-topic, but do_signal()->try_to_freeze() looks unneeded and wrong)

Yes, get_signal_to_deliver() will do it.  I'd killed some instances in the
last round of signal fixes (2010), but never got around to doing that for
all architectures.

> 	+       /* If there's no signal to deliver, we just restore the saved mask.  */
> 	+       if (test_thread_flag(TIF_RESTORE_SIGMASK)) {
> 	+               clear_thread_flag(TIF_RESTORE_SIGMASK);
> 	+               sigprocmask(SIG_SETMASK, &current->saved_sigmask, NULL);
> 			^^^^^^^^^^^
> 
> set_current_blocked(&current->saved_sigmask) looks better.

In principle, yes.  FWIW, I think that the entire thing should be a helper
to go along with set_restore_sigmask().  With
	if (test_and_clear_thread_flag(TIF_RESTORE_SIGMASK))
		set_current_blocked(&current->saved_sigmask);
as default implementation.  The only question is where should it go -
asm/thread_info.h is not a good place due to header dependencies.
Kinda-sorta solution - in thread_info.h
{set,clear,test,test_and_clear}_restore_sigmask()
and in linux/signal.h
#ifdef HAVE_SET_RESTORE_SIGMASK
static inline void restore_saved_sigmask(void)
{
	if (test_and_clear_restore_sigmask())
		set_current_blocked(&current->saved_mask);
}
static inline sigset_t *sigmask_to_save(void)
{
	struct sigset *res = &current->blocked;
	if (unlikely(test_restore_sigmask()))
		res = current->saved_sigmask;
	return res;
}
#endif

Speaking of other helpers, pulling ppc restore_sigmask() into signal.h
(as static inline) might be a good idea.  Every sigreturn instance is
open-coding it...  We need saner names, though; this set is too easy to
confuse with each other.

> f1fcb14721b4f1e65387d4563311f15f0bd33684 alpha: tidy signal delivery up
> 
> Everything looks fine, but I have the off-topic question. The changelog
> says:
> 
> 	* checking for TIF_SIGPENDING is enough; set_restart_sigmask() sets this
> 	one as well.
> 
> Agreed, but why set_restore_sigmask() sets TIF_SIGPENDING? It should be
> never used without signal_pending() == T.

Umm...  Probably, and as far as I can see all callers are only reached if
we have SIGPENDING, but that requires at least documenting what's going on.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-20 22:35                       ` Linus Torvalds
@ 2012-04-27  7:35                         ` Kasatkin, Dmitry
  2012-04-27 17:34                           ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Kasatkin, Dmitry @ 2012-04-27  7:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Hugh Dickins, linux-fsdevel, James Morris,
	linux-security-module, linux-kernel, David Safford, Mimi Zohar,
	David Miller, Andrew Morton

On Sat, Apr 21, 2012 at 1:35 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Apr 20, 2012 at 3:13 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Kinda-sorta.  I agree that such helpers make sense, but you are too
> > optimistic about the number of such places.  And clusterfuck around
> > mremap() is fairly deep, so propagating all way back to up_write()
> > wont' be fun.
>
> Fair enough.
>
> I'll do the helpers and see how much they get rid of, just because
> looking at all the callers, those helpers seem to be obviously the
> right thing anyway. So even if we don't do anything else, we can
> improve things regardless.
>
> For do_brk(), for example, it looks like do_brk() itself should
> actually be entirely static to mm/mmap.c, because every single caller
> from the outside actually wants the self-locking version.
>
> So plan right now: do "vm_xyzzy()" helper functions that do
> "do_xyzzy()" and take the lock (and do not take the "mm" argument,
> because it had better always be the current one - keep the calling
> convention as simple as possible).
>
>                   Linus

Moi Linus,

Taking the list of files to fput out of do_munmap() and doing it after
unlocking mmap sem
was one of possible solutions when we discussed it with Mimi.
But we were looking for the solution with less modifications of existing kernel.

In fact IMA is already doing some work before taking mmap sem.
---
	ret = ima_file_mmap(filep, prot);
	if (ret < 0)
		return ret;
	down_write(&current->mm->mmap_sem);
--

But have you seen the proposed patch for __fput()?
[PATCH v4 10/12] ima: defer calling __fput()

It defers only of course the last AND mmap_sem is locked AND open for write.

	if (current->mm && rwsem_is_locked(&current->mm->mmap_sem)) {
		if (ima_defer_fput(file) == 0)
			return;
	}

Just 5 out of ~100,000 mmap_sem held fput() calls were deferred.

t.
- Dmitry

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-26 23:19                                   ` Al Viro
@ 2012-04-27 17:24                                     ` Oleg Nesterov
  2012-04-27 17:54                                       ` Oleg Nesterov
  2012-04-27 18:45                                       ` Al Viro
  0 siblings, 2 replies; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-27 17:24 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-arch, linux-kernel

On 04/27, Al Viro wrote:
>
> On Thu, Apr 26, 2012 at 08:37:42PM +0200, Oleg Nesterov wrote:
>
> > 	+       /* If there's no signal to deliver, we just restore the saved mask.  */
> > 	+       if (test_thread_flag(TIF_RESTORE_SIGMASK)) {
> > 	+               clear_thread_flag(TIF_RESTORE_SIGMASK);
> > 	+               sigprocmask(SIG_SETMASK, &current->saved_sigmask, NULL);
> > 			^^^^^^^^^^^
> >
> > set_current_blocked(&current->saved_sigmask) looks better.
>
> In principle, yes.  FWIW, I think that the entire thing should be a helper
> to go along with set_restore_sigmask().  With
> 	if (test_and_clear_thread_flag(TIF_RESTORE_SIGMASK))
> 		set_current_blocked(&current->saved_sigmask);
> as default implementation.  The only question is where should it go -
> asm/thread_info.h is not a good place due to header dependencies.
> Kinda-sorta solution - in thread_info.h
> {set,clear,test,test_and_clear}_restore_sigmask()
> and in linux/signal.h
> #ifdef HAVE_SET_RESTORE_SIGMASK
> static inline void restore_saved_sigmask(void)
> {
> 	if (test_and_clear_restore_sigmask())
> 		set_current_blocked(&current->saved_mask);
> }
> static inline sigset_t *sigmask_to_save(void)
> {
> 	struct sigset *res = &current->blocked;
> 	if (unlikely(test_restore_sigmask()))
> 		res = current->saved_sigmask;
> 	return res;
> }

Perhaps... but test_*_restore_sigmask() depends on TIF_ or TS_

> Speaking of other helpers, pulling ppc restore_sigmask() into signal.h
> (as static inline) might be a good idea.

Agreed.

> Every sigreturn instance is
> open-coding it...

Not only sigreturn. Just look at sigprocmask(SIG_SETMASK) callers, almost
all should be converted to use set_current_blocked(). For example, pselect.

> We need saner names, though; this set is too easy to
> confuse with each other.

Say, set_current_blocked_careful().

> > Agreed, but why set_restore_sigmask() sets TIF_SIGPENDING? It should be
> > never used without signal_pending() == T.
>
> Umm...  Probably, and as far as I can see all callers are only reached if
> we have SIGPENDING, but that requires at least documenting what's going on.

WARN_ON(!test_bit(TIF_SIGPENDING)) looks like the perfect documentation ;)




OK, I have read the whole series. You do understand that I can't comment
the changes in asm, but I managed to convince myself I can more or less
understand them and I see nothing wrong.

The only comment I have,
38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
forgets to unexport do_signal().



The last thing. Matt, could you please look at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
you already sent some of these changes (use set_current_blocked/block_sigmask).
Perhaps there are alreay in -mm or linux-next?

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-27  7:35                         ` Kasatkin, Dmitry
@ 2012-04-27 17:34                           ` Al Viro
  2012-04-27 18:52                             ` Kasatkin, Dmitry
  2012-04-30 14:32                             ` Mimi Zohar
  0 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-27 17:34 UTC (permalink / raw)
  To: Kasatkin, Dmitry
  Cc: Linus Torvalds, Hugh Dickins, linux-fsdevel, James Morris,
	linux-security-module, linux-kernel, David Safford, Mimi Zohar,
	David Miller, Andrew Morton

On Fri, Apr 27, 2012 at 10:35:25AM +0300, Kasatkin, Dmitry wrote:

> But have you seen the proposed patch for __fput()?
> [PATCH v4 10/12] ima: defer calling __fput()
> 
> It defers only of course the last AND mmap_sem is locked AND open for write.
> 
> 	if (current->mm && rwsem_is_locked(&current->mm->mmap_sem)) {
> 		if (ima_defer_fput(file) == 0)
> 			return;
> 	}
> 
> Just 5 out of ~100,000 mmap_sem held fput() calls were deferred.

Let me get it straight.
	a) You still ignore all the problems with that described in the
posting right in the beginning of this thread.
	b) You ignore the problems with semantics changes from user-visible
delays of fput() past the return from syscall (described in Linus' posting
upthread - they apply to this "solution" as well).
	c) You seem to consider the fact that this path will be exercised
very rarely, thus making any races on it damn hard to reproduce and debug
as a good thing.  

And as for the sentiment expressed in the beginning of your posting (that
smaller patch size is worth more than clean locking rules, maintainability
of resulting kernel, etc.)...  I'm sorry, but you guys need to decide
what IMA is.  If it's a first-class part of the kernel, you have your
priorities backwards...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 17:24                                     ` Oleg Nesterov
@ 2012-04-27 17:54                                       ` Oleg Nesterov
  2012-05-02 10:37                                         ` Matt Fleming
  2012-04-27 18:45                                       ` Al Viro
  1 sibling, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-27 17:54 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-arch, linux-kernel, Matt Fleming

On 04/27, Oleg Nesterov wrote:
>
> Not only sigreturn. Just look at sigprocmask(SIG_SETMASK) callers, almost
> all should be converted to use set_current_blocked(). For example, pselect.

which btw doesn't need sigsaved.

> The last thing. Matt, could you please look at
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
> you already sent some of these changes (use set_current_blocked/block_sigmask).
> Perhaps there are alreay in -mm or linux-next?

Forgot to add Matt, sorry for noise...

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 17:24                                     ` Oleg Nesterov
  2012-04-27 17:54                                       ` Oleg Nesterov
@ 2012-04-27 18:45                                       ` Al Viro
  2012-04-27 19:14                                         ` Geert Uytterhoeven
                                                           ` (3 more replies)
  1 sibling, 4 replies; 91+ messages in thread
From: Al Viro @ 2012-04-27 18:45 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linus Torvalds, linux-arch, linux-kernel, Roland McGrath

On Fri, Apr 27, 2012 at 07:24:44PM +0200, Oleg Nesterov wrote:
> > static inline sigset_t *sigmask_to_save(void)
> > {
> > 	struct sigset *res = &current->blocked;
> > 	if (unlikely(test_restore_sigmask()))
> > 		res = current->saved_sigmask;
> > 	return res;
> > }
> 
> Perhaps... but test_*_restore_sigmask() depends on TIF_ or TS_

... and is in thread_info.h; you might want to pull it again ;-)
Right now the signal.git#master is at 349b4565ad9e9b891245590319567c0a042046d9

> > Umm...  Probably, and as far as I can see all callers are only reached if
> > we have SIGPENDING, but that requires at least documenting what's going on.
> 
> WARN_ON(!test_bit(TIF_SIGPENDING)) looks like the perfect documentation ;)

OK, I'm convinced.  Done.

BTW, I've a better name than set_current_blocked_carefully(); the thing
is, only 3 callers out of 63 are _not_ immediately preceded by excluding
SIGKILL/SIGSTOP out of the set.  So I've copied the existing variant
to __set_current_blocked() and folded that sigdelsetmask(...) into the
set_current_blocked().

> The only comment I have,
> 38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
> forgets to unexport do_signal().

Meh...  The thing is, there are _two_ of them.  signal_mm.c and signal_no.c
badly need merging, with common stuff moved to signal.c.  I really hate
to see static function that isn't called anywhere in file it's defined in,
leaving one to trace what happens to include that file.  Running into that
in USB code is bloody annoying and I'd rather not add to that pile.  It's
certainly a valid C, but it's hell on casual reader...

	BTW, I'm somewhat tempted to do the following: *ALL* calls of
tracehook_signal_handler() are now immediately preceded by block_signals().
Moreover, tracehook_signal_handler(...., 0) is currently a no-op, so
it could be painlessly added after the remaining block_signals() instances.
How about *folding* block_signals() (along with clear_restore_sigmask())
into tracehook_signal_handler()?  I don't know if anyone has conflicting
plans for that sucker; Roland?

	Even more ambitious variant: take the "setting sigframe up has
failed, send ourselves SIGSEGV" in there as well (adding an extra argument
to tell which one it is).  Hell knows; I know at least one place where
such failure does _not_ lead to SIGSEGV, but there we do do_exit(SIGILL)
right in setup_..._frame() and do_exit() never returns.  There might be
other such suckers...
 
> The last thing. Matt, could you please look at
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
> you already sent some of these changes (use set_current_blocked/block_sigmask).
> Perhaps there are alreay in -mm or linux-next?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-27 17:34                           ` Al Viro
@ 2012-04-27 18:52                             ` Kasatkin, Dmitry
  2012-04-27 19:15                               ` Kasatkin, Dmitry
  2012-04-30 14:32                             ` Mimi Zohar
  1 sibling, 1 reply; 91+ messages in thread
From: Kasatkin, Dmitry @ 2012-04-27 18:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Hugh Dickins, linux-fsdevel, James Morris,
	linux-security-module, linux-kernel, David Safford, Mimi Zohar,
	David Miller, Andrew Morton

On Fri, Apr 27, 2012 at 8:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Apr 27, 2012 at 10:35:25AM +0300, Kasatkin, Dmitry wrote:
>
>> But have you seen the proposed patch for __fput()?
>> [PATCH v4 10/12] ima: defer calling __fput()
>>
>> It defers only of course the last AND mmap_sem is locked AND open for write.
>>
>>       if (current->mm && rwsem_is_locked(&current->mm->mmap_sem)) {
>>               if (ima_defer_fput(file) == 0)
>>                       return;
>>       }
>>
>> Just 5 out of ~100,000 mmap_sem held fput() calls were deferred.
>
> Let me get it straight.
>        a) You still ignore all the problems with that described in the
> posting right in the beginning of this thread.
>        b) You ignore the problems with semantics changes from user-visible
> delays of fput() past the return from syscall (described in Linus' posting
> upthread - they apply to this "solution" as well).
>        c) You seem to consider the fact that this path will be exercised
> very rarely, thus making any races on it damn hard to reproduce and debug
> as a good thing.
>
> And as for the sentiment expressed in the beginning of your posting (that
> smaller patch size is worth more than clean locking rules, maintainability
> of resulting kernel, etc.)...  I'm sorry, but you guys need to decide
> what IMA is.  If it's a first-class part of the kernel, you have your
> priorities backwards...

Hello,

I do not ignore anything.
I said that we were thinking about solution to get the list of file to
fput them after mmap unlock.
And I do understand the issues discussed.
I just wanted to know more opinions on proposed patch.

- Dmitry

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 18:45                                       ` Al Viro
@ 2012-04-27 19:14                                         ` Geert Uytterhoeven
  2012-04-27 19:34                                           ` Al Viro
  2012-04-27 19:42                                         ` Al Viro
                                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 91+ messages in thread
From: Geert Uytterhoeven @ 2012-04-27 19:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel,
	Roland McGrath, Greg Ungerer

On Fri, Apr 27, 2012 at 20:45, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> The only comment I have,
>> 38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
>> forgets to unexport do_signal().
>
> Meh...  The thing is, there are _two_ of them.  signal_mm.c and signal_no.c
> badly need merging, with common stuff moved to signal.c.  I really hate

Greg recently posted a patch to merge them:
http://www.spinics.net/lists/linux-m68k/msg04995.html

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-27 18:52                             ` Kasatkin, Dmitry
@ 2012-04-27 19:15                               ` Kasatkin, Dmitry
  0 siblings, 0 replies; 91+ messages in thread
From: Kasatkin, Dmitry @ 2012-04-27 19:15 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Hugh Dickins, linux-fsdevel, James Morris,
	linux-security-module, linux-kernel, David Safford, Mimi Zohar,
	David Miller, Andrew Morton

On Fri, Apr 27, 2012 at 9:52 PM, Kasatkin, Dmitry
<dmitry.kasatkin@intel.com> wrote:
> On Fri, Apr 27, 2012 at 8:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Apr 27, 2012 at 10:35:25AM +0300, Kasatkin, Dmitry wrote:
>>
>>> But have you seen the proposed patch for __fput()?
>>> [PATCH v4 10/12] ima: defer calling __fput()
>>>
>>> It defers only of course the last AND mmap_sem is locked AND open for write.
>>>
>>>       if (current->mm && rwsem_is_locked(&current->mm->mmap_sem)) {
>>>               if (ima_defer_fput(file) == 0)
>>>                       return;
>>>       }
>>>
>>> Just 5 out of ~100,000 mmap_sem held fput() calls were deferred.
>>
>> Let me get it straight.
>>        a) You still ignore all the problems with that described in the
>> posting right in the beginning of this thread.
>>        b) You ignore the problems with semantics changes from user-visible
>> delays of fput() past the return from syscall (described in Linus' posting
>> upthread - they apply to this "solution" as well).
>>        c) You seem to consider the fact that this path will be exercised
>> very rarely, thus making any races on it damn hard to reproduce and debug
>> as a good thing.
>>
>> And as for the sentiment expressed in the beginning of your posting (that
>> smaller patch size is worth more than clean locking rules, maintainability
>> of resulting kernel, etc.)...  I'm sorry, but you guys need to decide
>> what IMA is.  If it's a first-class part of the kernel, you have your
>> priorities backwards...
>
> Hello,
>
> I do not ignore anything.
> I said that we were thinking about solution to get the list of file to
> fput them after mmap unlock.
> And I do understand the issues discussed.
> I just wanted to know more opinions on proposed patch.
>
> - Dmitry

For sure we are happy to see the discussion about the problem. Thanks
for your attention to it.
And our target to solve it in best possible way.
So let's agree the solution and we will be happy to implement the patch.

- Dmitry

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 19:14                                         ` Geert Uytterhoeven
@ 2012-04-27 19:34                                           ` Al Viro
  2012-04-29 22:51                                             ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-27 19:34 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel,
	Roland McGrath, Greg Ungerer

On Fri, Apr 27, 2012 at 09:14:53PM +0200, Geert Uytterhoeven wrote:
> On Fri, Apr 27, 2012 at 20:45, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >> The only comment I have,
> >> 38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
> >> forgets to unexport do_signal().
> >
> > Meh... ??The thing is, there are _two_ of them. ??signal_mm.c and signal_no.c
> > badly need merging, with common stuff moved to signal.c. ??I really hate
> 
> Greg recently posted a patch to merge them:
> http://www.spinics.net/lists/linux-m68k/msg04995.html

Nice...  BTW, could you comment on
    m68k: don't open-code block_sigmask()
    m68k: use set_current_blocked in sigreturn/rt_sigreturn
    m68k-nommu: do_signal() is only called when returning to user mode
    m68k: add TIF_NOTIFY_RESUME and handle it.
in signal.git tree?  The last one is really interesting...

Just pick the current tree - one that was there yesterday had a dumb typo
in thread_info.h part, so it wouldn't even compile (TIF_NOTIFE_RESUME
defined instead of TIF_NOTIFY_RESUME ;-/)

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 18:45                                       ` Al Viro
  2012-04-27 19:14                                         ` Geert Uytterhoeven
@ 2012-04-27 19:42                                         ` Al Viro
  2012-04-27 20:20                                         ` Roland McGrath
  2012-04-29 16:41                                         ` Oleg Nesterov
  3 siblings, 0 replies; 91+ messages in thread
From: Al Viro @ 2012-04-27 19:42 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linus Torvalds, linux-arch, linux-kernel, Roland McGrath

On Fri, Apr 27, 2012 at 07:45:29PM +0100, Al Viro wrote:

> > The last thing. Matt, could you please look at
> > git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
> > you already sent some of these changes (use set_current_blocked/block_sigmask).
> > Perhaps there are alreay in -mm or linux-next?

BTW, I'll gladly pick Matt's patches in front of that queue (and drop my
duplicates), but I doubt that signal.git is suitable for -next right now.
After some review from linux-arch folks - maybe, but right now it's too
dangerous and untested.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 18:45                                       ` Al Viro
  2012-04-27 19:14                                         ` Geert Uytterhoeven
  2012-04-27 19:42                                         ` Al Viro
@ 2012-04-27 20:20                                         ` Roland McGrath
  2012-04-27 21:12                                           ` Al Viro
  2012-04-29 16:41                                         ` Oleg Nesterov
  3 siblings, 1 reply; 91+ messages in thread
From: Roland McGrath @ 2012-04-27 20:20 UTC (permalink / raw)
  To: Al Viro; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

> 	BTW, I'm somewhat tempted to do the following: *ALL* calls of
> tracehook_signal_handler() are now immediately preceded by block_signals().
> Moreover, tracehook_signal_handler(...., 0) is currently a no-op, so
> it could be painlessly added after the remaining block_signals() instances.
> How about *folding* block_signals() (along with clear_restore_sigmask())
> into tracehook_signal_handler()?  I don't know if anyone has conflicting
> plans for that sucker; Roland?

I'm not actually doing much in the way of kernel work these days so I
certainly don't have any actual plans and it's pretty much all up to Oleg.
I'm also not really following this thread in detail, as I've dropped too
much context in the last year or two to be of a whole lot of help without
spending a lot of time recovering knowledge.  

But I will say that the intent of tracehook_signal_handler has always
been what its kerneldoc says: Call it once when handler setup is
complete (exactly once per signal delivery, so potentially multiple
times before actually returning to user mode).  Though it is indeed a
no-op today when stepping==0, we want its use to continue to conform
to that exact definition so that one day we could add e.g. a
PTRACE_EVENT_SIGNAL_HANDLED feature just by hacking tracehook.h and
not have to go back into every arch's signal code and recover
understanding of how the call is being used.  (It was more than enough
work to do that once when I broke out and documented the tracehook.h
interfaces the first time.)  You know, as if we thought modularity
were a useful notion.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 20:20                                         ` Roland McGrath
@ 2012-04-27 21:12                                           ` Al Viro
  2012-04-27 21:27                                             ` Roland McGrath
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-27 21:12 UTC (permalink / raw)
  To: Roland McGrath; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

On Fri, Apr 27, 2012 at 01:20:02PM -0700, Roland McGrath wrote:

> But I will say that the intent of tracehook_signal_handler has always
> been what its kerneldoc says: Call it once when handler setup is
> complete (exactly once per signal delivery, so potentially multiple
> times before actually returning to user mode).  Though it is indeed a
> no-op today when stepping==0, we want its use to continue to conform
> to that exact definition so that one day we could add e.g. a
> PTRACE_EVENT_SIGNAL_HANDLED feature just by hacking tracehook.h and
> not have to go back into every arch's signal code and recover
> understanding of how the call is being used.  (It was more than enough
> work to do that once when I broke out and documented the tracehook.h
> interfaces the first time.)  You know, as if we thought modularity
> were a useful notion.

OK...  FWIW, it sounds like an argument for using it (or a function in
kernel/signal.c that would call it) on all architectures.

Note that there's an extra complication on alpha and itanic - we have more
than just struct pt_regs * there.  If we care about the rest of registers
(struct switch_stack on alpha), we probably need to do something about that.
Hell knows...  On alpha, in particular, we always get switch_stack argument
of do_notify_resume() et.al. at the constant offset below pt_regs one -
        mov     $sp, $16
        bsr     $1, do_switch_stack
        mov     $sp, $17
is how it's done.  So there we could switch to use of container_of().  On
ia64 we have struct sigscratch with pt_regs embedded into it, so there
container_of() is also reasonable.  I wonder how alpha and itanic deal
with do_coredump(), BTW - it gets pt_regs, but not the rest...

Another fun story: x86 and mips has do_notify_resume() with void *unused as
its second argument.  It used to be sigset_t *oldset and it remains unused
since 2006 or so.  Three years later arch/score went in - with the same
void *unused as the second argument of do_notify_resume().  Gotta love the
cargo-cult programming...

BTW, what's the story with 'cookie' argument of get_signal_to_deliver()?
Everything passes NULL in it and nothing actually looks at the value
passed (it's passed further without being looked at, until it reaches
ptrace_signal_deliver(), which ignores it completely on all architectures).
AFAICS, it's a rudiment of something that was used only on sparc and got
discarded as hopeless in commit 28e6103665301ce60634e8a77f0b657c6cc099de
Author: David S. Miller <davem@davemloft.net>
Date:   Sun May 11 02:07:19 2008 -0700
    sparc: Fix debugger syscall restart interactions.
Is there anything else to it?  If not, I'd rather get rid of that thing...
Frankly, I'm somewhat tempted to add something like
struct k_sigcontext {
	int sig;
	siginfo_t info;
	struct k_sigaction ka;
}; and turn get_signal_to_deliver into
bool get_signal_to_deliver(struct k_sigcontext *ctx, struct pt_regs *regs),
with ctx passed through handle_signal/setup_frame/tracehook_signal_deliver
instead of the same triple shoved into three arguments, in more or less
random order as it's done now.  Perhaps let tracehook_signal_deliver() keep
its type and semantics and introduce
void signal_delivered(struct k_sigcontext *, struct pt_regs *, int stepping)
that would be called by all handle_signal() after successful frame setup
and would call tracehook_... itself, after having done block_sigmask()...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 21:12                                           ` Al Viro
@ 2012-04-27 21:27                                             ` Roland McGrath
  2012-04-27 23:15                                               ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Roland McGrath @ 2012-04-27 21:27 UTC (permalink / raw)
  To: Al Viro; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

> OK...  FWIW, it sounds like an argument for using it (or a function in
> kernel/signal.c that would call it) on all architectures.

That was certainly always the intent.  But I only touched myself the arch's
that I was prepared to test reasonably.

> Note that there's an extra complication on alpha and itanic - we have more
> than just struct pt_regs * there.  If we care about the rest of registers
> (struct switch_stack on alpha), we probably need to do something about that.

The presumed plan so far was just that anybody would use user_regset for
any serious tracer/debugger stuff.  That is certainly much more costly on
ia64 than passing along a few more values.  If anybody ever cares, the
tracehook functions' signatures can always be adjusted in the future.

> I wonder how alpha and itanic deal with do_coredump(), BTW - it gets
> pt_regs, but not the rest...

The expectation was that every arch would eventually switch on
CORE_DUMP_USE_REGSET.  (Looks like so far 12 do and so ~16 don't.)
Certainly avoiding the overhead of user_regset for core dumping is not
worth any new code complexity or extra arch hooks, since that overhead
even on the worst-case arch (ia64) has got to be marginal in comparison
to all the memory-copying and i/o going on.  For imagined potential
tracing/fancier-debugging cases that might be used in high-throughput
ways the question would be different, but such uses still remain to be
implemented.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 21:27                                             ` Roland McGrath
@ 2012-04-27 23:15                                               ` Al Viro
  2012-04-27 23:32                                                 ` Al Viro
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 91+ messages in thread
From: Al Viro @ 2012-04-27 23:15 UTC (permalink / raw)
  To: Roland McGrath; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

On Fri, Apr 27, 2012 at 02:27:29PM -0700, Roland McGrath wrote:
> The expectation was that every arch would eventually switch on
> CORE_DUMP_USE_REGSET.  (Looks like so far 12 do and so ~16 don't.)
> Certainly avoiding the overhead of user_regset for core dumping is not
> worth any new code complexity or extra arch hooks, since that overhead
> even on the worst-case arch (ia64) has got to be marginal in comparison
> to all the memory-copying and i/o going on.  For imagined potential
> tracing/fancier-debugging cases that might be used in high-throughput
> ways the question would be different, but such uses still remain to be
> implemented.

BTW, speaking of tracehook - you do realize that there are some
architectures where check for user_mode() in do_signal() is not
useless?  I.e. there do_notify_resume() _can_ be called when
returning to kernel mode.  And that'll get you tracehook_notify_resume()
called when you probably wouldn't want it to be; key_replace_session_keyring()
call is not desirable in that situation and the stuff Oleg wants
to add in tracehook_notify_resume() won't be happy with that either...

I think all such architectures need that check lifted to do_notify_resume()
(and the rest needs it killed, of course).  Including x86, by the look
of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
!user_mode(regs), but I'm not entirely sure of that.  arm is in about the
same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
there like that (they all check it in the asm glue).  mips probably might,
unless I'm misreading their ret_from_fork()...  Fun.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 23:15                                               ` Al Viro
@ 2012-04-27 23:32                                                 ` Al Viro
  2012-04-29  4:12                                                   ` Al Viro
  2012-04-27 23:50                                                 ` Al Viro
  2012-04-28  2:42                                                 ` [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such Al Viro
  2 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-27 23:32 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

On Sat, Apr 28, 2012 at 12:15:26AM +0100, Al Viro wrote:

> I think all such architectures need that check lifted to do_notify_resume()
> (and the rest needs it killed, of course).  Including x86, by the look
> of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
> !user_mode(regs), but I'm not entirely sure of that.  arm is in about the
> same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
> there like that (they all check it in the asm glue).  mips probably might,
> unless I'm misreading their ret_from_fork()...  Fun.

	Speaking of user_mode() oddities - may I politely inquire what had
been smoked to inspire this (in arch/s390/kernel/signal.c):
        /* This is the legacy signal stack switching. */
        else if (!user_mode(regs) &&
                 !(ka->sa.sa_flags & SA_RESTORER) &&
                 ka->sa.sa_restorer) {
                sp = (unsigned long) ka->sa.sa_restorer;
        }
especially when all paths leading to that come through do_signal() that does
        if (!user_mode(regs))
                return;
on the same regs.  It had been like that since 2.3.99pre8 when s390 went
into the tree...  It looks vaguely similar to i386
                        /* This is the legacy signal stack switching. */
                        if ((regs->ss & 0xffff) != __USER_DS &&
                                !(ka->sa.sa_flags & SA_RESTORER) &&
                                        ka->sa.sa_restorer)
                                sp = (unsigned long) ka->sa.sa_restorer;
but there the code is at least not unreachable...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 23:15                                               ` Al Viro
  2012-04-27 23:32                                                 ` Al Viro
@ 2012-04-27 23:50                                                 ` Al Viro
  2012-04-28 18:51                                                   ` [PATCH] arch/tile: avoid calling do_signal() after fork from a kernel thread Chris Metcalf
  2012-04-28  2:42                                                 ` [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such Al Viro
  2 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-27 23:50 UTC (permalink / raw)
  To: Chris Metcalf; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

On Sat, Apr 28, 2012 at 12:15:26AM +0100, Al Viro wrote:

> I think all such architectures need that check lifted to do_notify_resume()
> (and the rest needs it killed, of course).  Including x86, by the look
> of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
> !user_mode(regs), but I'm not entirely sure of that.  arm is in about the
> same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
> there like that (they all check it in the asm glue).  mips probably might,
> unless I'm misreading their ret_from_fork()...  Fun.

I'm completely unfamiliar with tile assembler, so I might be missing something
there, but AFAICS the following sequence can lead to do_signal() being
called for a kernel thread:
	kernel_thread()
		ret_from_fork() in child
			.Lresume_userspace
				do_work_pending()
					do_signal()
and unlike many other, tile does _not_ check for !user_mode() anywhere
relevant...  Broken?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 23:15                                               ` Al Viro
  2012-04-27 23:32                                                 ` Al Viro
  2012-04-27 23:50                                                 ` Al Viro
@ 2012-04-28  2:42                                                 ` Al Viro
  2012-04-28  3:32                                                   ` Al Viro
  2012-04-29 16:18                                                   ` Oleg Nesterov
  2 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-28  2:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Oleg Nesterov, linux-arch, linux-kernel

On Sat, Apr 28, 2012 at 12:15:26AM +0100, Al Viro wrote:

> I think all such architectures need that check lifted to do_notify_resume()
> (and the rest needs it killed, of course).  Including x86, by the look
> of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
> !user_mode(regs), but I'm not entirely sure of that.  arm is in about the
> same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
> there like that (they all check it in the asm glue).  mips probably might,
> unless I'm misreading their ret_from_fork()...  Fun.

It's actually worse than I thought - we can't just lift that check
to do_notify_resume() and be done with that.  Suppose do_signal() does
get called on e.g. i386 or arm with !user_mode(regs).  What'll happen next?

We have TIF_SIGPENDING set in thread flags - otherwise we wouldn't get
there at all.  OK, do_signal() doesn't do anything and returns.  So does
do_notify_resume().  And we are back into the loop in asm glue, rereading
the thread flags (still unchanged), checking if anything is to be done
(yes, it is - TIF_SIGPENDING is still set), calling do_notify_resume(),
ad infinitum.

Lifting the check into do_notify_resume() will not help at all, obviously.

AFAICS we can get hit by that.  At least i386, arm and mips have
ret_from_fork going straight to "return from syscall" path, no checks for
return to user mode done.  And process created by kernel_thread() will
go there.  It's a narrow race, but AFAICS it's not impossible to hit -
guess the PID of kernel thread to be launched, send it a signal and hit
the moment before it gets to executing the payload.

It's probably not exploitable unless you are root, since most of the
threads are spawned either by kthreadd or by khelper, both running as
root.  OTOH, there might be other places leading to the same fun - e.g.
kernel_execve() goes through the normal syscall return path almost on
everything and in case of failure it returns to kernel mode.  Again,
that one is unlikely to be exploitable (it only happens from root-owned
threads), but I'm not sure if anything else gets there; IIRC, there had
been an effort to get rid of issuing syscalls via int/syscall/trap/whatnot,
but I don't remember how far did it go, especially under arch...

Comments?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-28  2:42                                                 ` [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such Al Viro
@ 2012-04-28  3:32                                                   ` Al Viro
  2012-04-28  3:36                                                     ` Al Viro
  2012-04-29 16:33                                                     ` Oleg Nesterov
  2012-04-29 16:18                                                   ` Oleg Nesterov
  1 sibling, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-28  3:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Oleg Nesterov, linux-arch, linux-kernel

On Sat, Apr 28, 2012 at 03:42:08AM +0100, Al Viro wrote:
> On Sat, Apr 28, 2012 at 12:15:26AM +0100, Al Viro wrote:
> 
> > I think all such architectures need that check lifted to do_notify_resume()
> > (and the rest needs it killed, of course).  Including x86, by the look
> > of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
> > !user_mode(regs), but I'm not entirely sure of that.  arm is in about the
> > same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
> > there like that (they all check it in the asm glue).  mips probably might,
> > unless I'm misreading their ret_from_fork()...  Fun.
> 
> It's actually worse than I thought - we can't just lift that check
> to do_notify_resume() and be done with that.  Suppose do_signal() does
> get called on e.g. i386 or arm with !user_mode(regs).  What'll happen next?
> 
> We have TIF_SIGPENDING set in thread flags - otherwise we wouldn't get
> there at all.  OK, do_signal() doesn't do anything and returns.  So does
> do_notify_resume().  And we are back into the loop in asm glue, rereading
> the thread flags (still unchanged), checking if anything is to be done
> (yes, it is - TIF_SIGPENDING is still set), calling do_notify_resume(),
> ad infinitum.
> 
> Lifting the check into do_notify_resume() will not help at all, obviously.
> 
> AFAICS we can get hit by that.  At least i386, arm and mips have
> ret_from_fork going straight to "return from syscall" path, no checks for
> return to user mode done.  And process created by kernel_thread() will
> go there.  It's a narrow race, but AFAICS it's not impossible to hit -
> guess the PID of kernel thread to be launched, send it a signal and hit
> the moment before it gets to executing the payload.
> 
> It's probably not exploitable unless you are root, since most of the
> threads are spawned either by kthreadd or by khelper, both running as
> root.  OTOH, there might be other places leading to the same fun - e.g.
> kernel_execve() goes through the normal syscall return path almost on
> everything and in case of failure it returns to kernel mode.  Again,
> that one is unlikely to be exploitable (it only happens from root-owned
> threads), but I'm not sure if anything else gets there; IIRC, there had
> been an effort to get rid of issuing syscalls via int/syscall/trap/whatnot,
> but I don't remember how far did it go, especially under arch...

Actually, it looks like on i386 the loop will be broken by checks in
resume_userspace_sig, so the worst thing that might happen would be
a bogus call of tracehook_notify_resume() if it's possible to get there
with TIF_NOTIFY_RESUME for kernel thread.  No such luck on arm, though...
To be honest, I'd rather check for user_mode() before calling
do_notify_resume() and go away to no_work_pending if it's true.  For arm
and i386 that would probably look like this, and I'd really, *really*
like review and comments on that.  amd64 is, AFAICS, careful enough to
avoid hitting do_notify_resume() when returning into the kernel mode -
its implementations of ret_from_fork and kernel_execve take care to avoid
that.

diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 82aaf0a..e147619 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -57,6 +57,9 @@ work_pending:
 	 * TIF_SIGPENDING or TIF_NOTIFY_RESUME must've been set if we got here
 	 */
 	mov	r0, sp				@ 'regs'
+	ldr	r2, [sp, #S_PSR]
+	tst	r2, #15
+	be	no_work_pending
 	mov	r2, why				@ 'syscall'
 	tst	r1, #_TIF_SIGPENDING		@ delivering a signal?
 	movne	why, #0				@ prevent further restarts
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index cd41742..f7b7a1c 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -641,15 +641,6 @@ static void do_signal(struct pt_regs *regs, int syscall)
 	int signr;
 
 	/*
-	 * We want the common case to go fast, which
-	 * is why we may in certain cases get here from
-	 * kernel mode. Just return without doing anything
-	 * if so.
-	 */
-	if (!user_mode(regs))
-		return;
-
-	/*
 	 * If we were from a system call, check for system call restarting...
 	 */
 	if (syscall) {
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 7b784f4..e858462 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -321,7 +321,6 @@ ret_from_exception:
 	preempt_stop(CLBR_ANY)
 ret_from_intr:
 	GET_THREAD_INFO(%ebp)
-resume_userspace_sig:
 #ifdef CONFIG_VM86
 	movl PT_EFLAGS(%esp), %eax	# mix EFLAGS and CS
 	movb PT_CS(%esp), %al
@@ -628,9 +627,13 @@ work_notifysig:				# deal with pending signals and
 					# vm86-space
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
+	movb PT_CS(%esp), %bl
+	andl $SEGMENT_RPL_MASK, %ebx
+	cmpl $USER_RPL, %ebx
+	jb resume_kernel
 	xorl %edx, %edx
 	call do_notify_resume
-	jmp resume_userspace_sig
+	jmp resume_userspace
 
 	ALIGN
 work_notifysig_v86:
@@ -643,9 +646,13 @@ work_notifysig_v86:
 #endif
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
+	movb PT_CS(%esp), %bl
+	andl $SEGMENT_RPL_MASK, %ebx
+	cmpl $USER_RPL, %ebx
+	jb resume_kernel
 	xorl %edx, %edx
 	call do_notify_resume
-	jmp resume_userspace_sig
+	jmp resume_userspace
 END(work_pending)
 
 	# perform syscall exit tracing
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 595969f..c4aa7c5 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -738,16 +738,6 @@ static void do_signal(struct pt_regs *regs)
 	siginfo_t info;
 	int signr;
 
-	/*
-	 * We want the common case to go fast, which is why we may in certain
-	 * cases get here from kernel mode. Just return without doing anything
-	 * if so.
-	 * X86_32: vm86 regs switched out by assembly code before reaching
-	 * here, so testing against kernel CS suffices.
-	 */
-	if (!user_mode(regs))
-		return;
-
 	signr = get_signal_to_deliver(&info, &ka, regs, NULL);
 	if (signr > 0) {
 		/* Whee! Actually deliver the signal.  */

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-28  3:32                                                   ` Al Viro
@ 2012-04-28  3:36                                                     ` Al Viro
  2012-04-29 16:33                                                     ` Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: Al Viro @ 2012-04-28  3:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Oleg Nesterov, linux-arch, linux-kernel

On Sat, Apr 28, 2012 at 04:32:45AM +0100, Al Viro wrote:
> diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
> index 82aaf0a..e147619 100644
> --- a/arch/arm/kernel/entry-common.S
> +++ b/arch/arm/kernel/entry-common.S
> @@ -57,6 +57,9 @@ work_pending:
>  	 * TIF_SIGPENDING or TIF_NOTIFY_RESUME must've been set if we got here
>  	 */
>  	mov	r0, sp				@ 'regs'
> +	ldr	r2, [sp, #S_PSR]
> +	tst	r2, #15
> +	be	no_work_pending

bne, actually.  Apologies for braino...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH] arch/tile: avoid calling do_signal() after fork from a kernel thread
  2012-04-27 23:50                                                 ` Al Viro
@ 2012-04-28 18:51                                                   ` Chris Metcalf
  2012-04-28 20:55                                                     ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Chris Metcalf @ 2012-04-28 18:51 UTC (permalink / raw)
  To: Al Viro; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

Calling interrupt_return will check the privilege of the context we're
returning to avoid the possibility of kernel threads doing any kind
of userspace actions (including signal handling) after a fork.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
---
Al, thanks for noticing this.  I've queued it up for 3.4.

Do you have a case that might provoke the signal behavior in the
unpatched code?  The patched code passes our internal regressions.

 arch/tile/kernel/intvec_32.S |    2 +-
 arch/tile/kernel/intvec_64.S |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index 5d56a1e..d0f48ca 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -1274,7 +1274,7 @@ STD_ENTRY(ret_from_fork)
 	FEEDBACK_REENTER(ret_from_fork)
 	{
 	 movei  r30, 0               /* not an NMI */
-	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
+	 j      interrupt_return
 	}
 	STD_ENDPROC(ret_from_fork)
 
diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index 49d9d66..252924b 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -1120,7 +1120,7 @@ STD_ENTRY(ret_from_fork)
 	FEEDBACK_REENTER(ret_from_fork)
 	{
 	 movei  r30, 0               /* not an NMI */
-	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
+	 j      interrupt_return
 	}
 	STD_ENDPROC(ret_from_fork)
 
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v2] arch/tile: fix up some issues in calling do_work_pending()
  2012-04-29  0:55                                                         ` Al Viro
@ 2012-04-28 18:51                                                           ` Chris Metcalf
  2012-04-29  3:49                                                           ` [PATCH] arch/tile: avoid calling do_signal() after fork from a kernel thread Chris Metcalf
  1 sibling, 0 replies; 91+ messages in thread
From: Chris Metcalf @ 2012-04-28 18:51 UTC (permalink / raw)
  To: Al Viro; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

First, we were at risk of handling thread-info flags, in particular
do_signal(), when returning from kernel space.  This could happen
after a failed kernel_execve(), or when forking a kernel thread.
The fix is to test in do_work_pending() for user_mode() and return
immediately if so; we already had this test for one of the flags,
so I just hoisted it to the top of the function.

Second, if a ptraced process updated the callee-saved registers
in the ptregs struct and then processed another thread-info flag, we
would overwrite the modifications with the original callee-saved
registers.  To fix this, we add a register to note if we've already
saved the registers once, and skip doing it on additional passes
through the loop.  To avoid a performance hit from the couple of
extra instructions involved, I modified the GET_THREAD_INFO() macro
to be guaranteed to be one instruction, then bundled it with adjacent
instructions, yielding an overall net savings.

Reported-By: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
---
 arch/tile/include/asm/thread_info.h |    9 ++++++-
 arch/tile/kernel/intvec_32.S        |   41 +++++++++++++++++++++++-----------
 arch/tile/kernel/intvec_64.S        |   38 +++++++++++++++++++++++--------
 arch/tile/kernel/process.c          |    7 ++++-
 4 files changed, 68 insertions(+), 27 deletions(-)

diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index bc4f562..7594764 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -100,9 +100,14 @@ extern void cpu_idle_on_new_stack(struct thread_info *old_ti,
 
 #else /* __ASSEMBLY__ */
 
-/* how to get the thread information struct from ASM */
+/*
+ * How to get the thread information struct from assembly.
+ * Note that we use different macros since different architectures
+ * have different semantics in their "mm" instruction and we would
+ * like to guarantee that the macro expands to exactly one instruction.
+ */
 #ifdef __tilegx__
-#define GET_THREAD_INFO(reg) move reg, sp; mm reg, zero, LOG2_THREAD_SIZE, 63
+#define EXTRACT_THREAD_INFO(reg) mm reg, zero, LOG2_THREAD_SIZE, 63
 #else
 #define GET_THREAD_INFO(reg) mm reg, sp, zero, LOG2_THREAD_SIZE, 31
 #endif
diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index 5d56a1e..6943515 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -839,6 +839,18 @@ STD_ENTRY(interrupt_return)
 	FEEDBACK_REENTER(interrupt_return)
 
 	/*
+	 * Use r33 to hold whether we have already loaded the callee-saves
+	 * into ptregs.  We don't want to do it twice in this loop, since
+	 * then we'd clobber whatever changes are made by ptrace, etc.
+	 * Get base of stack in r32.
+	 */
+	{
+	 GET_THREAD_INFO(r32)
+	 movei  r33, 0
+	}
+
+.Lretry_work_pending:
+	/*
 	 * Disable interrupts so as to make sure we don't
 	 * miss an interrupt that sets any of the thread flags (like
 	 * need_resched or sigpending) between sampling and the iret.
@@ -848,9 +860,6 @@ STD_ENTRY(interrupt_return)
 	IRQ_DISABLE(r20, r21)
 	TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-	/* Get base of stack in r32; note r30/31 are used as arguments here. */
-	GET_THREAD_INFO(r32)
-
 
 	/* Check to see if there is any work to do before returning to user. */
 	{
@@ -866,16 +875,18 @@ STD_ENTRY(interrupt_return)
 
 	/*
 	 * Make sure we have all the registers saved for signal
-	 * handling or single-step.  Call out to C code to figure out
-	 * exactly what we need to do for each flag bit, then if
-	 * necessary, reload the flags and recheck.
+	 * handling, notify-resume, or single-step.  Call out to C
+	 * code to figure out exactly what we need to do for each flag bit,
+	 * then if necessary, reload the flags and recheck.
 	 */
-	push_extra_callee_saves r0
 	{
 	 PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
-	 jal    do_work_pending
+	 bnz    r33, 1f
 	}
-	bnz     r0, .Lresume_userspace
+	push_extra_callee_saves r0
+	movei   r33, 1
+1:	jal     do_work_pending
+	bnz     r0, .Lretry_work_pending
 
 	/*
 	 * In the NMI case we
@@ -1180,10 +1191,12 @@ handle_syscall:
 	add     r20, r20, tp
 	lw      r21, r20
 	addi    r21, r21, 1
-	sw      r20, r21
+	{
+	 sw     r20, r21
+	 GET_THREAD_INFO(r31)
+	}
 
 	/* Trace syscalls, if requested. */
-	GET_THREAD_INFO(r31)
 	addi	r31, r31, THREAD_INFO_FLAGS_OFFSET
 	lw	r30, r31
 	andi    r30, r30, _TIF_SYSCALL_TRACE
@@ -1362,7 +1375,10 @@ handle_ill:
 3:
 	/* set PC and continue */
 	lw      r26, r24
-	sw      r28, r26
+	{
+	 sw     r28, r26
+	 GET_THREAD_INFO(r0)
+	}
 
 	/*
 	 * Clear TIF_SINGLESTEP to prevent recursion if we execute an ill.
@@ -1370,7 +1386,6 @@ handle_ill:
 	 * need to clear it here and can't really impose on all other arches.
 	 * So what's another write between friends?
 	 */
-	GET_THREAD_INFO(r0)
 
 	addi    r1, r0, THREAD_INFO_FLAGS_OFFSET
 	{
diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index 49d9d66..30ae76e 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -647,6 +647,20 @@ STD_ENTRY(interrupt_return)
 	FEEDBACK_REENTER(interrupt_return)
 
 	/*
+	 * Use r33 to hold whether we have already loaded the callee-saves
+	 * into ptregs.  We don't want to do it twice in this loop, since
+	 * then we'd clobber whatever changes are made by ptrace, etc.
+	 */
+	{
+	 movei  r33, 0
+	 move   r32, sp
+	}
+
+	/* Get base of stack in r32. */
+	EXTRACT_THREAD_INFO(r32)
+
+.Lretry_work_pending:
+	/*
 	 * Disable interrupts so as to make sure we don't
 	 * miss an interrupt that sets any of the thread flags (like
 	 * need_resched or sigpending) between sampling and the iret.
@@ -656,9 +670,6 @@ STD_ENTRY(interrupt_return)
 	IRQ_DISABLE(r20, r21)
 	TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-	/* Get base of stack in r32; note r30/31 are used as arguments here. */
-	GET_THREAD_INFO(r32)
-
 
 	/* Check to see if there is any work to do before returning to user. */
 	{
@@ -674,16 +685,18 @@ STD_ENTRY(interrupt_return)
 
 	/*
 	 * Make sure we have all the registers saved for signal
-	 * handling or single-step.  Call out to C code to figure out
+	 * handling or notify-resume.  Call out to C code to figure out
 	 * exactly what we need to do for each flag bit, then if
 	 * necessary, reload the flags and recheck.
 	 */
-	push_extra_callee_saves r0
 	{
 	 PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
-	 jal    do_work_pending
+	 bnez   r33, 1f
 	}
-	bnez    r0, .Lresume_userspace
+	push_extra_callee_saves r0
+	movei   r33, 1
+1:	jal     do_work_pending
+	bnez    r0, .Lretry_work_pending
 
 	/*
 	 * In the NMI case we
@@ -968,11 +981,16 @@ handle_syscall:
 	shl16insli r20, r20, hw0(irq_stat + IRQ_CPUSTAT_SYSCALL_COUNT_OFFSET)
 	add     r20, r20, tp
 	ld4s    r21, r20
-	addi    r21, r21, 1
-	st4     r20, r21
+	{
+	 addi   r21, r21, 1
+	 move   r31, sp
+	}
+	{
+	 st4    r20, r21
+	 EXTRACT_THREAD_INFO(r31)
+	}
 
 	/* Trace syscalls, if requested. */
-	GET_THREAD_INFO(r31)
 	addi	r31, r31, THREAD_INFO_FLAGS_OFFSET
 	ld	r30, r31
 	andi    r30, r30, _TIF_SYSCALL_TRACE
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 2d5ef61..54e6c64 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -567,6 +567,10 @@ struct task_struct *__sched _switch_to(struct task_struct *prev,
  */
 int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
 {
+	/* If we enter in kernel mode, do nothing and exit the caller loop. */
+	if (!user_mode(regs))
+		return 0;
+
 	if (thread_info_flags & _TIF_NEED_RESCHED) {
 		schedule();
 		return 1;
@@ -589,8 +593,7 @@ int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
 		return 1;
 	}
 	if (thread_info_flags & _TIF_SINGLESTEP) {
-		if ((regs->ex1 & SPR_EX_CONTEXT_1_1__PL_MASK) == 0)
-			single_step_once(regs);
+		single_step_once(regs);
 		return 0;
 	}
 	panic("work_pending: bad flags %#x\n", thread_info_flags);
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH] arch/tile: avoid calling do_signal() after fork from a kernel thread
  2012-04-28 18:51                                                   ` [PATCH] arch/tile: avoid calling do_signal() after fork from a kernel thread Chris Metcalf
@ 2012-04-28 20:55                                                     ` Al Viro
  2012-04-28 21:46                                                       ` Chris Metcalf
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-28 20:55 UTC (permalink / raw)
  To: Chris Metcalf; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

On Sat, Apr 28, 2012 at 02:51:43PM -0400, Chris Metcalf wrote:
> Calling interrupt_return will check the privilege of the context we're
> returning to avoid the possibility of kernel threads doing any kind
> of userspace actions (including signal handling) after a fork.
> 
> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
> ---
> Al, thanks for noticing this.  I've queued it up for 3.4.
> 
> Do you have a case that might provoke the signal behavior in the
> unpatched code?  The patched code passes our internal regressions.
> 
>  arch/tile/kernel/intvec_32.S |    2 +-
>  arch/tile/kernel/intvec_64.S |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
> index 5d56a1e..d0f48ca 100644
> --- a/arch/tile/kernel/intvec_32.S
> +++ b/arch/tile/kernel/intvec_32.S
> @@ -1274,7 +1274,7 @@ STD_ENTRY(ret_from_fork)
>  	FEEDBACK_REENTER(ret_from_fork)
>  	{
>  	 movei  r30, 0               /* not an NMI */
> -	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
> +	 j      interrupt_return
>  	}
>  	STD_ENDPROC(ret_from_fork)

Umm...  I'm not sure that it's correct.  For one thing, ret_from_fork is
used both for kernel threads and for plain old fork(2).  In the latter
case you want .Lresume_userspace, not interrupt_return.  For another,
there's kernel_execve() and if it fails (binary doesn't exist/has wrong
format/etc.) you'll get to .Lresume_userspace with EX1_PL(regs->ex1)
unchanged, i.e. the kernel one...

Frankly, with the way you have that stuff done I'd rather do just this:

int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
{
	if (!user_mode(regs))
		return 0;
	....
}

and be done with that.  Unless I'm seriously misreading your code it'll do
the right thing with no changes to asm glue.  As for the reproducer, just
guess the PID of modprobe when you are e.g. trying to mount a filesystem
with fs driver modular and not loaded; fork(), have parent wait a bit
and call mount(), while the child keeps sending something more or less
innocent (SIGCHLD, for example) to the guessed PID.  And either have
/sbin/modprobe chmod -x before doing that (you'll need to remember to
chmod it back before reboot, of course) or just
mount --bind /dev/null /sbin/modprobe.  Either way, kernel_execve() will
fail.  And if you manage to hit the sucker just as it's being spawned,
you'll get the kernel_thread() codepath as well.

FWIW, I like what you've done with do_work_pending() - it's much cleaner
than usual loops and tests in assembler.  The only question is, what's
going on with
	push_extra_callee_saves r0
you are doing there - seems possibly over the top for situations when
SIGPENDING isn't set and, more seriously, what if you go through that
loop many times?  You slap them again and again into pt_regs, overwriting
anything ptrace() might've done to r34..r51, right?

Smells like something that should be done only once, not on each iteration
through the loop...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH] arch/tile: avoid calling do_signal() after fork from a kernel thread
  2012-04-28 20:55                                                     ` Al Viro
@ 2012-04-28 21:46                                                       ` Chris Metcalf
  2012-04-29  0:55                                                         ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Chris Metcalf @ 2012-04-28 21:46 UTC (permalink / raw)
  To: Al Viro; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

On 4/28/2012 4:55 PM, Al Viro wrote:
> On Sat, Apr 28, 2012 at 02:51:43PM -0400, Chris Metcalf wrote:
>> Calling interrupt_return will check the privilege of the context we're
>> returning to avoid the possibility of kernel threads doing any kind
>> of userspace actions (including signal handling) after a fork.
>>
>> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
>> ---
>> Al, thanks for noticing this.  I've queued it up for 3.4.
>>
>> Do you have a case that might provoke the signal behavior in the
>> unpatched code?  The patched code passes our internal regressions.
>>
>>  arch/tile/kernel/intvec_32.S |    2 +-
>>  arch/tile/kernel/intvec_64.S |    2 +-
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
>> index 5d56a1e..d0f48ca 100644
>> --- a/arch/tile/kernel/intvec_32.S
>> +++ b/arch/tile/kernel/intvec_32.S
>> @@ -1274,7 +1274,7 @@ STD_ENTRY(ret_from_fork)
>>  	FEEDBACK_REENTER(ret_from_fork)
>>  	{
>>  	 movei  r30, 0               /* not an NMI */
>> -	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
>> +	 j      interrupt_return
>>  	}
>>  	STD_ENDPROC(ret_from_fork)
> Umm...  I'm not sure that it's correct.  For one thing, ret_from_fork is
> used both for kernel threads and for plain old fork(2).  In the latter
> case you want .Lresume_userspace, not interrupt_return.

It's OK, since we will jump to .Lresume_userspace from interrupt_return in
the latter case:

STD_ENTRY(interrupt_return)
        /* If we're resuming to kernel space, don't check thread flags. */
        {
         [...]
         PTREGS_PTR(r29, PTREGS_OFFSET_EX1)
        }
        ld      r29, r29
        andi    r29, r29, SPR_EX_CONTEXT_1_1__PL_MASK  /* mask off ICS */
        {
         beqzt  r29, .Lresume_userspace
         [...]
        }

The struct ptregs "ex1" field will reliably tell us whether we came from
kernel or userspace context.  Certainly for fork() this will indicate
userspace, since it's the return register info for the syscall.   And for
kernel_thread() we explicitly set up ex1 to indicate kernel privilege as well.

> For another,
> there's kernel_execve() and if it fails (binary doesn't exist/has wrong
> format/etc.) you'll get to .Lresume_userspace with EX1_PL(regs->ex1)
> unchanged, i.e. the kernel one...

This is a good point.  The current syscall return path goes directly to
.Lresume_userspace, which will screw up kernel syscalls.  I think the right
answer is still to jump to interrupt_return from those cases, though.  In
particular, this gets rid of the existing cruftiness where we have to
document that a local label (.Lresume_userspace) can be the target of jumps
from outside the containing function.

> As for the reproducer, just
> guess the PID of modprobe when you are e.g. trying to mount a filesystem
> with fs driver modular and not loaded; fork(), have parent wait a bit
> and call mount(), while the child keeps sending something more or less
> innocent (SIGCHLD, for example) to the guessed PID.  And either have
> /sbin/modprobe chmod -x before doing that (you'll need to remember to
> chmod it back before reboot, of course) or just
> mount --bind /dev/null /sbin/modprobe.  Either way, kernel_execve() will
> fail.  And if you manage to hit the sucker just as it's being spawned,
> you'll get the kernel_thread() codepath as well.
>
> FWIW, I like what you've done with do_work_pending() - it's much cleaner
> than usual loops and tests in assembler.  The only question is, what's
> going on with
> 	push_extra_callee_saves r0
> you are doing there - seems possibly over the top for situations when
> SIGPENDING isn't set and, more seriously, what if you go through that
> loop many times?  You slap them again and again into pt_regs, overwriting
> anything ptrace() might've done to r34..r51, right?

Yes, that's a good observation.  I should hoist the push of callee-saves to
before the loop.  I'll put out a new patch that incorporates both of those
changes.

Thanks!

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH] arch/tile: avoid calling do_signal() after fork from a kernel thread
  2012-04-28 21:46                                                       ` Chris Metcalf
@ 2012-04-29  0:55                                                         ` Al Viro
  2012-04-28 18:51                                                           ` [PATCH v2] arch/tile: fix up some issues in calling do_work_pending() Chris Metcalf
  2012-04-29  3:49                                                           ` [PATCH] arch/tile: avoid calling do_signal() after fork from a kernel thread Chris Metcalf
  0 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-29  0:55 UTC (permalink / raw)
  To: Chris Metcalf; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

On Sat, Apr 28, 2012 at 05:46:13PM -0400, Chris Metcalf wrote:

> It's OK, since we will jump to .Lresume_userspace from interrupt_return in
> the latter case:
> 
> STD_ENTRY(interrupt_return)
>         /* If we're resuming to kernel space, don't check thread flags. */
>         {
>          [...]
>          PTREGS_PTR(r29, PTREGS_OFFSET_EX1)
>         }
>         ld      r29, r29
>         andi    r29, r29, SPR_EX_CONTEXT_1_1__PL_MASK  /* mask off ICS */
>         {
>          beqzt  r29, .Lresume_userspace
>          [...]
>         }
> 
> The struct ptregs "ex1" field will reliably tell us whether we came from
> kernel or userspace context.  Certainly for fork() this will indicate
> userspace, since it's the return register info for the syscall.   And for
> kernel_thread() we explicitly set up ex1 to indicate kernel privilege as well.
> 
> > For another,
> > there's kernel_execve() and if it fails (binary doesn't exist/has wrong
> > format/etc.) you'll get to .Lresume_userspace with EX1_PL(regs->ex1)
> > unchanged, i.e. the kernel one...
> 
> This is a good point.  The current syscall return path goes directly to
> .Lresume_userspace, which will screw up kernel syscalls.  I think the right
> answer is still to jump to interrupt_return from those cases, though.  In
> particular, this gets rid of the existing cruftiness where we have to
> document that a local label (.Lresume_userspace) can be the target of jumps
> from outside the containing function.

Point, but...  Are you sure you want to add extra work to a very hot path?
Leaving the "we don't have any pending work to do" unburdened by that check
would reduce the overhead a lot.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH] arch/tile: avoid calling do_signal() after fork from a kernel thread
  2012-04-29  0:55                                                         ` Al Viro
  2012-04-28 18:51                                                           ` [PATCH v2] arch/tile: fix up some issues in calling do_work_pending() Chris Metcalf
@ 2012-04-29  3:49                                                           ` Chris Metcalf
  1 sibling, 0 replies; 91+ messages in thread
From: Chris Metcalf @ 2012-04-29  3:49 UTC (permalink / raw)
  To: Al Viro; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

On 4/28/2012 8:55 PM, Al Viro wrote:
> On Sat, Apr 28, 2012 at 05:46:13PM -0400, Chris Metcalf wrote:
>>> For another,
>>> there's kernel_execve() and if it fails (binary doesn't exist/has wrong
>>> format/etc.) you'll get to .Lresume_userspace with EX1_PL(regs->ex1)
>>> unchanged, i.e. the kernel one...
>> This is a good point.  The current syscall return path goes directly to
>> .Lresume_userspace, which will screw up kernel syscalls.  I think the right
>> answer is still to jump to interrupt_return from those cases, though.  In
>> particular, this gets rid of the existing cruftiness where we have to
>> document that a local label (.Lresume_userspace) can be the target of jumps
>> from outside the containing function.
> Point, but...  Are you sure you want to add extra work to a very hot path?
> Leaving the "we don't have any pending work to do" unburdened by that check
> would reduce the overhead a lot.

I suppose that's plausible; it's only 5 cycles of work (including the one
cache-hot load), but might also be an extra icache line load, which could
be quite a bit slower.

I think the way to do this may be to both take your suggestion about
checking user_mode() in do_work_pending(), and also handle the looping over
the flag bits there as well.  Then we can load the caller-saves once, prior
to calling do_work_pending(), and when we return we're guaranteed to have
interrupts disabled and TIF_ALLWORK clear.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 23:32                                                 ` Al Viro
@ 2012-04-29  4:12                                                   ` Al Viro
  2012-04-30  8:06                                                     ` Martin Schwidefsky
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-29  4:12 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

On Sat, Apr 28, 2012 at 12:32:35AM +0100, Al Viro wrote:
> On Sat, Apr 28, 2012 at 12:15:26AM +0100, Al Viro wrote:
> 
> > I think all such architectures need that check lifted to do_notify_resume()
> > (and the rest needs it killed, of course).  Including x86, by the look
> > of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
> > !user_mode(regs), but I'm not entirely sure of that.  arm is in about the
> > same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
> > there like that (they all check it in the asm glue).  mips probably might,
> > unless I'm misreading their ret_from_fork()...  Fun.
> 
> 	Speaking of user_mode() oddities - may I politely inquire what had
> been smoked to inspire this (in arch/s390/kernel/signal.c):
>         /* This is the legacy signal stack switching. */
>         else if (!user_mode(regs) &&
>                  !(ka->sa.sa_flags & SA_RESTORER) &&
>                  ka->sa.sa_restorer) {
>                 sp = (unsigned long) ka->sa.sa_restorer;
>         }
> especially when all paths leading to that come through do_signal() that does
>         if (!user_mode(regs))
>                 return;
> on the same regs.  It had been like that since 2.3.99pre8 when s390 went
> into the tree...  It looks vaguely similar to i386
>                         /* This is the legacy signal stack switching. */
>                         if ((regs->ss & 0xffff) != __USER_DS &&
>                                 !(ka->sa.sa_flags & SA_RESTORER) &&
>                                         ka->sa.sa_restorer)
>                                 sp = (unsigned long) ka->sa.sa_restorer;
> but there the code is at least not unreachable...

While we are at it, can we *ever* reach do_signal() (nevermind deep in its
guts) with !user_mode(regs)?
AFAICS, for 31bit possible paths are:
do_signal()
	<- sysc_sigpending
		<- sysc_work
			<- sysc_tif, having checked for user_mode(%r11)
	<- io_sigpending
		<- io_work_tif
			<- io_work_user
				<- io_work, having checked for user_mode(%r11)

and identical for 64bit.  *All* paths into do_signal() go through
        tm      __PT_PSW+1(%r11),0x01   # returning to user ?
and proceed towards do_signal() only if the bit is set.  Which is precisely
what user_mode(%r11) is...

What the hell is going on in that code?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-28  2:42                                                 ` [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such Al Viro
  2012-04-28  3:32                                                   ` Al Viro
@ 2012-04-29 16:18                                                   ` Oleg Nesterov
  2012-04-29 18:05                                                     ` Al Viro
  1 sibling, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-29 16:18 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-arch, linux-kernel

First of all, let me repeat I am useless when it comes to the low-level
or asm code. I can be easily wrong, and I am going to ask the questions.

On 04/28, Al Viro wrote:
>
> It's actually worse than I thought - we can't just lift that check
> to do_notify_resume() and be done with that.  Suppose do_signal() does
> get called on e.g. i386 or arm with !user_mode(regs).  What'll happen next?
>
> We have TIF_SIGPENDING set in thread flags - otherwise we wouldn't get
> there at all.  OK, do_signal() doesn't do anything and returns.  So does
> do_notify_resume().  And we are back into the loop in asm glue, rereading
> the thread flags (still unchanged), checking if anything is to be done
> (yes, it is - TIF_SIGPENDING is still set), calling do_notify_resume(),
> ad infinitum.
>
> Lifting the check into do_notify_resume() will not help at all, obviously.
>
> AFAICS we can get hit by that.

Please look at 29a2e2836ff9ea65a603c89df217f4198973a74f
x86-32: Fix endless loop when processing signals for kernel tasks

> At least i386, arm and mips have
> ret_from_fork going straight to "return from syscall" path, no checks for
> return to user mode done.  And process created by kernel_thread() will
> go there.

Looks like, the patch above fixes that.

But, we call do_notify_resume() first, it would be nice to avoid this
and remove the user_mode() check in do_signal() or lift into
do_notify_resume().

> It's a narrow race, but AFAICS it's not impossible to hit -
> guess the PID of kernel thread to be launched, send it a signal and hit
> the moment before it gets to executing the payload.

Yes. But note that the kernel threads run with all signals ignored.

This is still possible, but a kernel thread should do allow_signal()
and then call kernel_thread() (not kthread_create).



Question. So far I know that on x86 do_notify_resume() && !user_mode()
is only possible on 32bit system, and the possible callers are
ret_from_fork or kernel_execve (if it fails).

Plus, perhaps CONFIG_VM86 makes a difference?

Could you please clarify?

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-28  3:32                                                   ` Al Viro
  2012-04-28  3:36                                                     ` Al Viro
@ 2012-04-29 16:33                                                     ` Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-29 16:33 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-arch, linux-kernel

Ah, I didn't notice this email...

On 04/28, Al Viro wrote:
>
> Actually, it looks like on i386 the loop will be broken by checks in
> resume_userspace_sig,

Yes,

> so the worst thing that might happen would be
> a bogus call of tracehook_notify_resume() if it's possible to get there
> with TIF_NOTIFY_RESUME for kernel thread.

Afaics, the kernel should never have TIF_NOTIFY_RESUME. Except (afaics!)
a user-space task with TIF_NOTIFY_RESUME does kernel_thread(), and this
flag is copied by setup_thread_stack(). But this should be forbidden and
we are going to kill daemonize(), probably it already has no callers.

> To be honest, I'd rather check for user_mode() before calling
> do_notify_resume() and go away to no_work_pending if it's true.

Agreed! I tried to suggest this when 29a2e283 was discussed, but my asm
skills are close to zero.

> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -321,7 +321,6 @@ ret_from_exception:
>  	preempt_stop(CLBR_ANY)
>  ret_from_intr:
>  	GET_THREAD_INFO(%ebp)
> -resume_userspace_sig:
>  #ifdef CONFIG_VM86
>  	movl PT_EFLAGS(%esp), %eax	# mix EFLAGS and CS
>  	movb PT_CS(%esp), %al
> @@ -628,9 +627,13 @@ work_notifysig:				# deal with pending signals and
>  					# vm86-space
>  	TRACE_IRQS_ON
>  	ENABLE_INTERRUPTS(CLBR_NONE)
> +	movb PT_CS(%esp), %bl
> +	andl $SEGMENT_RPL_MASK, %ebx
> +	cmpl $USER_RPL, %ebx
> +	jb resume_kernel
>  	xorl %edx, %edx
>  	call do_notify_resume
> -	jmp resume_userspace_sig
> +	jmp resume_userspace
>  
>  	ALIGN
>  work_notifysig_v86:
> @@ -643,9 +646,13 @@ work_notifysig_v86:
>  #endif
>  	TRACE_IRQS_ON
>  	ENABLE_INTERRUPTS(CLBR_NONE)
> +	movb PT_CS(%esp), %bl
> +	andl $SEGMENT_RPL_MASK, %ebx
> +	cmpl $USER_RPL, %ebx
> +	jb resume_kernel
>  	xorl %edx, %edx
>  	call do_notify_resume
> -	jmp resume_userspace_sig
> +	jmp resume_userspace
>  END(work_pending)
>  
>  	# perform syscall exit tracing
> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index 595969f..c4aa7c5 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -738,16 +738,6 @@ static void do_signal(struct pt_regs *regs)
>  	siginfo_t info;
>  	int signr;
>  
> -	/*
> -	 * We want the common case to go fast, which is why we may in certain
> -	 * cases get here from kernel mode. Just return without doing anything
> -	 * if so.
> -	 * X86_32: vm86 regs switched out by assembly code before reaching
> -	 * here, so testing against kernel CS suffices.
> -	 */
> -	if (!user_mode(regs))
> -		return;
> -

Can't review, but like very much ;)

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 18:45                                       ` Al Viro
                                                           ` (2 preceding siblings ...)
  2012-04-27 20:20                                         ` Roland McGrath
@ 2012-04-29 16:41                                         ` Oleg Nesterov
  2012-04-29 18:09                                           ` Al Viro
  3 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-29 16:41 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-arch, linux-kernel, Roland McGrath

On 04/27, Al Viro wrote:
>
> 	BTW, I'm somewhat tempted to do the following: *ALL* calls of
> tracehook_signal_handler() are now immediately preceded by block_signals().
> Moreover, tracehook_signal_handler(...., 0) is currently a no-op, so
> it could be painlessly added after the remaining block_signals() instances.
> How about *folding* block_signals() (along with clear_restore_sigmask())
> into tracehook_signal_handler()?

Oh, please no. Imho, these two have nothing to do with each other.

Besides, at least on x86 tracehook_signal_handler's logic is not exactly
right and should be fixed.

And we are going to kill tracehook.h. While personally I do not think
this is the good idea, but the matter of fact is that tracehooks are
already destroyed.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-29 16:18                                                   ` Oleg Nesterov
@ 2012-04-29 18:05                                                     ` Al Viro
  2012-05-01  4:31                                                       ` Al Viro
  2012-05-02 18:30                                                       ` Oleg Nesterov
  0 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2012-04-29 18:05 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linus Torvalds, linux-arch, linux-kernel, Thomas Gleixner

On Sun, Apr 29, 2012 at 06:18:18PM +0200, Oleg Nesterov wrote:

> Please look at 29a2e2836ff9ea65a603c89df217f4198973a74f
> x86-32: Fix endless loop when processing signals for kernel tasks
> 
> > At least i386, arm and mips have
> > ret_from_fork going straight to "return from syscall" path, no checks for
> > return to user mode done.  And process created by kernel_thread() will
> > go there.
> 
> Looks like, the patch above fixes that.

Yes, found that shortly after posting.  No such luck for arm, though...

> But, we call do_notify_resume() first, it would be nice to avoid this
> and remove the user_mode() check in do_signal() or lift into
> do_notify_resume().

See the proposed patch.  Does exactly that - lifts all the way to caller
of do_notify_resume(), buggers off if test fits.

> Question. So far I know that on x86 do_notify_resume() && !user_mode()
> is only possible on 32bit system, and the possible callers are
> ret_from_fork or kernel_execve (if it fails).
> 
> Plus, perhaps CONFIG_VM86 makes a difference?
> 
> Could you please clarify?

VM86 doesn't make a difference; we form normal pt_regs for it in case
we have a pending signal, but after that has been done, we don't need
to care.  Look:
	* NOTIFY_RESUME handling doesn't need to be done unless we are
returning to userland.  IOW, the first step is to lift that if (!user_mode(...
into do_notify_resume().  Agreed?
	* Now, if do_notify_resume() does nothing in case !user_mode(regs),
let's lift that check to (32bit) caller.  What we have right now is
	do_notify_resume(%esp, NULL, %ecx)
	goto resume_userspace_sig;
resume_userspace_sig:
	if (!user_mode_vm(%esp))
		goto resume_kernel;
resume_userspace:
So after lifting the check we get
	if (user_mode(%esp))
		do_notify_resume(%esp, NULL, %ecx)
	goto resume_userspace_sig;
resume_userspace_sig:
	if (!user_mode_vm(%esp))
		goto resume_kernel;
resume_userspace:
but user_mode(regs) being true means that user_mode_vm(regs) is also true,
so this code is equivalent to
	if (!user_mode(%esp))
		goto resume_kernel;
	do_notify_resume(%esp, NULL, %ecx)
	goto resume_userspace;
(with stuff around resume_userspace_sig left without changes).

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 7b784f4..e858462 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -321,7 +321,6 @@ ret_from_exception:
 	preempt_stop(CLBR_ANY)
 ret_from_intr:
 	GET_THREAD_INFO(%ebp)
-resume_userspace_sig:
 #ifdef CONFIG_VM86
 	movl PT_EFLAGS(%esp), %eax	# mix EFLAGS and CS
 	movb PT_CS(%esp), %al
@@ -628,9 +627,13 @@ work_notifysig:				# deal with pending signals and
 					# vm86-space
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
+	movb PT_CS(%esp), %bl
+	andl $SEGMENT_RPL_MASK, %ebx
+	cmpl $USER_RPL, %ebx
+	jb resume_kernel
 	xorl %edx, %edx
 	call do_notify_resume
-	jmp resume_userspace_sig
+	jmp resume_userspace
 
 	ALIGN
 work_notifysig_v86:
@@ -643,9 +646,13 @@ work_notifysig_v86:
 #endif
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
+	movb PT_CS(%esp), %bl
+	andl $SEGMENT_RPL_MASK, %ebx
+	cmpl $USER_RPL, %ebx
+	jb resume_kernel
 	xorl %edx, %edx
 	call do_notify_resume
-	jmp resume_userspace_sig
+	jmp resume_userspace
 END(work_pending)
 
 	# perform syscall exit tracing
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 595969f..c4aa7c5 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -738,16 +738,6 @@ static void do_signal(struct pt_regs *regs)
 	siginfo_t info;
 	int signr;
 
-	/*
-	 * We want the common case to go fast, which is why we may in certain
-	 * cases get here from kernel mode. Just return without doing anything
-	 * if so.
-	 * X86_32: vm86 regs switched out by assembly code before reaching
-	 * here, so testing against kernel CS suffices.
-	 */
-	if (!user_mode(regs))
-		return;
-
 	signr = get_signal_to_deliver(&info, &ka, regs, NULL);
 	if (signr > 0) {
 		/* Whee! Actually deliver the signal.  */

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-29 16:41                                         ` Oleg Nesterov
@ 2012-04-29 18:09                                           ` Al Viro
  2012-04-29 18:25                                             ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-29 18:09 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linus Torvalds, linux-arch, linux-kernel, Roland McGrath

On Sun, Apr 29, 2012 at 06:41:55PM +0200, Oleg Nesterov wrote:
> On 04/27, Al Viro wrote:
> >
> > 	BTW, I'm somewhat tempted to do the following: *ALL* calls of
> > tracehook_signal_handler() are now immediately preceded by block_signals().
> > Moreover, tracehook_signal_handler(...., 0) is currently a no-op, so
> > it could be painlessly added after the remaining block_signals() instances.
> > How about *folding* block_signals() (along with clear_restore_sigmask())
> > into tracehook_signal_handler()?
> 
> Oh, please no. Imho, these two have nothing to do with each other.
> 
> Besides, at least on x86 tracehook_signal_handler's logic is not exactly
> right and should be fixed.

Details, please...

> And we are going to kill tracehook.h. While personally I do not think
> this is the good idea, but the matter of fact is that tracehooks are
> already destroyed.

See signal.git#master.  I ended up with
	signal_delivered(<tracehook_signal_handler arguments>)
and that sucker does both things.  The funny thing is, block_sigmask()
(apologies for typo above) is not used anywhere else...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-29 18:09                                           ` Al Viro
@ 2012-04-29 18:25                                             ` Oleg Nesterov
  0 siblings, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2012-04-29 18:25 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-arch, linux-kernel, Roland McGrath

On 04/29, Al Viro wrote:
>
> On Sun, Apr 29, 2012 at 06:41:55PM +0200, Oleg Nesterov wrote:
> >
> > Besides, at least on x86 tracehook_signal_handler's logic is not exactly
> > right and should be fixed.
>
> Details, please...

See http://marc.info/?t=127550678000005 and
https://bugzilla.kernel.org/show_bug.cgi?id=16061

	- the tracee reports a signal SIG

	- the tracer does ptrace(SINGLESTEP, SIG), this approves the signal
	  and also sets TF

	- the tracee dequeues the signal, changes its IP to sig_handler().

	  then it notices TIF_SINGLESTEP and notifies the tracer without
	  return to user-space _and_ without clearing TF or TIF_SINGLESTEP

	- the tracer does ptrace(SINGLESTEP) again, but now enable_single_step()
	  looses TIF_FORCED_TF.

	- the tracer does ptrace(CONT), but user_disable_single_step() doesn't
	  clear TF since TIF_FORCED_TF is not set

	- the tracee returns to user-space with X86_EFLAGS_TF in eflags

Somehow we forgot about this bug...

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 19:34                                           ` Al Viro
@ 2012-04-29 22:51                                             ` Al Viro
  2012-04-30  6:39                                               ` Greg Ungerer
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-04-29 22:51 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel,
	Roland McGrath, Greg Ungerer

On Fri, Apr 27, 2012 at 08:34:13PM +0100, Al Viro wrote:
> On Fri, Apr 27, 2012 at 09:14:53PM +0200, Geert Uytterhoeven wrote:
> > On Fri, Apr 27, 2012 at 20:45, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >> The only comment I have,
> > >> 38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
> > >> forgets to unexport do_signal().
> > >
> > > Meh... ??The thing is, there are _two_ of them. ??signal_mm.c and signal_no.c
> > > badly need merging, with common stuff moved to signal.c. ??I really hate
> > 
> > Greg recently posted a patch to merge them:
> > http://www.spinics.net/lists/linux-m68k/msg04995.html
> 
> Nice...  BTW, could you comment on
>     m68k: don't open-code block_sigmask()
>     m68k: use set_current_blocked in sigreturn/rt_sigreturn
>     m68k-nommu: do_signal() is only called when returning to user mode
>     m68k: add TIF_NOTIFY_RESUME and handle it.
> in signal.git tree?  The last one is really interesting...
> 
> Just pick the current tree - one that was there yesterday had a dumb typo
> in thread_info.h part, so it wouldn't even compile (TIF_NOTIFE_RESUME
> defined instead of TIF_NOTIFY_RESUME ;-/)

BTW, Greg's patch put into the queue in the very beginning, everything rebased
on top of it.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-29 22:51                                             ` Al Viro
@ 2012-04-30  6:39                                               ` Greg Ungerer
  0 siblings, 0 replies; 91+ messages in thread
From: Greg Ungerer @ 2012-04-30  6:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Geert Uytterhoeven, Oleg Nesterov, Linus Torvalds, linux-arch,
	linux-kernel, Roland McGrath, Greg Ungerer

On 30/04/12 08:51, Al Viro wrote:
> On Fri, Apr 27, 2012 at 08:34:13PM +0100, Al Viro wrote:
>> On Fri, Apr 27, 2012 at 09:14:53PM +0200, Geert Uytterhoeven wrote:
>>> On Fri, Apr 27, 2012 at 20:45, Al Viro<viro@zeniv.linux.org.uk>  wrote:
>>>>> The only comment I have,
>>>>> 38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
>>>>> forgets to unexport do_signal().
>>>>
>>>> Meh... ??The thing is, there are _two_ of them. ??signal_mm.c and signal_no.c
>>>> badly need merging, with common stuff moved to signal.c. ??I really hate
>>>
>>> Greg recently posted a patch to merge them:
>>> http://www.spinics.net/lists/linux-m68k/msg04995.html
>>
>> Nice...  BTW, could you comment on
>>      m68k: don't open-code block_sigmask()
>>      m68k: use set_current_blocked in sigreturn/rt_sigreturn
>>      m68k-nommu: do_signal() is only called when returning to user mode
>>      m68k: add TIF_NOTIFY_RESUME and handle it.
>> in signal.git tree?  The last one is really interesting...
>>
>> Just pick the current tree - one that was there yesterday had a dumb typo
>> in thread_info.h part, so it wouldn't even compile (TIF_NOTIFE_RESUME
>> defined instead of TIF_NOTIFY_RESUME ;-/)
>
> BTW, Greg's patch put into the queue in the very beginning, everything rebased
> on top of it.

Those changes currently sitting in your signal tree look ok to me.

Acked-by: Greg Ungerer <gerg@uclinux.org>

Regards
Greg


------------------------------------------------------------------------
Greg Ungerer  --  Principal Engineer        EMAIL:     gerg@snapgear.com
SnapGear Group, McAfee                      PHONE:       +61 7 3435 2888
8 Gardner Close                             FAX:         +61 7 3217 5323
Milton, QLD, 4064, Australia                WEB: http://www.SnapGear.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-29  4:12                                                   ` Al Viro
@ 2012-04-30  8:06                                                     ` Martin Schwidefsky
  0 siblings, 0 replies; 91+ messages in thread
From: Martin Schwidefsky @ 2012-04-30  8:06 UTC (permalink / raw)
  To: Al Viro; +Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel

On Sun, 29 Apr 2012 05:12:05 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Sat, Apr 28, 2012 at 12:32:35AM +0100, Al Viro wrote:
> > On Sat, Apr 28, 2012 at 12:15:26AM +0100, Al Viro wrote:
> > 
> > > I think all such architectures need that check lifted to do_notify_resume()
> > > (and the rest needs it killed, of course).  Including x86, by the look
> > > of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
> > > !user_mode(regs), but I'm not entirely sure of that.  arm is in about the
> > > same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
> > > there like that (they all check it in the asm glue).  mips probably might,
> > > unless I'm misreading their ret_from_fork()...  Fun.
> > 
> > 	Speaking of user_mode() oddities - may I politely inquire what had
> > been smoked to inspire this (in arch/s390/kernel/signal.c):
> >         /* This is the legacy signal stack switching. */
> >         else if (!user_mode(regs) &&
> >                  !(ka->sa.sa_flags & SA_RESTORER) &&
> >                  ka->sa.sa_restorer) {
> >                 sp = (unsigned long) ka->sa.sa_restorer;
> >         }
> > especially when all paths leading to that come through do_signal() that does
> >         if (!user_mode(regs))
> >                 return;
> > on the same regs.  It had been like that since 2.3.99pre8 when s390 went
> > into the tree...  It looks vaguely similar to i386
> >                         /* This is the legacy signal stack switching. */
> >                         if ((regs->ss & 0xffff) != __USER_DS &&
> >                                 !(ka->sa.sa_flags & SA_RESTORER) &&
> >                                         ka->sa.sa_restorer)
> >                                 sp = (unsigned long) ka->sa.sa_restorer;
> > but there the code is at least not unreachable...
> 
> While we are at it, can we *ever* reach do_signal() (nevermind deep in its
> guts) with !user_mode(regs)?
> AFAICS, for 31bit possible paths are:
> do_signal()
> 	<- sysc_sigpending
> 		<- sysc_work
> 			<- sysc_tif, having checked for user_mode(%r11)
> 	<- io_sigpending
> 		<- io_work_tif
> 			<- io_work_user
> 				<- io_work, having checked for user_mode(%r11)
> 
> and identical for 64bit.  *All* paths into do_signal() go through
>         tm      __PT_PSW+1(%r11),0x01   # returning to user ?
> and proceed towards do_signal() only if the bit is set.  Which is precisely
> what user_mode(%r11) is...
> 
> What the hell is going on in that code?

Cut-copy-paste from x86. You are correct that do_signal is only ever called
with user_mode(regs) == 1. This allows us to remove the !user_mode() check
in do_signal and the SA_RESTORER thingy in get_sigframe. I once contemplated
to remove the SA_RESTORER option altogether, because a user space restorer
does not make much sense for s390. The user space restorer needs to 
1) restore all registers, and 2) branch back to the interrupted code.
For 2) we need a register which makes 1) impossible to do.
I still have that patch lying around on my hard drive.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-27 17:34                           ` Al Viro
  2012-04-27 18:52                             ` Kasatkin, Dmitry
@ 2012-04-30 14:32                             ` Mimi Zohar
  2012-05-03  4:23                               ` James Morris
  1 sibling, 1 reply; 91+ messages in thread
From: Mimi Zohar @ 2012-04-30 14:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Jonathan Corbet <corbet@lwn.net>,
	Kasatkin, Dmitry, Linus Torvalds, Hugh Dickins, linux-fsdevel,
	James Morris, linux-security-module, linux-kernel, David Safford,
	David Miller, Andrew Morton

On Fri, 2012-04-27 at 18:34 +0100, Al Viro wrote:
> On Fri, Apr 27, 2012 at 10:35:25AM +0300, Kasatkin, Dmitry wrote:
> 
> > But have you seen the proposed patch for __fput()?
> > [PATCH v4 10/12] ima: defer calling __fput()
> > 
> > It defers only of course the last AND mmap_sem is locked AND open for write.
> > 
> > 	if (current->mm && rwsem_is_locked(&current->mm->mmap_sem)) {
> > 		if (ima_defer_fput(file) == 0)
> > 			return;
> > 	}
> > 
> > Just 5 out of ~100,000 mmap_sem held fput() calls were deferred.
> 
> Let me get it straight.
> 	a) You still ignore all the problems with that described in the
> posting right in the beginning of this thread.
> 	b) You ignore the problems with semantics changes from user-visible
> delays of fput() past the return from syscall (described in Linus' posting
> upthread - they apply to this "solution" as well).
> 	c) You seem to consider the fact that this path will be exercised
> very rarely, thus making any races on it damn hard to reproduce and debug
> as a good thing.  
> 
> And as for the sentiment expressed in the beginning of your posting (that
> smaller patch size is worth more than clean locking rules, maintainability
> of resulting kernel, etc.)...  I'm sorry, but you guys need to decide
> what IMA is.  If it's a first-class part of the kernel, you have your
> priorities backwards...

Al, with all this time spent on the different components of the
integrity subsystem, making it a first class citizen is our main
concern.  We definitely appreciate all of your work, previous and
current work on fput, to help make this happen and will be happy to help
in any way possbile.

Jon, thank you for summarizing the discussion - article
http://lwn.net/Articles/494158/.  As Jake Edge's previous LWN article
http://lwn.net/Articles/488906/ said, we've been working on upstreaming
the different integrity components for quite a while.  Although
IMA-appraisal isn't the last component, it is a major one and were
hoping that it would be upstreamed in the near future.  Is 3.5 still a
possibility?

(It was just pointed out to me that the discussion has moved to
linux-arch.  I'm pretty sure that other people on the LSM mailing list
are following this discussion.  In the future, please CC the LSM
mailing.)

thanks,

Mimi


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-29 18:05                                                     ` Al Viro
@ 2012-05-01  4:31                                                       ` Al Viro
  2012-05-01  5:06                                                         ` Mike Frysinger
  2012-05-02 18:30                                                       ` Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-05-01  4:31 UTC (permalink / raw)
  To: Chen Liqin; +Cc: Linus Torvalds, linux-arch, linux-kernel, Oleg Nesterov

On Sun, Apr 29, 2012 at 07:05:35PM +0100, Al Viro wrote:
> > Looks like, the patch above fixes that.
> 
> Yes, found that shortly after posting.  No such luck for arm, though...

And for a bunch of other platforms too.  Situation right now:

alpha m68k powerpc sparc: do_notify_resume() reached only when returning to
user mode, no check

arm frv x86 mn10300: in current signal.git reached only when returning to
user mode, check removed

xtensa s390: reached only when returning to user mode, check removed

microblaze: in current signal.git reached only when returning to user mode,
check removed; also fixed bogus restart on sigreturn (a-la what had been
fixed on arm a couple of years ago) along with handling of multiple signal
arrivals.

blackfin: no loop (== multiple signals handling is fucked); no check either
        ret_from_fork doesn't handle signals, etc., userland or not.
        kernel_execve doesn't handle signals, etc., success or no success
        conclusion: check is probably not needed, multiple pending signals are
screwed

score: something very fishy there; fixing bogus restart on sigreturn is
simple, but what exactly clears regs->is_syscall on interrupts et.al.?
I don't see anything similar in there.  Looks like interrupts could be
confused for syscalls wrt restart logics.  And if happens when signal is
pending *and* %r4 contains e.g. -514, we'll get that -514 silently replaced
with -4.  Or cp0_epc gets decremented by 8, resulting in a couple of insns
getting repeated...  And regs->in_syscall is fairly deep in the stack,
so it doesn't look like it was something zeroed by hardware on interrupt...
What am I missing here?

It gets even funnier - in syscall_trace_enter, after we'd
done do_syscall_trace() we have this:
        brl     r8
(i.e. the actual call of sys_whatever_it_was()) followed by
        li      r8, -MAX_ERRNO - 1
        sw      r8, [r0, PT_R7]         # set error flag

        neg     r4, r4                  # error
        sw      r4, [r0, PT_R0]         # set flag for syscall
                                        # restarting
1:      sw      r4, [r0, PT_R2]         # result
        j       syscall_exit
which looks like a result of severe bitrot.  For one thing, regs->regs[0]
is *not* used anywhere in syscall restart logics in arch/score/kernel/signal.c;
for another, the whole thing looks like severely mangled remnants of
	if ((unsigned long)r4 >= (unsigned long)-MAX_ERRNO) {
		regs->regs[7] = 1;
		r4 = -r4;
	}
	regs->regs[4] = r4;
we do on normal (non-traced) syscall path.  Unconverted bits and pieces of
mips?  There return value does go into regs->regs[2] (and regs->regs[0] is
involved in syscall restart logics, while we are at it).  Overall, this
area looks very rotten.  BTW, what's the purpose of syscall_exit: there
and why is it different from syscall_return?  They seem to be identical
except for stray nop in the beginning of the former.  And unless something
very subtle is going on there, that nop *is* a stray one - namely, the
delay slot of immediately preceding "bl schedule_tail"...

Could the maintainers of arch/score tell what's going on?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-05-01  4:31                                                       ` Al Viro
@ 2012-05-01  5:06                                                         ` Mike Frysinger
  2012-05-01  5:52                                                           ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Mike Frysinger @ 2012-05-01  5:06 UTC (permalink / raw)
  To: Al Viro
  Cc: Chen Liqin, Linus Torvalds, linux-arch, linux-kernel, Oleg Nesterov

[-- Attachment #1: Type: Text/Plain, Size: 1311 bytes --]

On Tuesday 01 May 2012 00:31:29 Al Viro wrote:
> blackfin: no loop (== multiple signals handling is fucked); no check either
>         ret_from_fork doesn't handle signals, etc., userland or not.
>         kernel_execve doesn't handle signals, etc., success or no success
>         conclusion: check is probably not needed, multiple pending signals
> are screwed

to be honest, i haven't been following this thread as Blackfin wasn't mentioned 
in the initial summary.  now it seems we have ;).  i tried going back through 
this TIF_NOTIFY_RESUME thread but haven't quite got a bead on what needs to be 
done.

seems like you're only referring to ret_from_fork here and not the normal 
syscall return path ?  in the Blackfin case, we don't have a fork(), so we only 
have to handle the supervisor mode case (spawning kthreads), so i don't think 
we're quite as fucked as you might think :).

what is it you're suggesting we add ?  in the past, i found documentation on 
the arch TIF_*/notify requirements to be pretty much non-existent.  so some 
parts of the Blackfin paths are what i found from my eyes bleeding x86 asm 
paths, and from single testing some random tests (like strace or gdb).  things 
seem to run & be debugable, and no one has complained thus far, so we ship it!
-mike

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-05-01  5:06                                                         ` Mike Frysinger
@ 2012-05-01  5:52                                                           ` Al Viro
  2012-05-02 17:24                                                             ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2012-05-01  5:52 UTC (permalink / raw)
  To: Mike Frysinger; +Cc: Linus Torvalds, linux-arch, linux-kernel, Oleg Nesterov

On Tue, May 01, 2012 at 01:06:36AM -0400, Mike Frysinger wrote:
> > blackfin: no loop (== multiple signals handling is fucked); no check either
> >         ret_from_fork doesn't handle signals, etc., userland or not.
> >         kernel_execve doesn't handle signals, etc., success or no success
> >         conclusion: check is probably not needed, multiple pending signals
> > are screwed
> 
> to be honest, i haven't been following this thread as Blackfin wasn't mentioned 
> in the initial summary.  now it seems we have ;).  i tried going back through 
> this TIF_NOTIFY_RESUME thread but haven't quite got a bead on what needs to be 
> done.
> 
> seems like you're only referring to ret_from_fork here and not the normal 
> syscall return path ?  in the Blackfin case, we don't have a fork(), so we only 
> have to handle the supervisor mode case (spawning kthreads), so i don't think 
> we're quite as fucked as you might think :).

*anything* that goes through do_fork() (fork(2), clone(2), vfork(2)) will
hit ret_from_fork.  And kernel_execve() in case of success returns into
userland (on failure it stays in kernel).  So handling signals et.al.
on those would be nice.

In any case, the fucked up part is about handling of multiple pending
signals.

> what is it you're suggesting we add ?  in the past, i found documentation on 
> the arch TIF_*/notify requirements to be pretty much non-existent.  so some 
> parts of the Blackfin paths are what i found from my eyes bleeding x86 asm 

Tell me about that...  I've spent the last couple of weeks sniffing the
asm glue all over arch/* ;-/

> paths, and from single testing some random tests (like strace or gdb).  things 
> seem to run & be debugable, and no one has complained thus far, so we ship it!
> -mike

The main issue with blackfin, AFAICS, is that if you get several signals
pending, you need to build sigframes for all of them.  Especially since
you might have generated one yourself (SIGSEGV on failure to create a userland
stack frame for signal handler).  Leaving that SIGSEGV to be picked at
more or less random moment when you enter the kernel next time (e.g. the
next timer interrupt) is Not Nice(tm)...

So the normal logics looks like this:
loop:
	disable interrupts
	read thread flags
	if nothing had been set, you are done.  Restore registers and return
wherever your pt_regs would have you return.
	if NEED_RESCHED is set
		enable interrupts unless your context switch does that for you
		schedule();
		goto loop;
	at that point you don't need interrupts anymore; enable them
	if SIGPENDING or NOTIFY_RESUME is set
		do_notify_resume();
		goto loop;
	// maybe arch-specific flags handled the same way - before or
	// after SIGPENDING/NOTIFY_RESUME, also jumping back to the
	// beginning of the loop

The trouble is, you *really* don't want to get into that loop when
returning to kernel mode.  You can't process signals in that case,
so SIGPENDING will stay there and you'll loop forever.  One way to
deal with that one is to avoid going there if you are returning to
kernel mode; I think it'll be the easiest variant in case of blackfin.
Another way would be to check that just before we'd call do_notify_resume() -
again, buggering off if we are returning to the kernel.

Another thing you'll need is regs->orig_p0 = -1; in handle_restart() -
with multiple signals being handled you want restarts happening only
with the first one.  In your case the "I don't need to bother
with restarts" check is simple - something like arm has it much more
painfully...

IOW, here's what I'd suggest doing:

* add that regs->orig_p0 = -1; to handle_restart()
* add jump .Lresume_userspace_1 after the call of do_notify_resume()
* instead of RESTORE_CONTEXT/rti in ret_from_fork and kernel_execve()
have them jump to .Lresume_userspace.

and that'll be it.  BTW, 
 do_restart:   
                regs->p0 = regs->orig_p0;
                regs->r0 = regs->orig_r0;
                regs->pc -= 2;
in handle_restart() looks excessive, so you might want to trim it a bit
while you are there ;-)

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-27 17:54                                       ` Oleg Nesterov
@ 2012-05-02 10:37                                         ` Matt Fleming
  2012-05-02 14:14                                           ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Matt Fleming @ 2012-05-02 10:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Linus Torvalds, linux-arch, linux-kernel, Andrew Morton

On Fri, 2012-04-27 at 19:54 +0200, Oleg Nesterov wrote:
> On 04/27, Oleg Nesterov wrote:
> >
> > Not only sigreturn. Just look at sigprocmask(SIG_SETMASK) callers, almost
> > all should be converted to use set_current_blocked(). For example, pselect.
> 
> which btw doesn't need sigsaved.
> 
> > The last thing. Matt, could you please look at
> > git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
> > you already sent some of these changes (use set_current_blocked/block_sigmask).
> > Perhaps there are alreay in -mm or linux-next?
> 
> Forgot to add Matt, sorry for noise...

Any of the changes that aren't in Linus' tree will be in -mm and
linux-next. The ones currently in linux-next are,

07d969a parisc: use set_current_blocked() and block_sigmask()
5becb45 frv: use set_current_blocked() and block_sigmask()
3608417 blackfin: use set_current_blocked() and block_sigmask()
cd21f1a unicore32: use block_sigmask()
2bb36fd h8300: use set_current_blocked() and block_sigmask()
a624f4f score: use set_current_blocked() and block_sigmask()
2326e2e score: don't mask signals if we fail to setup signal stack
b2f6181 microblaze: use set_current_blocked() and block_sigmask()
f557a56 microblaze: fix signal masking
9ff7b4a microblaze: no need to reset handler if SA_ONESHOT
6cb49f5 microblaze: don't reimplement force_sigsegv()
bd9767f ia64: use set_current_blocked() and block_sigmask()
cdbc96c cris: use set_current_blocked() and block_sigmask()
295f127 mn10300: use set_current_blocked() and block_sigmask()
524987e m68k: use set_current_blocked() and block_sigmask()
f0c96a3 m32r: use set_current_blocked() and block_sigmask()
f0f0acd avr32: use block_sigmask()
ef64059 avr32: don't mask signals in the error path

... and this one from you, Oleg, that was sent as part of my series,

3e6c120 avr32: use set_current_blocked() in handle_signal/sys_rt_sigreturn

-- 
Matt Fleming, Intel Open Source Technology Center


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-05-02 10:37                                         ` Matt Fleming
@ 2012-05-02 14:14                                           ` Al Viro
  0 siblings, 0 replies; 91+ messages in thread
From: Al Viro @ 2012-05-02 14:14 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Oleg Nesterov, Linus Torvalds, linux-arch, linux-kernel, Andrew Morton

On Wed, May 02, 2012 at 11:37:50AM +0100, Matt Fleming wrote:
> > > The last thing. Matt, could you please look at
> > > git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
> > > you already sent some of these changes (use set_current_blocked/block_sigmask).
> > > Perhaps there are alreay in -mm or linux-next?
> > 
> > Forgot to add Matt, sorry for noise...
> 
> Any of the changes that aren't in Linus' tree will be in -mm and
> linux-next. The ones currently in linux-next are,
> 
> 07d969a parisc: use set_current_blocked() and block_sigmask()
> 5becb45 frv: use set_current_blocked() and block_sigmask()
> 3608417 blackfin: use set_current_blocked() and block_sigmask()
> cd21f1a unicore32: use block_sigmask()
> 2bb36fd h8300: use set_current_blocked() and block_sigmask()
> a624f4f score: use set_current_blocked() and block_sigmask()
> 2326e2e score: don't mask signals if we fail to setup signal stack
> b2f6181 microblaze: use set_current_blocked() and block_sigmask()
> f557a56 microblaze: fix signal masking
> 9ff7b4a microblaze: no need to reset handler if SA_ONESHOT
> 6cb49f5 microblaze: don't reimplement force_sigsegv()
> bd9767f ia64: use set_current_blocked() and block_sigmask()
> cdbc96c cris: use set_current_blocked() and block_sigmask()
> 295f127 mn10300: use set_current_blocked() and block_sigmask()
> 524987e m68k: use set_current_blocked() and block_sigmask()
> f0c96a3 m32r: use set_current_blocked() and block_sigmask()
> f0f0acd avr32: use block_sigmask()
> ef64059 avr32: don't mask signals in the error path
> 
> ... and this one from you, Oleg, that was sent as part of my series,
> 
> 3e6c120 avr32: use set_current_blocked() in handle_signal/sys_rt_sigreturn

OK, those commits cherry-picked into the beginning of the queue, the
rest rebased on top of them, dropping my duplicates.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-05-01  5:52                                                           ` Al Viro
@ 2012-05-02 17:24                                                             ` Al Viro
  0 siblings, 0 replies; 91+ messages in thread
From: Al Viro @ 2012-05-02 17:24 UTC (permalink / raw)
  To: Mike Frysinger; +Cc: Linus Torvalds, linux-arch, linux-kernel, Oleg Nesterov

On Tue, May 01, 2012 at 06:52:26AM +0100, Al Viro wrote:
> IOW, here's what I'd suggest doing:
> 
> * add that regs->orig_p0 = -1; to handle_restart()
> * add jump .Lresume_userspace_1 after the call of do_notify_resume()
> * instead of RESTORE_CONTEXT/rti in ret_from_fork and kernel_execve()
> have them jump to .Lresume_userspace.

arrrgh...  The last one is unfortunately not so simple - you have that
in the guts of _system_call, which is called via pseudo_long_call,
so it would ought to be a call into the middle of _system_call, followed
by that RESTORE_CONTEXT/rti...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such
  2012-04-29 18:05                                                     ` Al Viro
  2012-05-01  4:31                                                       ` Al Viro
@ 2012-05-02 18:30                                                       ` Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2012-05-02 18:30 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-arch, linux-kernel, Thomas Gleixner

On 04/29, Al Viro wrote:
>
> 	* Now, if do_notify_resume() does nothing in case !user_mode(regs),
> let's lift that check to (32bit) caller.  What we have right now is
> 	do_notify_resume(%esp, NULL, %ecx)
> 	goto resume_userspace_sig;
> resume_userspace_sig:
> 	if (!user_mode_vm(%esp))
> 		goto resume_kernel;
> resume_userspace:
> So after lifting the check we get
> 	if (user_mode(%esp))
> 		do_notify_resume(%esp, NULL, %ecx)
> 	goto resume_userspace_sig;
> resume_userspace_sig:
> 	if (!user_mode_vm(%esp))
> 		goto resume_kernel;
> resume_userspace:
> but user_mode(regs) being true means that user_mode_vm(regs) is also true,
> so this code is equivalent to
> 	if (!user_mode(%esp))
> 		goto resume_kernel;
> 	do_notify_resume(%esp, NULL, %ecx)
> 	goto resume_userspace;
> (with stuff around resume_userspace_sig left without changes).

Yes, thanks, this looks correct.

I've read the new patches in your tree. Again, I do not have any
useful comment, but a couple of questions.

And just in case... I will be completely offline till May 9.


----------------------------------------
046a099ad7b3791a7f9dfbe56ac1263bda8b1974 arm: if there's no handler we need to restore sigmask, syscall or no syscall

with or without this patch, set_current_blocked(->saved_sigmask) doesn't
look exactly right after force_sigsegv(), this can block SIGSEGV.

And force_sigsegv(sig => 0) looks strange, but this is off-topic.

And the question, I am just curious...

OTOH. I am not sure I understand the "int syscall" argument correctly,
I'll assume it means the same as "regs->orig_ax > 0" on x86. In this
case it is not clear to me how "!syscall && TIF_RESTORE_SIGMASK" is
possible.

x86 does this outside of the "if (syscall_get_nr(current, regs)" block
too. Probably this makes sense because debugger can change orig_ax in
between?

(The same for the next db7fddb9574c175aabdbcaa74b736bb3d1665a8e change
 in unicore32)

----------------------------------------
415a12e79ebfa703a5ec91c85cb29f6ecc844aa1 most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set

Cosmetic nit. With this patch we have

	void set_current_blocked(sigset_t *newset)
	{
		struct task_struct *tsk = current;
		sigdelsetmask(newset, sigmask(SIGKILL) | sigmask(SIGSTOP));
		spin_lock_irq(&tsk->sighand->siglock);
		__set_task_blocked(tsk, newset);
		spin_unlock_irq(&tsk->sighand->siglock);
	}

but it could simply do

	void set_current_blocked(sigset_t *newset)
	{
		sigdelsetmask(newset, sigmask(SIGKILL) | sigmask(SIGSTOP));
		__set_current_blocked(newset);
	}

-----------------------------------------
fa04e22b239aa035f3ae77151e26b03400303245 FRV: Shrink TIF_WORK_MASK [ver #2]

Off-topic/stupid question. Even if I know nothing about arch/frv, this looks
like a nice change to me because

	#define _TIF_WORK_MASK         0x0000FFFE
	#define _TIF_ALLWORK_MASK      0x0000FFFF

looks very confusing imho. I mean, it is not clear which bits do we actually
want to check.

Can't we (cough, you ;) also cleanup _TIF_WORK_MASK/_TIF_ALLWORK_MASK on x86?


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches)
  2012-04-30 14:32                             ` Mimi Zohar
@ 2012-05-03  4:23                               ` James Morris
  0 siblings, 0 replies; 91+ messages in thread
From: James Morris @ 2012-05-03  4:23 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: Al Viro, Jonathan Corbet <corbet@lwn.net>,
	Kasatkin, Dmitry, Linus Torvalds, Hugh Dickins, linux-fsdevel,
	linux-security-module, linux-kernel, David Safford, David Miller,
	Andrew Morton

On Mon, 30 Apr 2012, Mimi Zohar wrote:

> Jon, thank you for summarizing the discussion - article
> http://lwn.net/Articles/494158/.  As Jake Edge's previous LWN article
> http://lwn.net/Articles/488906/ said, we've been working on upstreaming
> the different integrity components for quite a while.  Although
> IMA-appraisal isn't the last component, it is a major one and were
> hoping that it would be upstreamed in the near future.  Is 3.5 still a
> possibility?

Not sure why you're asking Jon, but the technical issues raised by Al need 
to be resolved before the code can be considered for upstreaming.


- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 91+ messages in thread

end of thread, other threads:[~2012-05-03  4:23 UTC | newest]

Thread overview: 91+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-18 13:04 [PULL REQUEST] : ima-appraisal patches Mimi Zohar
2012-04-18 15:02 ` James Morris
2012-04-18 18:07   ` Mimi Zohar
2012-04-18 18:39     ` Al Viro
2012-04-18 20:56       ` Mimi Zohar
2012-04-19 19:57       ` Mimi Zohar
2012-04-20  0:43         ` [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches) Al Viro
2012-04-20  2:31           ` Linus Torvalds
2012-04-20  2:54             ` Al Viro
2012-04-20  2:58               ` Linus Torvalds
2012-04-20  8:09                 ` Al Viro
2012-04-20 15:56                   ` Linus Torvalds
2012-04-20 16:08                     ` Al Viro
2012-04-20 16:42                       ` Al Viro
2012-04-20 17:21                         ` Linus Torvalds
2012-04-20 18:07                           ` Al Viro
2012-04-23 18:01                             ` [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such Al Viro
2012-04-23 18:37                               ` Oleg Nesterov
2012-04-24  7:26                               ` Al Viro
2012-04-25  3:06                                 ` Al Viro
2012-04-25 12:37                                   ` Oleg Nesterov
2012-04-25 12:50                                     ` Al Viro
2012-04-25 13:03                                       ` Oleg Nesterov
2012-04-25 13:32                                         ` Oleg Nesterov
2012-04-25 13:32                                         ` Al Viro
2012-04-25 14:52                                           ` Oleg Nesterov
2012-04-25 15:46                                             ` Oleg Nesterov
2012-04-25 16:10                                               ` Al Viro
2012-04-25 17:02                                                 ` Oleg Nesterov
2012-04-25 17:51                                                   ` Al Viro
2012-04-26  7:15                                                     ` Martin Schwidefsky
2012-04-26  7:25                                                       ` David Miller
2012-04-26 13:52                                                       ` Oleg Nesterov
2012-04-26 14:31                                                         ` Martin Schwidefsky
2012-04-26 13:22                                                     ` Oleg Nesterov
2012-04-26 18:37                                 ` Oleg Nesterov
2012-04-26 23:19                                   ` Al Viro
2012-04-27 17:24                                     ` Oleg Nesterov
2012-04-27 17:54                                       ` Oleg Nesterov
2012-05-02 10:37                                         ` Matt Fleming
2012-05-02 14:14                                           ` Al Viro
2012-04-27 18:45                                       ` Al Viro
2012-04-27 19:14                                         ` Geert Uytterhoeven
2012-04-27 19:34                                           ` Al Viro
2012-04-29 22:51                                             ` Al Viro
2012-04-30  6:39                                               ` Greg Ungerer
2012-04-27 19:42                                         ` Al Viro
2012-04-27 20:20                                         ` Roland McGrath
2012-04-27 21:12                                           ` Al Viro
2012-04-27 21:27                                             ` Roland McGrath
2012-04-27 23:15                                               ` Al Viro
2012-04-27 23:32                                                 ` Al Viro
2012-04-29  4:12                                                   ` Al Viro
2012-04-30  8:06                                                     ` Martin Schwidefsky
2012-04-27 23:50                                                 ` Al Viro
2012-04-28 18:51                                                   ` [PATCH] arch/tile: avoid calling do_signal() after fork from a kernel thread Chris Metcalf
2012-04-28 20:55                                                     ` Al Viro
2012-04-28 21:46                                                       ` Chris Metcalf
2012-04-29  0:55                                                         ` Al Viro
2012-04-28 18:51                                                           ` [PATCH v2] arch/tile: fix up some issues in calling do_work_pending() Chris Metcalf
2012-04-29  3:49                                                           ` [PATCH] arch/tile: avoid calling do_signal() after fork from a kernel thread Chris Metcalf
2012-04-28  2:42                                                 ` [RFC] TIF_NOTIFY_RESUME, arch/*/*/*signal*.c and all such Al Viro
2012-04-28  3:32                                                   ` Al Viro
2012-04-28  3:36                                                     ` Al Viro
2012-04-29 16:33                                                     ` Oleg Nesterov
2012-04-29 16:18                                                   ` Oleg Nesterov
2012-04-29 18:05                                                     ` Al Viro
2012-05-01  4:31                                                       ` Al Viro
2012-05-01  5:06                                                         ` Mike Frysinger
2012-05-01  5:52                                                           ` Al Viro
2012-05-02 17:24                                                             ` Al Viro
2012-05-02 18:30                                                       ` Oleg Nesterov
2012-04-29 16:41                                         ` Oleg Nesterov
2012-04-29 18:09                                           ` Al Viro
2012-04-29 18:25                                             ` Oleg Nesterov
2012-04-20  3:15               ` [RFC] situation with fput() locking (was Re: [PULL REQUEST] : ima-appraisal patches) Al Viro
2012-04-20 18:54           ` Hugh Dickins
2012-04-20 19:04             ` Al Viro
2012-04-20 19:18               ` Linus Torvalds
2012-04-20 19:32                 ` Hugh Dickins
2012-04-20 19:58                 ` Al Viro
2012-04-20 21:12                   ` Linus Torvalds
2012-04-20 22:13                     ` Al Viro
2012-04-20 22:35                       ` Linus Torvalds
2012-04-27  7:35                         ` Kasatkin, Dmitry
2012-04-27 17:34                           ` Al Viro
2012-04-27 18:52                             ` Kasatkin, Dmitry
2012-04-27 19:15                               ` Kasatkin, Dmitry
2012-04-30 14:32                             ` Mimi Zohar
2012-05-03  4:23                               ` James Morris
2012-04-20 19:37               ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).