linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [3.3-rc7] sys_poll use after free (hibernate)
@ 2012-03-13  0:58 Dave Jones
  2012-03-18 19:02 ` Linus Torvalds
  2012-03-18 19:47 ` richard -rw- weinberger
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Jones @ 2012-03-13  0:58 UTC (permalink / raw)
  To: Linux Kernel; +Cc: Linus Torvalds

While trying to reproduce the i915 memory corruption problem with hibernate,
I've hit the following use-after-free quite a few times, by running my
syscall fuzzer and hibernating/resuming in a loop.

oops below on 3.3rc7, but same problem also affects 3.2.9, and probably
many versions before that.

	Dave

general protection fault: 0000 [#1] PREEMPT SMP 
CPU 3 
Modules linked in: binfmt_misc cmtp kernelcapi hidp can_bcm sctp dccp_ipv6 dccp_ipv4 dccp af_802154 phonet bluetooth rfkill can pppo
e pppox ppp_generic slhc irda crc_ccitt rds af_key rose ax25 atm appletalk ipx p8022 psnap llc p8023 nfs fscache auth_rpcgss nfs_acl lockd ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 
xt_state nf_conntrack ip6table_filter ip6_tables btrfs dm_mirror dm_region_hash dm_log zlib_deflate libcrc32c snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_seq sunrpc ppd
ev snd_seq_device snd_pcm tg3 iTCO_wdt parport_pc microcode snd_timer snd raid0 i5000_edac iTCO_vendor_support dcdbas parport soundcore usb_debug shpchp pcspkr serio_raw floppy edac_co
re snd_page_alloc i2c_i801 i5k_amb firewire_ohci firewire_core crc_itu_t nouveau ttm drm_kms_helper drm i2c_core mxm_wmi video wmi [last unloaded: scsi_wait_scan]

Pid: 18277, comm: trinity Not tainted 3.3.0-rc7+ #11 Dell Inc.                 Precision WorkStation 490    /0DT031
RIP: 0010:[<ffffffff81233e3e>]  [<ffffffff81233e3e>] proc_sys_poll+0x4e/0x90
RSP: 0018:ffff880207167b08  EFLAGS: 00010246
RAX: 0000000000000145 RBX: ffff88020cab6940 RCX: 0000000000000000
RDX: ffffffff81233df0 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88020cab6940
RBP: ffff880207167b28 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8802085fd510
R13: 0000000000000000 R14: ffff8802098c63e4 R15: 0000000000000000
FS:  00007fa1a4f70700(0000) GS:ffff880236c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000021d8928 CR3: 000000012dff9000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity (pid: 18277, threadinfo ffff880207166000, task ffff880233300000)
Stack:
 000000000000002a ffff8802098c5dfc 000000000000002a ffff88020cab6940
 ffff880207167f38 ffffffff811d4360 00000000021d7a10 ffffc900000000e1
 00000000021d7920 ffff880233300000 0000000000000000 ffff880207167b88
Call Trace:
 [<ffffffff811d4360>] do_sys_poll+0x280/0x500
 [<ffffffff811d2e20>] ? poll_freewait+0xe0/0xe0
 [<ffffffff81179c9a>] ? do_wp_page+0x1da/0x720
 [<ffffffff810a09a1>] ? get_parent_ip+0x11/0x50
 [<ffffffff816a1bad>] ? sub_preempt_count+0x9d/0xd0
 [<ffffffff8169d8e5>] ? _raw_spin_unlock+0x35/0x60
 [<ffffffff81179dc8>] ? do_wp_page+0x308/0x720
 [<ffffffff81333e74>] ? debug_object_activate+0x84/0x1a0
 [<ffffffff8117b92d>] ? handle_pte_fault+0x27d/0xa10
 [<ffffffff811b1d71>] ? mem_cgroup_count_vm_event+0x31/0x1a0
 [<ffffffff816a1bad>] ? sub_preempt_count+0x9d/0xd0
 [<ffffffff811d46bb>] sys_poll+0x6b/0x100
 [<ffffffff816a5ae9>] system_call_fastpath+0x16/0x1b
Code: 00 48 89 fb 48 89 f1 48 8b 40 30 4c 8b 60 e8 b8 45 01 00 00 49 83 7c 24 28 00 74 2e 49 8b 74 24 30 48 85 f6 74 24 48 85 c9 75 32 <8b> 16 b8 45 01 00 00 48 63 d2 49 39 d5 74 10 8b 06 48 98 48 89 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [3.3-rc7] sys_poll use after free (hibernate)
  2012-03-13  0:58 [3.3-rc7] sys_poll use after free (hibernate) Dave Jones
@ 2012-03-18 19:02 ` Linus Torvalds
  2012-03-18 19:27   ` Al Viro
  2012-03-18 19:47 ` richard -rw- weinberger
  1 sibling, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2012-03-18 19:02 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Lucas De Marchi, Andrew Morton, Al Viro

On Mon, Mar 12, 2012 at 5:58 PM, Dave Jones <davej@redhat.com> wrote:
>
> While trying to reproduce the i915 memory corruption problem with hibernate,
> I've hit the following use-after-free quite a few times, by running my
> syscall fuzzer and hibernating/resuming in a loop.
>
> oops below on 3.3rc7, but same problem also affects 3.2.9, and probably
> many versions before that.
>
> RIP: proc_sys_poll+0x4e/0x90
> Code: 00 48 89 fb 48 89 f1 48 8b 40 30 4c 8b 60 e8 b8 45 01 00 00 49 83 7c 24 28 00 74 2e 49 8b 74 24 30 48 85 f6 74 24 48 85 c9 75 32 <8b> 16 b8 45 01 00 00 48 63 d2 49 39 d5 74 10 8b 06 48 98 48 89

The code is:

   0:	48 89 fb             	mov    %rdi,%rbx
   3:	48 89 f1             	mov    %rsi,%rcx
   6:	48 8b 40 30          	mov    0x30(%rax),%rax
   a:	4c 8b 60 e8          	mov    -0x18(%rax),%r12
   e:	b8 45 01 00 00       	mov    $0x145,%eax
  13:	49 83 7c 24 28 00    	cmpq   $0x0,0x28(%r12)
  19:	74 2e                	je     0x49
  1b:	49 8b 74 24 30       	mov    0x30(%r12),%rsi
  20:	48 85 f6             	test   %rsi,%rsi
  23:	74 24                	je     0x49
  25:	48 85 c9             	test   %rcx,%rcx
  28:	75 32                	jne    0x5c
  2a:*	8b 16                	mov    (%rsi),%edx     <-- trapping instruction
  2c:	b8 45 01 00 00       	mov    $0x145,%eax
  31:	48 63 d2             	movslq %edx,%rdx
  34:	49 39 d5             	cmp    %rdx,%r13
  37:	74 10                	je     0x49
  39:	8b 06                	mov    (%rsi),%eax

and that load is from

    poll_wait(filp, &table->poll->wait, wait);

where the testing of %rsi and %rcx are the "if (p && wait_address)"
check in poll_wait(), and %rsi is "table->poll" if I read it all
correctly.

And the 6b6b6b6b6b6b6b6b pattern is obviously POISON_FREE, so
apparently 'table' has already been freed.

I suspect the whole sysctl 'poll' code is seriously broken, since it
seems to depend on those ctl_table pointers being stable over the
whole open/close sequence, but if somebody unregisters the sysctl,
it's all gone. The ctl_table doesn't have any refcounting etc, and I
suspect that your hibernate sequence ends up unregistering some sysctl
(perhaps as part of a module unload?)

The sysctl poll code was added fairly recently, I'm adding the guilty
parties and some other relevant people to the cc.

Oh, how I hate sysctl's. Misdesigned from the start.

                   Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [3.3-rc7] sys_poll use after free (hibernate)
  2012-03-18 19:02 ` Linus Torvalds
@ 2012-03-18 19:27   ` Al Viro
  2012-03-19  8:17     ` Alexey Dobriyan
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Al Viro @ 2012-03-18 19:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Lucas De Marchi, Andrew Morton

On Sun, Mar 18, 2012 at 12:02:04PM -0700, Linus Torvalds wrote:
> and that load is from
> 
>     poll_wait(filp, &table->poll->wait, wait);
> 
> where the testing of %rsi and %rcx are the "if (p && wait_address)"
> check in poll_wait(), and %rsi is "table->poll" if I read it all
> correctly.
> 
> And the 6b6b6b6b6b6b6b6b pattern is obviously POISON_FREE, so
> apparently 'table' has already been freed.
> 
> I suspect the whole sysctl 'poll' code is seriously broken, since it
> seems to depend on those ctl_table pointers being stable over the
> whole open/close sequence, but if somebody unregisters the sysctl,
> it's all gone. The ctl_table doesn't have any refcounting etc, and I
> suspect that your hibernate sequence ends up unregistering some sysctl
> (perhaps as part of a module unload?)

Ewww...  The way it was supposed to work (prio to ->poll() madness) was
that actual IO gets wrapped into grab_header()/sysctl_head_finish()
pair.  proc_sys_poll() doesn't do it, so yes, that post-mortem is
very likely to be correct.

Looking at that sucker a bit more: what the hell is proc_sys_setattr()
doing with vmtruncate(), of all things???  Unless something has changed
very much and very badly, it does *not* use page cache at all...

Incidentally, I wonder if we want the whole thing in fs/proc; the argument
against splitoff to a separate fs used to be "that would break userland
setups - can't ask people to update /etc/fstab or init scripts to mount
that thing on /proc/sys".  Fair enough, but... what's to stop us from slapping
->d_automount() on /proc/sys like that:
	struct vfsmount *mnt = vfs_kern_mount(&sysctlfs_type, 0, "sysctl", 0);
	if (!IS_ERR(mnt))
		mntget(mnt);
	return mnt;
and we are all set.  IOW, now that ->d_automount() stuff is there, we can
do that easily without any userland breakage.  Comments?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [3.3-rc7] sys_poll use after free (hibernate)
  2012-03-13  0:58 [3.3-rc7] sys_poll use after free (hibernate) Dave Jones
  2012-03-18 19:02 ` Linus Torvalds
@ 2012-03-18 19:47 ` richard -rw- weinberger
  2012-03-18 21:24   ` Dave Jones
  1 sibling, 1 reply; 18+ messages in thread
From: richard -rw- weinberger @ 2012-03-18 19:47 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Linus Torvalds

On Tue, Mar 13, 2012 at 1:58 AM, Dave Jones <davej@redhat.com> wrote:
> While trying to reproduce the i915 memory corruption problem with hibernate,
> I've hit the following use-after-free quite a few times, by running my
> syscall fuzzer and hibernating/resuming in a loop.
>

If you cannot reproduce the problem you can send me debug/test-patches.
I have two machines where I can reproduce the issue within an hour.

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [3.3-rc7] sys_poll use after free (hibernate)
  2012-03-18 19:47 ` richard -rw- weinberger
@ 2012-03-18 21:24   ` Dave Jones
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Jones @ 2012-03-18 21:24 UTC (permalink / raw)
  To: richard -rw- weinberger; +Cc: Linux Kernel, Linus Torvalds

On Sun, Mar 18, 2012 at 08:47:19PM +0100, richard -rw- weinberger wrote:
 > On Tue, Mar 13, 2012 at 1:58 AM, Dave Jones <davej@redhat.com> wrote:
 > > While trying to reproduce the i915 memory corruption problem with hibernate,
 > 
 > If you cannot reproduce the problem you can send me debug/test-patches.
 > I have two machines where I can reproduce the issue within an hour.

At the time I was hoping to bisect it. But it looks like it only happens on
ironlake era graphics and newer, which I don't have.

With that info, I'm not sure it's really bisectable. The bug has probably
been there since day 1 when ironlake support was added.
See the thread 'Subject: Re: inode->i_wb_list corruption.' for some further
thoughts, where it's theorised that the GTT contains stale entries after
when we thaw.  I wouldn't be surprised if no-one had even tried hibernate
(or at least noticed the memory corruption immediately) before that was merged.

My thinking is that some kind of GTT teardown in the ->hibernate routine is
probably what's needed.

	Dave


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [3.3-rc7] sys_poll use after free (hibernate)
  2012-03-18 19:27   ` Al Viro
@ 2012-03-19  8:17     ` Alexey Dobriyan
  2012-03-20  6:08     ` Lucas De Marchi
  2012-03-22 22:24     ` [3.3-rc7] sys_poll use after free (hibernate) Eric W. Biederman
  2 siblings, 0 replies; 18+ messages in thread
From: Alexey Dobriyan @ 2012-03-19  8:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Jones, Linux Kernel, Lucas De Marchi, Andrew Morton

On Sun, Mar 18, 2012 at 07:27:55PM +0000, Al Viro wrote:
> On Sun, Mar 18, 2012 at 12:02:04PM -0700, Linus Torvalds wrote:
> Incidentally, I wonder if we want the whole thing in fs/proc; the argument
> against splitoff to a separate fs used to be "that would break userland
> setups - can't ask people to update /etc/fstab or init scripts to mount
> that thing on /proc/sys".  Fair enough, but... what's to stop us from slapping
> ->d_automount() on /proc/sys like that:
> 	struct vfsmount *mnt = vfs_kern_mount(&sysctlfs_type, 0, "sysctl", 0);
> 	if (!IS_ERR(mnt))
> 		mntget(mnt);
> 	return mnt;
> and we are all set.  IOW, now that ->d_automount() stuff is there, we can
> do that easily without any userland breakage.  Comments?

IIRC, fstab argument was the only one.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [3.3-rc7] sys_poll use after free (hibernate)
  2012-03-18 19:27   ` Al Viro
  2012-03-19  8:17     ` Alexey Dobriyan
@ 2012-03-20  6:08     ` Lucas De Marchi
  2012-03-20 18:29       ` [PATCH] sysctl: protect poll() in entries that may go away Lucas De Marchi
  2012-03-22 21:31       ` [3.3-rc7] sys_poll use after free (hibernate) Eric W. Biederman
  2012-03-22 22:24     ` [3.3-rc7] sys_poll use after free (hibernate) Eric W. Biederman
  2 siblings, 2 replies; 18+ messages in thread
From: Lucas De Marchi @ 2012-03-20  6:08 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Dave Jones, Linux Kernel, Andrew Morton

On Sun, Mar 18, 2012 at 4:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Mar 18, 2012 at 12:02:04PM -0700, Linus Torvalds wrote:
>> and that load is from
>>
>>     poll_wait(filp, &table->poll->wait, wait);
>>
>> where the testing of %rsi and %rcx are the "if (p && wait_address)"
>> check in poll_wait(), and %rsi is "table->poll" if I read it all
>> correctly.
>>
>> And the 6b6b6b6b6b6b6b6b pattern is obviously POISON_FREE, so
>> apparently 'table' has already been freed.
>>
>> I suspect the whole sysctl 'poll' code is seriously broken, since it
>> seems to depend on those ctl_table pointers being stable over the
>> whole open/close sequence, but if somebody unregisters the sysctl,
>> it's all gone. The ctl_table doesn't have any refcounting etc, and I
>> suspect that your hibernate sequence ends up unregistering some sysctl
>> (perhaps as part of a module unload?)

How could that happen if the only files that support poll  right now
on sysctl are kernel/hostname and kernel/domainname?

>
> Ewww...  The way it was supposed to work (prio to ->poll() madness) was
> that actual IO gets wrapped into grab_header()/sysctl_head_finish()
> pair.  proc_sys_poll() doesn't do it, so yes, that post-mortem is
> very likely to be correct.

Yes, it  seems like I forgot to call grab_header() there, sorry for
that. I'll prepare a patch and send you later today. I just wonder
what is happening to reach that code... :-/


Lucas De Marchi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] sysctl: protect poll() in entries that may go away
  2012-03-20  6:08     ` Lucas De Marchi
@ 2012-03-20 18:29       ` Lucas De Marchi
  2012-03-22 21:31       ` [3.3-rc7] sys_poll use after free (hibernate) Eric W. Biederman
  1 sibling, 0 replies; 18+ messages in thread
From: Lucas De Marchi @ 2012-03-20 18:29 UTC (permalink / raw)
  To: Al Viro, Dave Jones
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel, Lucas De Marchi

Protect code accessing ctl_table by grabbing the header with
grab_header() and after releasing with sysctl_head_finish(). This is
needed if poll() is called in entries created by modules: currently only
hostname and domainname support poll(), but this bug may be triggered
when/if modules use it and if user called poll() in a file that doesn't
support it.

Dave Jones reported the following when using a syscall fuzzer while
hibernating/resuming:

RIP: 0010:[<ffffffff81233e3e>]  [<ffffffff81233e3e>] proc_sys_poll+0x4e/0x90
RAX: 0000000000000145 RBX: ffff88020cab6940 RCX: 0000000000000000
RDX: ffffffff81233df0 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88020cab6940
[ ... ]
Code: 00 48 89 fb 48 89 f1 48 8b 40 30 4c 8b 60 e8 b8 45 01 00 00 49 83
7c 24 28 00 74 2e 49 8b 74 24 30 48 85 f6 74 24 48 85 c9 75 32 <8b> 16
b8 45 01 00 00 48 63 d2 49 39 d5 74 10 8b 06 48 98 48 89

If an entry goes away while we are polling() it, ctl_table may not exist
anymore.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
---
 fs/proc/proc_sysctl.c |   17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index a6b6217..53c3bce 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -188,20 +188,32 @@ static ssize_t proc_sys_write(struct file *filp, const char __user *buf,
 
 static int proc_sys_open(struct inode *inode, struct file *filp)
 {
+	struct ctl_table_header *head = grab_header(inode);
 	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
 
+	/* sysctl was unregistered */
+	if (IS_ERR(head))
+		return PTR_ERR(head);
+
 	if (table->poll)
 		filp->private_data = proc_sys_poll_event(table->poll);
 
+	sysctl_head_finish(head);
+
 	return 0;
 }
 
 static unsigned int proc_sys_poll(struct file *filp, poll_table *wait)
 {
 	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct ctl_table_header *head = grab_header(inode);
 	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
-	unsigned long event = (unsigned long)filp->private_data;
 	unsigned int ret = DEFAULT_POLLMASK;
+	unsigned long event;
+
+	/* sysctl was unregistered */
+	if (IS_ERR(head))
+		return POLLERR | POLLHUP;
 
 	if (!table->proc_handler)
 		goto out;
@@ -209,6 +221,7 @@ static unsigned int proc_sys_poll(struct file *filp, poll_table *wait)
 	if (!table->poll)
 		goto out;
 
+	event = (unsigned long)filp->private_data;
 	poll_wait(filp, &table->poll->wait, wait);
 
 	if (event != atomic_read(&table->poll->event)) {
@@ -217,6 +230,8 @@ static unsigned int proc_sys_poll(struct file *filp, poll_table *wait)
 	}
 
 out:
+	sysctl_head_finish(head);
+
 	return ret;
 }
 
-- 
1.7.9.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [3.3-rc7] sys_poll use after free (hibernate)
  2012-03-20  6:08     ` Lucas De Marchi
  2012-03-20 18:29       ` [PATCH] sysctl: protect poll() in entries that may go away Lucas De Marchi
@ 2012-03-22 21:31       ` Eric W. Biederman
  2012-03-22 22:12         ` Lucas De Marchi
  1 sibling, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2012-03-22 21:31 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: Al Viro, Linus Torvalds, Dave Jones, Linux Kernel, Andrew Morton

Lucas De Marchi <lucas.demarchi@profusion.mobi> writes:

> On Sun, Mar 18, 2012 at 4:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Sun, Mar 18, 2012 at 12:02:04PM -0700, Linus Torvalds wrote:
>>> and that load is from
>>>
>>>     poll_wait(filp, &table->poll->wait, wait);
>>>
>>> where the testing of %rsi and %rcx are the "if (p && wait_address)"
>>> check in poll_wait(), and %rsi is "table->poll" if I read it all
>>> correctly.
>>>
>>> And the 6b6b6b6b6b6b6b6b pattern is obviously POISON_FREE, so
>>> apparently 'table' has already been freed.
>>>
>>> I suspect the whole sysctl 'poll' code is seriously broken, since it
>>> seems to depend on those ctl_table pointers being stable over the
>>> whole open/close sequence, but if somebody unregisters the sysctl,
>>> it's all gone. The ctl_table doesn't have any refcounting etc, and I
>>> suspect that your hibernate sequence ends up unregistering some sysctl
>>> (perhaps as part of a module unload?)
>
> How could that happen if the only files that support poll  right now
> on sysctl are kernel/hostname and kernel/domainname?
>
>>
>> Ewww...  The way it was supposed to work (prio to ->poll() madness) was
>> that actual IO gets wrapped into grab_header()/sysctl_head_finish()
>> pair.  proc_sys_poll() doesn't do it, so yes, that post-mortem is
>> very likely to be correct.
>
> Yes, it  seems like I forgot to call grab_header() there, sorry for
> that. I'll prepare a patch and send you later today. I just wonder
> what is happening to reach that code... :-/

It looks like it was a combination of the fuzzer doing silly things
and a removed ctl_table entry being poisoned and having .poll set
to 6b6b6b6b6b6b6b6b so the guard against calling poll when it is
nonsense did not trigger.  So your patch should be sufficient
for now.

Long term we still need a version of poll that is safe to use
with modules.

Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [3.3-rc7] sys_poll use after free (hibernate)
  2012-03-22 21:31       ` [3.3-rc7] sys_poll use after free (hibernate) Eric W. Biederman
@ 2012-03-22 22:12         ` Lucas De Marchi
  2012-03-22 23:02           ` Eric W. Biederman
  2012-03-24  0:25           ` [REVIEW][PATCH] Making poll generally useful for sysctls Eric W. Biederman
  0 siblings, 2 replies; 18+ messages in thread
From: Lucas De Marchi @ 2012-03-22 22:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Linus Torvalds, Dave Jones, Linux Kernel, Andrew Morton

On Thu, Mar 22, 2012 at 6:31 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Lucas De Marchi <lucas.demarchi@profusion.mobi> writes:
>
>> On Sun, Mar 18, 2012 at 4:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>> On Sun, Mar 18, 2012 at 12:02:04PM -0700, Linus Torvalds wrote:
>>>> and that load is from
>>>>
>>>>     poll_wait(filp, &table->poll->wait, wait);
>>>>
>>>> where the testing of %rsi and %rcx are the "if (p && wait_address)"
>>>> check in poll_wait(), and %rsi is "table->poll" if I read it all
>>>> correctly.
>>>>
>>>> And the 6b6b6b6b6b6b6b6b pattern is obviously POISON_FREE, so
>>>> apparently 'table' has already been freed.
>>>>
>>>> I suspect the whole sysctl 'poll' code is seriously broken, since it
>>>> seems to depend on those ctl_table pointers being stable over the
>>>> whole open/close sequence, but if somebody unregisters the sysctl,
>>>> it's all gone. The ctl_table doesn't have any refcounting etc, and I
>>>> suspect that your hibernate sequence ends up unregistering some sysctl
>>>> (perhaps as part of a module unload?)
>>
>> How could that happen if the only files that support poll  right now
>> on sysctl are kernel/hostname and kernel/domainname?
>>
>>>
>>> Ewww...  The way it was supposed to work (prio to ->poll() madness) was
>>> that actual IO gets wrapped into grab_header()/sysctl_head_finish()
>>> pair.  proc_sys_poll() doesn't do it, so yes, that post-mortem is
>>> very likely to be correct.
>>
>> Yes, it  seems like I forgot to call grab_header() there, sorry for
>> that. I'll prepare a patch and send you later today. I just wonder
>> what is happening to reach that code... :-/
>
> It looks like it was a combination of the fuzzer doing silly things
> and a removed ctl_table entry being poisoned and having .poll set
> to 6b6b6b6b6b6b6b6b so the guard against calling poll when it is
> nonsense did not trigger.  So your patch should be sufficient
> for now.

What I understood afterwards was:

1. fuzzer calling poll() on files that did support poll
2. modules that created that sysctl entries were removed
3. 'table' was entirely removed (not ->poll).

>
> Long term we still need a version of poll that is safe to use
> with modules.

I think the way it's now (with my patch taken by Andrew) is safe for
having poll() with modules.


Lucas De Marchi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [3.3-rc7] sys_poll use after free (hibernate)
  2012-03-18 19:27   ` Al Viro
  2012-03-19  8:17     ` Alexey Dobriyan
  2012-03-20  6:08     ` Lucas De Marchi
@ 2012-03-22 22:24     ` Eric W. Biederman
  2 siblings, 0 replies; 18+ messages in thread
From: Eric W. Biederman @ 2012-03-22 22:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Jones, Linux Kernel, Lucas De Marchi, Andrew Morton

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Sun, Mar 18, 2012 at 12:02:04PM -0700, Linus Torvalds wrote:
>> and that load is from
>> 
>>     poll_wait(filp, &table->poll->wait, wait);
>> 
>> where the testing of %rsi and %rcx are the "if (p && wait_address)"
>> check in poll_wait(), and %rsi is "table->poll" if I read it all
>> correctly.
>> 
>> And the 6b6b6b6b6b6b6b6b pattern is obviously POISON_FREE, so
>> apparently 'table' has already been freed.
>> 
>> I suspect the whole sysctl 'poll' code is seriously broken, since it
>> seems to depend on those ctl_table pointers being stable over the
>> whole open/close sequence, but if somebody unregisters the sysctl,
>> it's all gone. The ctl_table doesn't have any refcounting etc, and I
>> suspect that your hibernate sequence ends up unregistering some sysctl
>> (perhaps as part of a module unload?)
>
> Ewww...  The way it was supposed to work (prio to ->poll() madness) was
> that actual IO gets wrapped into grab_header()/sysctl_head_finish()
> pair.  proc_sys_poll() doesn't do it, so yes, that post-mortem is
> very likely to be correct.

> Looking at that sucker a bit more: what the hell is proc_sys_setattr()
> doing with vmtruncate(), of all things???  Unless something has changed
> very much and very badly, it does *not* use page cache at all...

sysctl continues not to use the page cache.  The vmtruncate was a generic
vfs level push down that has not been removed as unnecessary in
proc_sysctl.c yet.

The question of how to cleanly implement suppoort for byte level
read/writes of sysctl entries remains an open problem.  But even that
looks like a job for seq_file or cousin of seq_file rather than the page
cache.

> Incidentally, I wonder if we want the whole thing in fs/proc; the argument
> against splitoff to a separate fs used to be "that would break userland
> setups - can't ask people to update /etc/fstab or init scripts to mount
> that thing on /proc/sys".  Fair enough, but... what's to stop us from slapping
> ->d_automount() on /proc/sys like that:
> 	struct vfsmount *mnt = vfs_kern_mount(&sysctlfs_type, 0, "sysctl", 0);
> 	if (!IS_ERR(mnt))
> 		mntget(mnt);
> 	return mnt;
> and we are all set.  IOW, now that ->d_automount() stuff is there, we can
> do that easily without any userland breakage.  Comments?

Is that something we could do with /proc/<pid>/net as well?

I am looking at what it will take to move /proc/sys into /proc/<pid>/sys
so that we can remove the namespace inspired dcache weirdness.  With my
recent cleanups that should be a pretty simple change.

I remember attempting that once before for /proc/<pid>/net and the
review got stalled in getting the expiry logic right.

Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [3.3-rc7] sys_poll use after free (hibernate)
  2012-03-22 22:12         ` Lucas De Marchi
@ 2012-03-22 23:02           ` Eric W. Biederman
  2012-03-24  0:25           ` [REVIEW][PATCH] Making poll generally useful for sysctls Eric W. Biederman
  1 sibling, 0 replies; 18+ messages in thread
From: Eric W. Biederman @ 2012-03-22 23:02 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: Al Viro, Linus Torvalds, Dave Jones, Linux Kernel, Andrew Morton

Lucas De Marchi <lucas.demarchi@profusion.mobi> writes:

> On Thu, Mar 22, 2012 at 6:31 PM, Eric W. Biederman

>> It looks like it was a combination of the fuzzer doing silly things
>> and a removed ctl_table entry being poisoned and having .poll set
>> to 6b6b6b6b6b6b6b6b so the guard against calling poll when it is
>> nonsense did not trigger.  So your patch should be sufficient
>> for now.
>
> What I understood afterwards was:
>
> 1. fuzzer calling poll() on files that did support poll
> 2. modules that created that sysctl entries were removed
> 3. 'table' was entirely removed (not ->poll).

I just grepped the kernel for ctl_table_poll and DEFINE_CTL_TABLE_POLL.
There are only the two original users of hostname and domainname.

The problem very much had to be that ctl_table was freed and poisoned
but we still pointed to it, and we were not using the grab_header idiom
to ensure we did not use an expired ctl_table entry.

Which means that it was any ctl_table being add/removed.  Probably in
this case the per cpu scheduler sysctl table entries that get
added/removed whenever we logically add/remove a cpu.

I expect what happened is that the fuzzer opened the sysctl file some
time before it was removed and then sometime after the entry was removed
(but before the memory was reused) called select/poll on that file
descriptor.  Since the ctl_table was poisoned ->poll was
6b6b6b6b6b6b6b6b and so we passed the checks for a non NULL ->poll and
we proceed to do nonsense things that caused the kernel oops in
proc_sys_poll.

>> Long term we still need a version of poll that is safe to use
>> with modules.
>
> I think the way it's now (with my patch taken by Andrew) is safe for
> having poll() with modules.

No it is not.

The problem is that proc_sys_poll is non-blocking.  It is called
primarily to place the system on a wait queue.  But notice that
if you place the caller on a wait_queue in proc_sys_poll and return
then we may call unregister_sysctl_table while and remove the sysctl
while someone still is on the wait queue.  Sleeping on a wait_queue
that has been freed is so bizarre I don't want to think about the
failure modes.

sysfs solves this problem by tracking openers and has it's wait_queue in
the per opener structure.  That same logic needs to be mirrored in
sysctl for poll to be safe on any sysctl table entry that can be
removed.

I believe a correct fix would remove the .poll field in struct ctl_table,
remove struct ctl_table_poll entirely and modify the signature of
proc_sys_poll_notify to be:
void proc_sys_poll_notify(struct ctl_table_header *head, struct ctl_table *table);

Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [REVIEW][PATCH] Making poll generally useful for sysctls
  2012-03-22 22:12         ` Lucas De Marchi
  2012-03-22 23:02           ` Eric W. Biederman
@ 2012-03-24  0:25           ` Eric W. Biederman
  2012-03-24  6:20             ` Lucas De Marchi
  1 sibling, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2012-03-24  0:25 UTC (permalink / raw)
  To: Lucas De Marchi; +Cc: Linux Kernel, linux-fsdevel


Lucas can you take a look at the patch below. I don't have a test
harness to test this but this should make poll generally useful
for all sysctls as well as fix the module removal case.

In particular if you could test to see if the code still works as
expected for the hostname and domainname cases that would be great.

I don't anticipate any problems but real world testing is always good.

 fs/proc/proc_sysctl.c   |   34 +++++++++++++++++-----------------
 include/linux/sysctl.h  |   22 +++-------------------
 kernel/utsname_sysctl.c |   14 ++++----------
 3 files changed, 24 insertions(+), 46 deletions(-)
---
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 47b474b..98fd5c9 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -16,13 +16,15 @@ static const struct inode_operations proc_sys_inode_operations;
 static const struct file_operations proc_sys_dir_file_operations;
 static const struct inode_operations proc_sys_dir_operations;
 
-void proc_sys_poll_notify(struct ctl_table_poll *poll)
+static inline void *proc_sys_poll_event(struct ctl_table *table)
 {
-	if (!poll)
-		return;
+	return (void *)(unsigned long)atomic_read(&table->event);
+}
 
-	atomic_inc(&poll->event);
-	wake_up_interruptible(&poll->wait);
+void proc_sys_poll_notify(struct ctl_table_header *head, struct ctl_table *table)
+{
+	atomic_inc(&table->event);
+	wake_up_interruptible(&head->wait);
 }
 
 static struct ctl_table root_table[] = {
@@ -169,6 +171,7 @@ static void init_header(struct ctl_table_header *head,
 		for (entry = table; entry->procname; entry++, node++) {
 			rb_init_node(&node->node);
 			node->header = head;
+			atomic_set(&entry->event, 1);
 		}
 	}
 }
@@ -240,6 +243,8 @@ static void start_unregistering(struct ctl_table_header *p)
 		/* anything non-NULL; we'll never dereference it */
 		p->unregistering = ERR_PTR(-EINVAL);
 	}
+	/* Don't let poll sleep forever on deleted entries */
+	wake_up_interruptible(&p->wait);
 	/*
 	 * do not remove from the list until nobody holds it; walking the
 	 * list in do_sysctl() relies on that.
@@ -505,6 +510,9 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf,
 	error = table->proc_handler(table, write, buf, &res, ppos);
 	if (!error)
 		error = res;
+
+	if (write)
+		proc_sys_poll_notify(head, table);
 out:
 	sysctl_head_finish(head);
 
@@ -532,8 +540,7 @@ static int proc_sys_open(struct inode *inode, struct file *filp)
 	if (IS_ERR(head))
 		return PTR_ERR(head);
 
-	if (table->poll)
-		filp->private_data = proc_sys_poll_event(table->poll);
+	filp->private_data = proc_sys_poll_event(table);
 
 	sysctl_head_finish(head);
 
@@ -552,21 +559,14 @@ static unsigned int proc_sys_poll(struct file *filp, poll_table *wait)
 	if (IS_ERR(head))
 		return POLLERR | POLLHUP;
 
-	if (!table->proc_handler)
-		goto out;
-
-	if (!table->poll)
-		goto out;
-
 	event = (unsigned long)filp->private_data;
-	poll_wait(filp, &table->poll->wait, wait);
+	poll_wait(filp, &head->wait, wait);
 
-	if (event != atomic_read(&table->poll->event)) {
-		filp->private_data = proc_sys_poll_event(table->poll);
+	if (event != atomic_read(&table->event)) {
+		filp->private_data = proc_sys_poll_event(table);
 		ret = POLLIN | POLLRDNORM | POLLERR | POLLPRI;
 	}
 
-out:
 	sysctl_head_finish(head);
 
 	return ret;
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index c34b4c8..8647ee0 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -992,34 +992,17 @@ extern int proc_do_large_bitmap(struct ctl_table *, int,
  * cover common cases.
  */
 
-/* Support for userspace poll() to watch for changes */
-struct ctl_table_poll {
-	atomic_t event;
-	wait_queue_head_t wait;
-};
-
-static inline void *proc_sys_poll_event(struct ctl_table_poll *poll)
-{
-	return (void *)(unsigned long)atomic_read(&poll->event);
-}
-
-#define __CTL_TABLE_POLL_INITIALIZER(name) {				\
-	.event = ATOMIC_INIT(0),					\
-	.wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.wait) }
-
-#define DEFINE_CTL_TABLE_POLL(name)					\
-	struct ctl_table_poll name = __CTL_TABLE_POLL_INITIALIZER(name)
 
 /* A sysctl table is an array of struct ctl_table: */
 struct ctl_table 
 {
 	const char *procname;		/* Text ID for /proc/sys, or zero */
 	void *data;
+	atomic_t event;
 	int maxlen;
 	umode_t mode;
 	struct ctl_table *child;	/* Deprecated */
 	proc_handler *proc_handler;	/* Callback for text formatting */
-	struct ctl_table_poll *poll;
 	void *extra1;
 	void *extra2;
 };
@@ -1042,6 +1025,7 @@ struct ctl_table_header
 		};
 		struct rcu_head rcu;
 	};
+	wait_queue_head_t wait;
 	struct completion *unregistering;
 	struct ctl_table *ctl_table_arg;
 	struct ctl_table_root *root;
@@ -1076,7 +1060,7 @@ struct ctl_path {
 
 #ifdef CONFIG_SYSCTL
 
-void proc_sys_poll_notify(struct ctl_table_poll *poll);
+void proc_sys_poll_notify(struct ctl_table_header *head, struct ctl_table *table);
 
 extern void setup_sysctl_set(struct ctl_table_set *p,
 	struct ctl_table_root *root,
diff --git a/kernel/utsname_sysctl.c b/kernel/utsname_sysctl.c
index 63da38c..3e91a50 100644
--- a/kernel/utsname_sysctl.c
+++ b/kernel/utsname_sysctl.c
@@ -53,18 +53,12 @@ static int proc_do_uts_string(ctl_table *table, int write,
 	r = proc_dostring(&uts_table,write,buffer,lenp, ppos);
 	put_uts(table, write, uts_table.data);
 
-	if (write)
-		proc_sys_poll_notify(table->poll);
-
 	return r;
 }
 #else
 #define proc_do_uts_string NULL
 #endif
 
-static DEFINE_CTL_TABLE_POLL(hostname_poll);
-static DEFINE_CTL_TABLE_POLL(domainname_poll);
-
 static struct ctl_table uts_kern_table[] = {
 	{
 		.procname	= "ostype",
@@ -93,7 +87,6 @@ static struct ctl_table uts_kern_table[] = {
 		.maxlen		= sizeof(init_uts_ns.name.nodename),
 		.mode		= 0644,
 		.proc_handler	= proc_do_uts_string,
-		.poll		= &hostname_poll,
 	},
 	{
 		.procname	= "domainname",
@@ -101,7 +94,6 @@ static struct ctl_table uts_kern_table[] = {
 		.maxlen		= sizeof(init_uts_ns.name.domainname),
 		.mode		= 0644,
 		.proc_handler	= proc_do_uts_string,
-		.poll		= &domainname_poll,
 	},
 	{}
 };
@@ -115,6 +107,8 @@ static struct ctl_table uts_root_table[] = {
 	{}
 };
 
+static struct ctl_table_header *uts_header;
+
 #ifdef CONFIG_PROC_SYSCTL
 /*
  * Notify userspace about a change in a certain entry of uts_kern_table,
@@ -124,13 +118,13 @@ void uts_proc_notify(enum uts_proc proc)
 {
 	struct ctl_table *table = &uts_kern_table[proc];
 
-	proc_sys_poll_notify(table->poll);
+	proc_sys_poll_notify(uts_header, table);
 }
 #endif
 
 static int __init utsname_sysctl_init(void)
 {
-	register_sysctl_table(uts_root_table);
+	uts_header = register_sysctl_table(uts_root_table);
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [REVIEW][PATCH] Making poll generally useful for sysctls
  2012-03-24  0:25           ` [REVIEW][PATCH] Making poll generally useful for sysctls Eric W. Biederman
@ 2012-03-24  6:20             ` Lucas De Marchi
  2012-03-24  7:58               ` Eric W. Biederman
  0 siblings, 1 reply; 18+ messages in thread
From: Lucas De Marchi @ 2012-03-24  6:20 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Kernel, linux-fsdevel

On Fri, Mar 23, 2012 at 9:25 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Lucas can you take a look at the patch below. I don't have a test
> harness to test this but this should make poll generally useful
> for all sysctls as well as fix the module removal case.
>
> In particular if you could test to see if the code still works as
> expected for the hostname and domainname cases that would be great.
>
> I don't anticipate any problems but real world testing is always good.
>
>  fs/proc/proc_sysctl.c   |   34 +++++++++++++++++-----------------
>  include/linux/sysctl.h  |   22 +++-------------------
>  kernel/utsname_sysctl.c |   14 ++++----------
>  3 files changed, 24 insertions(+), 46 deletions(-)
> ---
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index 47b474b..98fd5c9 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -16,13 +16,15 @@ static const struct inode_operations proc_sys_inode_operations;
>  static const struct file_operations proc_sys_dir_file_operations;
>  static const struct inode_operations proc_sys_dir_operations;
>
> -void proc_sys_poll_notify(struct ctl_table_poll *poll)
> +static inline void *proc_sys_poll_event(struct ctl_table *table)
>  {
> -       if (!poll)
> -               return;
> +       return (void *)(unsigned long)atomic_read(&table->event);
> +}
>
> -       atomic_inc(&poll->event);
> -       wake_up_interruptible(&poll->wait);
> +void proc_sys_poll_notify(struct ctl_table_header *head, struct ctl_table *table)
> +{
> +       atomic_inc(&table->event);
> +       wake_up_interruptible(&head->wait);
>  }
>
>  static struct ctl_table root_table[] = {
> @@ -169,6 +171,7 @@ static void init_header(struct ctl_table_header *head,
>                for (entry = table; entry->procname; entry++, node++) {
>                        rb_init_node(&node->node);
>                        node->header = head;
> +                       atomic_set(&entry->event, 1);
>                }
>        }
>  }
> @@ -240,6 +243,8 @@ static void start_unregistering(struct ctl_table_header *p)
>                /* anything non-NULL; we'll never dereference it */
>                p->unregistering = ERR_PTR(-EINVAL);
>        }
> +       /* Don't let poll sleep forever on deleted entries */
> +       wake_up_interruptible(&p->wait);
>        /*
>         * do not remove from the list until nobody holds it; walking the
>         * list in do_sysctl() relies on that.
> @@ -505,6 +510,9 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf,
>        error = table->proc_handler(table, write, buf, &res, ppos);
>        if (!error)
>                error = res;
> +
> +       if (write)
> +               proc_sys_poll_notify(head, table);
>  out:
>        sysctl_head_finish(head);
>
> @@ -532,8 +540,7 @@ static int proc_sys_open(struct inode *inode, struct file *filp)
>        if (IS_ERR(head))
>                return PTR_ERR(head);
>
> -       if (table->poll)
> -               filp->private_data = proc_sys_poll_event(table->poll);
> +       filp->private_data = proc_sys_poll_event(table);
>
>        sysctl_head_finish(head);
>
> @@ -552,21 +559,14 @@ static unsigned int proc_sys_poll(struct file *filp, poll_table *wait)
>        if (IS_ERR(head))
>                return POLLERR | POLLHUP;
>
> -       if (!table->proc_handler)
> -               goto out;
> -
> -       if (!table->poll)
> -               goto out;
> -
>        event = (unsigned long)filp->private_data;
> -       poll_wait(filp, &table->poll->wait, wait);
> +       poll_wait(filp, &head->wait, wait);
>
> -       if (event != atomic_read(&table->poll->event)) {
> -               filp->private_data = proc_sys_poll_event(table->poll);
> +       if (event != atomic_read(&table->event)) {
> +               filp->private_data = proc_sys_poll_event(table);
>                ret = POLLIN | POLLRDNORM | POLLERR | POLLPRI;
>        }
>
> -out:
>        sysctl_head_finish(head);
>
>        return ret;
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index c34b4c8..8647ee0 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -992,34 +992,17 @@ extern int proc_do_large_bitmap(struct ctl_table *, int,
>  * cover common cases.
>  */
>
> -/* Support for userspace poll() to watch for changes */
> -struct ctl_table_poll {
> -       atomic_t event;
> -       wait_queue_head_t wait;
> -};
> -
> -static inline void *proc_sys_poll_event(struct ctl_table_poll *poll)
> -{
> -       return (void *)(unsigned long)atomic_read(&poll->event);
> -}
> -
> -#define __CTL_TABLE_POLL_INITIALIZER(name) {                           \
> -       .event = ATOMIC_INIT(0),                                        \
> -       .wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.wait) }
> -
> -#define DEFINE_CTL_TABLE_POLL(name)                                    \
> -       struct ctl_table_poll name = __CTL_TABLE_POLL_INITIALIZER(name)
>
>  /* A sysctl table is an array of struct ctl_table: */
>  struct ctl_table
>  {
>        const char *procname;           /* Text ID for /proc/sys, or zero */
>        void *data;
> +       atomic_t event;
>        int maxlen;
>        umode_t mode;
>        struct ctl_table *child;        /* Deprecated */
>        proc_handler *proc_handler;     /* Callback for text formatting */
> -       struct ctl_table_poll *poll;
>        void *extra1;
>        void *extra2;
>  };
> @@ -1042,6 +1025,7 @@ struct ctl_table_header
>                };
>                struct rcu_head rcu;
>        };
> +       wait_queue_head_t wait;

If you have a waitqueue per table instead of per entry you will get
spurious notifications when other entries change. The easiest way to
test this is to poll /proc/sys/kernel/hostname and change your
domainname.

I couldn't apply this patch to any tree (linus/master + my previous
patch, your master, 3.3 + my previous patch), so I couldn't test. On
top of your tree:

[lucas@vader kernel]$ git am /tmp/a.patch
Applying: Making poll generally useful for sysctls
error: patch failed: fs/proc/proc_sysctl.c:16
error: fs/proc/proc_sysctl.c: patch does not apply
error: patch failed: include/linux/sysctl.h:992
error: include/linux/sysctl.h: patch does not apply
Patch failed at 0001 Making poll generally useful for sysctls


Lucas De Marchi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [REVIEW][PATCH] Making poll generally useful for sysctls
  2012-03-24  6:20             ` Lucas De Marchi
@ 2012-03-24  7:58               ` Eric W. Biederman
  2012-03-26 17:44                 ` Lucas De Marchi
  0 siblings, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2012-03-24  7:58 UTC (permalink / raw)
  To: Lucas De Marchi; +Cc: Linux Kernel, linux-fsdevel

Lucas De Marchi <lucas.demarchi@profusion.mobi> writes:

>>  /* A sysctl table is an array of struct ctl_table: */
>>  struct ctl_table
>>  {
>>        const char *procname;           /* Text ID for /proc/sys, or zero */
>>        void *data;
>> +       atomic_t event;
>>        int maxlen;
>>        umode_t mode;
>>        struct ctl_table *child;        /* Deprecated */
>>        proc_handler *proc_handler;     /* Callback for text formatting */
>> -       struct ctl_table_poll *poll;
>>        void *extra1;
>>        void *extra2;
>>  };
>> @@ -1042,6 +1025,7 @@ struct ctl_table_header
>>                };
>>                struct rcu_head rcu;
>>        };
>> +       wait_queue_head_t wait;
>
> If you have a waitqueue per table instead of per entry you will get
> spurious notifications when other entries change. The easiest way to
> test this is to poll /proc/sys/kernel/hostname and change your
> domainname.

You will get spurious wakeups but not spurious notifications to
userspace since event is still per entry.

For my money that seemed a nice compromise of code simplicity, and
generality.  We could of course do something much closer to what
sysfs does and allocate and refcount something similar to your
ctl_table_poll when we have a ctl_table opened.  But that just looked
like a pain.

Of course we already have spurious notifications for hostname and
domainname when multiple uts namespaces are involved, but that
is a different problem.

> I couldn't apply this patch to any tree (linus/master + my previous
> patch, your master, 3.3 + my previous patch), so I couldn't test. On
> top of your tree:

How odd.  It should have applied cleanly to my tree and it applies
with just a two line offset top of Linus's latest with my tree merged
in.  Those two lines of offset coming from the two extra includes
that came in through the merge.

patch -p1 --dry-run <  ~/tmp/sysctl-poll-test.patch 
patching file fs/proc/proc_sysctl.c
Hunk #1 succeeded at 18 (offset 2 lines).
Hunk #2 succeeded at 173 (offset 2 lines).
Hunk #3 succeeded at 245 (offset 2 lines).
Hunk #4 succeeded at 512 (offset 2 lines).
Hunk #5 succeeded at 542 (offset 2 lines).
Hunk #6 succeeded at 561 (offset 2 lines).
patching file include/linux/sysctl.h
patching file kernel/utsname_sysctl.c

> [lucas@vader kernel]$ git am /tmp/a.patch
> Applying: Making poll generally useful for sysctls
> error: patch failed: fs/proc/proc_sysctl.c:16
> error: fs/proc/proc_sysctl.c: patch does not apply
> error: patch failed: include/linux/sysctl.h:992
> error: include/linux/sysctl.h: patch does not apply
> Patch failed at 0001 Making poll generally useful for sysctls

Here is rebased version of the patch just in case that helps.

---

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 21d836f..739615c 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -18,13 +18,15 @@ static const struct inode_operations proc_sys_inode_operations;
 static const struct file_operations proc_sys_dir_file_operations;
 static const struct inode_operations proc_sys_dir_operations;
 
-void proc_sys_poll_notify(struct ctl_table_poll *poll)
+static inline void *proc_sys_poll_event(struct ctl_table *table)
 {
-	if (!poll)
-		return;
+	return (void *)(unsigned long)atomic_read(&table->event);
+}
 
-	atomic_inc(&poll->event);
-	wake_up_interruptible(&poll->wait);
+void proc_sys_poll_notify(struct ctl_table_header *head, struct ctl_table *table)
+{
+	atomic_inc(&table->event);
+	wake_up_interruptible(&head->wait);
 }
 
 static struct ctl_table root_table[] = {
@@ -171,6 +173,7 @@ static void init_header(struct ctl_table_header *head,
 		for (entry = table; entry->procname; entry++, node++) {
 			rb_init_node(&node->node);
 			node->header = head;
+			atomic_set(&entry->event, 1);
 		}
 	}
 }
@@ -242,6 +245,8 @@ static void start_unregistering(struct ctl_table_header *p)
 		/* anything non-NULL; we'll never dereference it */
 		p->unregistering = ERR_PTR(-EINVAL);
 	}
+	/* Don't let poll sleep forever on deleted entries */
+	wake_up_interruptible(&p->wait);
 	/*
 	 * do not remove from the list until nobody holds it; walking the
 	 * list in do_sysctl() relies on that.
@@ -507,6 +512,9 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf,
 	error = table->proc_handler(table, write, buf, &res, ppos);
 	if (!error)
 		error = res;
+
+	if (write)
+		proc_sys_poll_notify(head, table);
 out:
 	sysctl_head_finish(head);
 
@@ -534,8 +542,7 @@ static int proc_sys_open(struct inode *inode, struct file *filp)
 	if (IS_ERR(head))
 		return PTR_ERR(head);
 
-	if (table->poll)
-		filp->private_data = proc_sys_poll_event(table->poll);
+	filp->private_data = proc_sys_poll_event(table);
 
 	sysctl_head_finish(head);
 
@@ -554,21 +561,14 @@ static unsigned int proc_sys_poll(struct file *filp, poll_table *wait)
 	if (IS_ERR(head))
 		return POLLERR | POLLHUP;
 
-	if (!table->proc_handler)
-		goto out;
-
-	if (!table->poll)
-		goto out;
-
 	event = (unsigned long)filp->private_data;
-	poll_wait(filp, &table->poll->wait, wait);
+	poll_wait(filp, &head->wait, wait);
 
-	if (event != atomic_read(&table->poll->event)) {
-		filp->private_data = proc_sys_poll_event(table->poll);
+	if (event != atomic_read(&table->event)) {
+		filp->private_data = proc_sys_poll_event(table);
 		ret = POLLIN | POLLRDNORM | POLLERR | POLLPRI;
 	}
 
-out:
 	sysctl_head_finish(head);
 
 	return ret;
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index c34b4c8..8647ee0 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -992,34 +992,17 @@ extern int proc_do_large_bitmap(struct ctl_table *, int,
  * cover common cases.
  */
 
-/* Support for userspace poll() to watch for changes */
-struct ctl_table_poll {
-	atomic_t event;
-	wait_queue_head_t wait;
-};
-
-static inline void *proc_sys_poll_event(struct ctl_table_poll *poll)
-{
-	return (void *)(unsigned long)atomic_read(&poll->event);
-}
-
-#define __CTL_TABLE_POLL_INITIALIZER(name) {				\
-	.event = ATOMIC_INIT(0),					\
-	.wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.wait) }
-
-#define DEFINE_CTL_TABLE_POLL(name)					\
-	struct ctl_table_poll name = __CTL_TABLE_POLL_INITIALIZER(name)
 
 /* A sysctl table is an array of struct ctl_table: */
 struct ctl_table 
 {
 	const char *procname;		/* Text ID for /proc/sys, or zero */
 	void *data;
+	atomic_t event;
 	int maxlen;
 	umode_t mode;
 	struct ctl_table *child;	/* Deprecated */
 	proc_handler *proc_handler;	/* Callback for text formatting */
-	struct ctl_table_poll *poll;
 	void *extra1;
 	void *extra2;
 };
@@ -1042,6 +1025,7 @@ struct ctl_table_header
 		};
 		struct rcu_head rcu;
 	};
+	wait_queue_head_t wait;
 	struct completion *unregistering;
 	struct ctl_table *ctl_table_arg;
 	struct ctl_table_root *root;
@@ -1076,7 +1060,7 @@ struct ctl_path {
 
 #ifdef CONFIG_SYSCTL
 
-void proc_sys_poll_notify(struct ctl_table_poll *poll);
+void proc_sys_poll_notify(struct ctl_table_header *head, struct ctl_table *table);
 
 extern void setup_sysctl_set(struct ctl_table_set *p,
 	struct ctl_table_root *root,
diff --git a/kernel/utsname_sysctl.c b/kernel/utsname_sysctl.c
index 63da38c..3e91a50 100644
--- a/kernel/utsname_sysctl.c
+++ b/kernel/utsname_sysctl.c
@@ -53,18 +53,12 @@ static int proc_do_uts_string(ctl_table *table, int write,
 	r = proc_dostring(&uts_table,write,buffer,lenp, ppos);
 	put_uts(table, write, uts_table.data);
 
-	if (write)
-		proc_sys_poll_notify(table->poll);
-
 	return r;
 }
 #else
 #define proc_do_uts_string NULL
 #endif
 
-static DEFINE_CTL_TABLE_POLL(hostname_poll);
-static DEFINE_CTL_TABLE_POLL(domainname_poll);
-
 static struct ctl_table uts_kern_table[] = {
 	{
 		.procname	= "ostype",
@@ -93,7 +87,6 @@ static struct ctl_table uts_kern_table[] = {
 		.maxlen		= sizeof(init_uts_ns.name.nodename),
 		.mode		= 0644,
 		.proc_handler	= proc_do_uts_string,
-		.poll		= &hostname_poll,
 	},
 	{
 		.procname	= "domainname",
@@ -101,7 +94,6 @@ static struct ctl_table uts_kern_table[] = {
 		.maxlen		= sizeof(init_uts_ns.name.domainname),
 		.mode		= 0644,
 		.proc_handler	= proc_do_uts_string,
-		.poll		= &domainname_poll,
 	},
 	{}
 };
@@ -115,6 +107,8 @@ static struct ctl_table uts_root_table[] = {
 	{}
 };
 
+static struct ctl_table_header *uts_header;
+
 #ifdef CONFIG_PROC_SYSCTL
 /*
  * Notify userspace about a change in a certain entry of uts_kern_table,
@@ -124,13 +118,13 @@ void uts_proc_notify(enum uts_proc proc)
 {
 	struct ctl_table *table = &uts_kern_table[proc];
 
-	proc_sys_poll_notify(table->poll);
+	proc_sys_poll_notify(uts_header, table);
 }
 #endif
 
 static int __init utsname_sysctl_init(void)
 {
-	register_sysctl_table(uts_root_table);
+	uts_header = register_sysctl_table(uts_root_table);
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [REVIEW][PATCH] Making poll generally useful for sysctls
  2012-03-24  7:58               ` Eric W. Biederman
@ 2012-03-26 17:44                 ` Lucas De Marchi
  2012-03-27  4:02                   ` Lucas De Marchi
  0 siblings, 1 reply; 18+ messages in thread
From: Lucas De Marchi @ 2012-03-26 17:44 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Kernel, linux-fsdevel

Hi Eric,

On Sat, Mar 24, 2012 at 4:58 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Lucas De Marchi <lucas.demarchi@profusion.mobi> writes:
>
>>>  /* A sysctl table is an array of struct ctl_table: */
>>>  struct ctl_table
>>>  {
>>>        const char *procname;           /* Text ID for /proc/sys, or zero */
>>>        void *data;
>>> +       atomic_t event;
>>>        int maxlen;
>>>        umode_t mode;
>>>        struct ctl_table *child;        /* Deprecated */
>>>        proc_handler *proc_handler;     /* Callback for text formatting */
>>> -       struct ctl_table_poll *poll;
>>>        void *extra1;
>>>        void *extra2;
>>>  };
>>> @@ -1042,6 +1025,7 @@ struct ctl_table_header
>>>                };
>>>                struct rcu_head rcu;
>>>        };
>>> +       wait_queue_head_t wait;
>>
>> If you have a waitqueue per table instead of per entry you will get
>> spurious notifications when other entries change. The easiest way to
>> test this is to poll /proc/sys/kernel/hostname and change your
>> domainname.
>
> You will get spurious wakeups but not spurious notifications to
> userspace since event is still per entry.

Yeah, indeed.

> For my money that seemed a nice compromise of code simplicity, and
> generality.  We could of course do something much closer to what
> sysfs does and allocate and refcount something similar to your
> ctl_table_poll when we have a ctl_table opened.  But that just looked
> like a pain.

I don't think we want spurious wakeups in favor of a slightly simpler code.


>
> Of course we already have spurious notifications for hostname and
> domainname when multiple uts namespaces are involved, but that
> is a different problem.
>
>> I couldn't apply this patch to any tree (linus/master + my previous
>> patch, your master, 3.3 + my previous patch), so I couldn't test. On
>> top of your tree:
>
> How odd.  It should have applied cleanly to my tree and it applies
> with just a two line offset top of Linus's latest with my tree merged
> in.  Those two lines of offset coming from the two extra includes
> that came in through the merge.
>
> patch -p1 --dry-run <  ~/tmp/sysctl-poll-test.patch
> patching file fs/proc/proc_sysctl.c
> Hunk #1 succeeded at 18 (offset 2 lines).
> Hunk #2 succeeded at 173 (offset 2 lines).
> Hunk #3 succeeded at 245 (offset 2 lines).
> Hunk #4 succeeded at 512 (offset 2 lines).
> Hunk #5 succeeded at 542 (offset 2 lines).
> Hunk #6 succeeded at 561 (offset 2 lines).
> patching file include/linux/sysctl.h
> patching file kernel/utsname_sysctl.c
>
>> [lucas@vader kernel]$ git am /tmp/a.patch
>> Applying: Making poll generally useful for sysctls
>> error: patch failed: fs/proc/proc_sysctl.c:16
>> error: fs/proc/proc_sysctl.c: patch does not apply
>> error: patch failed: include/linux/sysctl.h:992
>> error: include/linux/sysctl.h: patch does not apply
>> Patch failed at 0001 Making poll generally useful for sysctls
>
> Here is rebased version of the patch just in case that helps.

Now I can apply, but I can't boot: we hit a NULL dereference in
__wake_up_common(), called by proc_sys_poll_notify(). It seems that
you forgot to initialize the waitqueue with
__WAIT_QUEUE_HEAD_INITIALIZER().


Lucas De Marchi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [REVIEW][PATCH] Making poll generally useful for sysctls
  2012-03-26 17:44                 ` Lucas De Marchi
@ 2012-03-27  4:02                   ` Lucas De Marchi
  2012-03-28  2:00                     ` Eric W. Biederman
  0 siblings, 1 reply; 18+ messages in thread
From: Lucas De Marchi @ 2012-03-27  4:02 UTC (permalink / raw)
  To: Lucas De Marchi; +Cc: Eric W. Biederman, Linux Kernel, linux-fsdevel

On Mon, 26 Mar 2012 14:44:50 -0300
Lucas De Marchi <lucas.demarchi@profusion.mobi> wrote:

> Hi Eric,
> 
> On Sat, Mar 24, 2012 at 4:58 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> > Lucas De Marchi <lucas.demarchi@profusion.mobi> writes:
> >
> >>>  /* A sysctl table is an array of struct ctl_table: */
> >>>  struct ctl_table
> >>>  {
> >>>        const char *procname;           /* Text ID for /proc/sys, or zero */
> >>>        void *data;
> >>> +       atomic_t event;
> >>>        int maxlen;
> >>>        umode_t mode;
> >>>        struct ctl_table *child;        /* Deprecated */
> >>>        proc_handler *proc_handler;     /* Callback for text formatting */
> >>> -       struct ctl_table_poll *poll;
> >>>        void *extra1;
> >>>        void *extra2;
> >>>  };
> >>> @@ -1042,6 +1025,7 @@ struct ctl_table_header
> >>>                };
> >>>                struct rcu_head rcu;
> >>>        };
> >>> +       wait_queue_head_t wait;
> >>
> >> If you have a waitqueue per table instead of per entry you will get
> >> spurious notifications when other entries change. The easiest way to
> >> test this is to poll /proc/sys/kernel/hostname and change your
> >> domainname.
> >
> > You will get spurious wakeups but not spurious notifications to
> > userspace since event is still per entry.
> 
> Yeah, indeed.
> 
> > For my money that seemed a nice compromise of code simplicity, and
> > generality.  We could of course do something much closer to what
> > sysfs does and allocate and refcount something similar to your
> > ctl_table_poll when we have a ctl_table opened.  But that just looked
> > like a pain.
> 
> I don't think we want spurious wakeups in favor of a slightly simpler code.
> 
> 
> >
> > Of course we already have spurious notifications for hostname and
> > domainname when multiple uts namespaces are involved, but that
> > is a different problem.
> >
> >> I couldn't apply this patch to any tree (linus/master + my previous
> >> patch, your master, 3.3 + my previous patch), so I couldn't test. On
> >> top of your tree:
> >
> > How odd.  It should have applied cleanly to my tree and it applies
> > with just a two line offset top of Linus's latest with my tree merged
> > in.  Those two lines of offset coming from the two extra includes
> > that came in through the merge.
> >
> > patch -p1 --dry-run <  ~/tmp/sysctl-poll-test.patch
> > patching file fs/proc/proc_sysctl.c
> > Hunk #1 succeeded at 18 (offset 2 lines).
> > Hunk #2 succeeded at 173 (offset 2 lines).
> > Hunk #3 succeeded at 245 (offset 2 lines).
> > Hunk #4 succeeded at 512 (offset 2 lines).
> > Hunk #5 succeeded at 542 (offset 2 lines).
> > Hunk #6 succeeded at 561 (offset 2 lines).
> > patching file include/linux/sysctl.h
> > patching file kernel/utsname_sysctl.c
> >
> >> [lucas@vader kernel]$ git am /tmp/a.patch
> >> Applying: Making poll generally useful for sysctls
> >> error: patch failed: fs/proc/proc_sysctl.c:16
> >> error: fs/proc/proc_sysctl.c: patch does not apply
> >> error: patch failed: include/linux/sysctl.h:992
> >> error: include/linux/sysctl.h: patch does not apply
> >> Patch failed at 0001 Making poll generally useful for sysctls
> >
> > Here is rebased version of the patch just in case that helps.
> 
> Now I can apply, but I can't boot: we hit a NULL dereference in
> __wake_up_common(), called by proc_sys_poll_notify(). It seems that
> you forgot to initialize the waitqueue with
> __WAIT_QUEUE_HEAD_INITIALIZER().

Trying again I came up with the following simple oneliner on top
of your patch. With it I can boot successfully and poll any file
under /proc/sys (I didn't try many, but there's no reason it would not
work).

The nice part of this patch is that suddenly all sysctl entries can be
monitored through poll() instead of having to add adhoc code. However
that spurious wake ups are not very nice. Eric, what if we keep the
waitqueue inside the entry and initialize it there, just like we did
for ->event? This would mean iterating through them on unregister
though.

Lucas De Marchi

--------------------------

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 739615c..85ae957 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -168,6 +168,7 @@ static void init_header(struct ctl_table_header *head,
 	head->set = set;
 	head->parent = NULL;
 	head->node = node;
+	init_waitqueue_head(&head->wait);
 	if (node) {
 		struct ctl_table *entry;
 		for (entry = table; entry->procname; entry++, node++) {

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [REVIEW][PATCH] Making poll generally useful for sysctls
  2012-03-27  4:02                   ` Lucas De Marchi
@ 2012-03-28  2:00                     ` Eric W. Biederman
  0 siblings, 0 replies; 18+ messages in thread
From: Eric W. Biederman @ 2012-03-28  2:00 UTC (permalink / raw)
  To: Lucas De Marchi; +Cc: Linux Kernel, linux-fsdevel

Lucas De Marchi <lucas.demarchi@profusion.mobi> writes:

> On Mon, 26 Mar 2012 14:44:50 -0300
> Lucas De Marchi <lucas.demarchi@profusion.mobi> wrote:
>
>> Hi Eric,
>> 
>> On Sat, Mar 24, 2012 at 4:58 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>
>> > Here is rebased version of the patch just in case that helps.
>> 
>> Now I can apply, but I can't boot: we hit a NULL dereference in
>> __wake_up_common(), called by proc_sys_poll_notify(). It seems that
>> you forgot to initialize the waitqueue with
>> __WAIT_QUEUE_HEAD_INITIALIZER().
>
> Trying again I came up with the following simple oneliner on top
> of your patch. With it I can boot successfully and poll any file
> under /proc/sys (I didn't try many, but there's no reason it would not
> work).

Thanks. I feel silly for that pretty obvious oversight.

There is another bug I am seeing in the sysctl poll code.  It needs to
be .read that updates filp->private_data to event, and not .poll.
Otherwise we have what should be a level triggered interface acting like
an edge triggered interface.

Any chance I could get you to cook up a patch for that bug?

> The nice part of this patch is that suddenly all sysctl entries can be
> monitored through poll() instead of having to add adhoc code. However
> that spurious wake ups are not very nice. Eric, what if we keep the
> waitqueue inside the entry and initialize it there, just like we did
> for ->event? This would mean iterating through them on unregister
> though.

Iterating through the all of the table entries on unregister is
not a problem, some code paths for namespace support are doing that
already.  Putting the wait queue in struct ctl_table is something
we can't do.  struct ctl_table can be freed before the final fput
on a file descriptor and fs/select.c will try to remove freed
wait queue heads, which would get us back to where we came in.

What we can do is use struct ctl_node instead. Either bloating struct
ctl_node or adding putting a pointer to struct ctl_table_poll.  The
only tricky part is that I don't believe I have any size information
on how many ctl_node entries I have.  So that information would have
to be gathered and kept as well.

After having looked at how large wait_queue_head_t I am reluctant
to pay the price for keeping a wait queue for nodes that we are not
polling.  So I am thinking allocate in .poll and free in unregister,
but I don't think I am ambitious enough to code that up.

Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2012-03-28  1:56 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-13  0:58 [3.3-rc7] sys_poll use after free (hibernate) Dave Jones
2012-03-18 19:02 ` Linus Torvalds
2012-03-18 19:27   ` Al Viro
2012-03-19  8:17     ` Alexey Dobriyan
2012-03-20  6:08     ` Lucas De Marchi
2012-03-20 18:29       ` [PATCH] sysctl: protect poll() in entries that may go away Lucas De Marchi
2012-03-22 21:31       ` [3.3-rc7] sys_poll use after free (hibernate) Eric W. Biederman
2012-03-22 22:12         ` Lucas De Marchi
2012-03-22 23:02           ` Eric W. Biederman
2012-03-24  0:25           ` [REVIEW][PATCH] Making poll generally useful for sysctls Eric W. Biederman
2012-03-24  6:20             ` Lucas De Marchi
2012-03-24  7:58               ` Eric W. Biederman
2012-03-26 17:44                 ` Lucas De Marchi
2012-03-27  4:02                   ` Lucas De Marchi
2012-03-28  2:00                     ` Eric W. Biederman
2012-03-22 22:24     ` [3.3-rc7] sys_poll use after free (hibernate) Eric W. Biederman
2012-03-18 19:47 ` richard -rw- weinberger
2012-03-18 21:24   ` Dave Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).