All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] race condition in procfs
@ 2005-11-29  7:17 Grzegorz Nosek
  2005-11-29  8:09 ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Grzegorz Nosek @ 2005-11-29  7:17 UTC (permalink / raw)
  To: linux-kernel

Hello,

I found a race condition in procfs on SMP systems. The result is an
oops in processes like pidof. Apparently ->proc_read() gets passed a
potentially NULL pointer. The following micro-patch seems to fix it.

Best regards,
 Grzegorz Nosek

--- linux-2.6/fs/proc/base.c.orig 2005-11-25 00:07:43.000000000 +0100
+++ linux-2.6/fs/proc/base.c      2005-11-28 11:44:11.000000000 +0100
@@ -718,6 +718,9 @@
       ssize_t length;
       struct task_struct *task = proc_task(inode);

+       if (!task)
+               return -ENOENT;
+
       if (count > PROC_BLOCK_SIZE)
               count = PROC_BLOCK_SIZE;
       if (!(page = __get_free_page(GFP_KERNEL)))

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-29  7:17 [PATCH] race condition in procfs Grzegorz Nosek
@ 2005-11-29  8:09 ` Andrew Morton
  2005-11-29  8:38   ` Grzegorz Nosek
  2005-11-29 15:22   ` [PATCH] shrinks dentry struct Eric Dumazet
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2005-11-29  8:09 UTC (permalink / raw)
  To: Grzegorz Nosek; +Cc: linux-kernel

Grzegorz Nosek <grzegorz.nosek@gmail.com> wrote:
>
> Hello,
> 
> I found a race condition in procfs on SMP systems. The result is an
> oops in processes like pidof. Apparently ->proc_read() gets passed a
> potentially NULL pointer.

Do you know what the race is?

How does one reproduce it?

> The following micro-patch seems to fix it.

It might be right, or it might be a workaround..


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-29  8:09 ` Andrew Morton
@ 2005-11-29  8:38   ` Grzegorz Nosek
  2005-11-29 13:25     ` Grzegorz Nosek
  2005-11-29 15:22   ` [PATCH] shrinks dentry struct Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Grzegorz Nosek @ 2005-11-29  8:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1601 bytes --]

2005/11/29, Andrew Morton <akpm@osdl.org>:
> > I found a race condition in procfs on SMP systems. The result is an
> > oops in processes like pidof. Apparently ->proc_read() gets passed a
> > potentially NULL pointer.
>
> Do you know what the race is?

Apparently it's a race between deleting a process and accessing its
/proc/pid entries. It came out in pidof while it was accessing
/proc/pid/stat (fs/proc/array.c:do_task_stat crashed on first
instruction - it was an inline function accessing task->state,
get_task_state IIRC). oops (with vserver history data - I'm using a
patch mentioned below) is attached.

>
> How does one reproduce it?

I managed to reproduce it (although not reliably) during high CPU load
and I/O (parallel kernel compiles) on SMP systems with the vserver
patch (http://linux-vserver.org, the exact patch is
http://vserver.13thfloor.at/Experimental/patch-2.6.14.2-vs2.1.0-rc8.diff),
but the vserver maintainer pointed out that it probably is a mainline
issue. We're not using 2.6 systems too much except for the vserver
test beds so I cannot tell if it happens on vanilla kernels.

>
> > The following micro-patch seems to fix it.
>
> It might be right, or it might be a workaround..
>

I'm not a kernel guru so it's just my proposal. Can it break anything?
An alternative _might_ be somewhat coarser task_struct locking
(do_task_stat grabs a spinlock but then it's already too late).
However, if no "right" solution appears, I'll keep using my two-liner
because it seems to help, at least in my setup.

Best regards,
 Grzegorz Nosek

[-- Attachment #2: oops.s35 --]
[-- Type: application/octet-stream, Size: 8426 bytes --]

Nov 27 00:15:26 s35 [43281574.240000] Unable to handle kernel NULL pointer dereference
Nov 27 00:15:26 s35 at virtual address 00000000 
Nov 27 00:15:26 s35 [43281574.240000]  printing eip: 
Nov 27 00:15:26 s35 [43281574.240000] a01b50eb 
Nov 27 00:15:26 s35 [43281574.240000] *pde = 00000000 
Nov 27 00:15:26 s35 [43281574.240000] Oops: 0000 [#1] 
Nov 27 00:15:26 s35 [43281574.240000] SMP 
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] Modules linked in:
Nov 27 00:15:26 s35 ipt_owner
Nov 27 00:15:26 s35 ipt_state
Nov 27 00:15:26 s35 iptable_filter
Nov 27 00:15:26 s35 netconsole
Nov 27 00:15:26 s35 uhci_hcd
Nov 27 00:15:26 s35 ohci_hcd
Nov 27 00:15:26 s35 ehci_hcd
Nov 27 00:15:26 s35 usbcore
Nov 27 00:15:26 s35 ip_conntrack_ftp
Nov 27 00:15:26 s35 ip_conntrack
Nov 27 00:15:26 s35 forcedeth
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] CPU:    1 
Nov 27 00:15:26 s35 [43281574.240000] EIP:    0060:[<a01b50eb>]    Not tainted VLI 
Nov 27 00:15:26 s35 [43281574.240000] EFLAGS: 00010257   (2.6.14.2amd64smp.17)  
Nov 27 00:15:26 s35 [43281574.240000] EIP is at do_task_stat+0x8b/0x890 
Nov 27 00:15:26 s35 [43281574.240000] eax: 00000000   ebx: 00000000   ecx: a0601700   edx: c804ad48 
Nov 27 00:15:26 s35 [43281574.240000] esi: b3fbe000   edi: f666aa70   ebp: d7e65f20   esp: d7e65da0 
Nov 27 00:15:26 s35 [43281574.240000] ds: 007b   es: 007b   ss: 0068 
Nov 27 00:15:26 s35 [43281574.240000] Process pidof (pid: 4723, threadinfo=d7e64000 task=e24e7550)
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] Stack: 
Nov 27 00:15:26 s35 a01b1e2e 
Nov 27 00:15:26 s35 f666aa70 
Nov 27 00:15:26 s35 d7e65f28 
Nov 27 00:15:26 s35 a8cab11c 
Nov 27 00:15:26 s35 d7e65e24 
Nov 27 00:15:26 s35 d7e65de8 
Nov 27 00:15:26 s35 a0184934 
Nov 27 00:15:26 s35 d7e65e24 
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]        
Nov 27 00:15:26 s35 a8cab544 
Nov 27 00:15:26 s35 d7e65de8 
Nov 27 00:15:26 s35 a019090d 
Nov 27 00:15:26 s35 a8cab544 
Nov 27 00:15:26 s35 a0720a00 
Nov 27 00:15:26 s35 d7e65df8 
Nov 27 00:15:26 s35 a2227140 
Nov 27 00:15:26 s35 00000000 
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]        
Nov 27 00:15:26 s35 00000000 
Nov 27 00:15:26 s35 d7e65e2c 
Nov 27 00:15:26 s35 d7e65e48 
Nov 27 00:15:26 s35 a0185664 
Nov 27 00:15:26 s35 a8cab544 
Nov 27 00:15:26 s35 d7e65e2c 
Nov 27 00:15:26 s35 d7e65e24 
Nov 27 00:15:26 s35 c94ff00b 
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] Call Trace: 
Nov 27 00:15:26 s35 [43281574.240000]  [<a0103e9f>] 
Nov 27 00:15:26 s35 show_stack+0x7f/0xa0
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a010403d>] 
Nov 27 00:15:26 s35 show_registers+0x15d/0x1d0
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a0104252>] 
Nov 27 00:15:26 s35 die+0x112/0x1c0
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a055c2b9>] 
Nov 27 00:15:26 s35 do_page_fault+0x3d9/0x650
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a0103b53>] 
Nov 27 00:15:26 s35 error_code+0x4f/0x54
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a01b5940>] 
Nov 27 00:15:26 s35 proc_tgid_stat+0x20/0x30
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a01b0f75>] 
Nov 27 00:15:26 s35 proc_info_read+0x55/0xa0
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a0174d68>] 
Nov 27 00:15:26 s35 vfs_read+0x198/0x1a0
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a017506b>] 
Nov 27 00:15:26 s35 sys_read+0x4b/0x80
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a010302d>] 
Nov 27 00:15:26 s35 syscall_call+0x7/0xb
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] Code: 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 c7 
Nov 27 00:15:26 s35 85 
Nov 27 00:15:26 s35 6c 
Nov 27 00:15:26 s35 ff 
Nov 27 00:15:26 s35 ff 
Nov 27 00:15:26 s35 ff 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 8b 
Nov 27 00:15:26 s35 07 
Nov 27 00:15:26 s35 8b 
Nov 27 00:15:26 s35 9f 
Nov 27 00:15:26 s35 84 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 25 
Nov 27 00:15:26 s35 8f 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 83 
Nov 27 00:15:26 s35 e3 
Nov 27 00:15:26 s35 30 
Nov 27 00:15:26 s35 09 
Nov 27 00:15:26 s35 d8 
Nov 27 00:15:26 s35 eb 
Nov 27 00:15:26 s35 05 
Nov 27 00:15:26 s35 83 
Nov 27 00:15:26 s35 c1 
Nov 27 00:15:26 s35 04 
Nov 27 00:15:26 s35 d1 
Nov 27 00:15:26 s35 e8 
Nov 27 00:15:26 s35 75 
Nov 27 00:15:26 s35 f9 
Nov 27 00:15:26 s35 8b 
Nov 27 00:15:26 s35 01 
Nov 27 00:15:26 s35 unparseable log message: "<0f> "
Nov 27 00:15:26 s35 b6 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 c7 
Nov 27 00:15:26 s35 45 
Nov 27 00:15:26 s35 c8 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 c7 
Nov 27 00:15:26 s35 45 
Nov 27 00:15:26 s35 cc 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 c7 
Nov 27 00:15:26 s35 45 
Nov 27 00:15:26 s35 d0 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] History:	SEQ:  3ddca14	NR_CPUS: 8 
Nov 27 00:15:26 s35 [43281574.240000] (#c9ae,*0):a04d546e set_vx_info f6e48000[#830,190.71] @f4fcf4e8 
Nov 27 00:15:26 s35 [43281574.240000] (#c964,*1):a013ac82 release_vx_info f6e48000[#830,190.74] @c5cdb030 
Nov 27 00:15:26 s35 [43281574.240000] (#ca13,*0):a04d40b2 clr_vx_info f6e48000[#830,188.71] @db738068 
Nov 27 00:15:26 s35 [43281574.240000] (#ca14,*1):a0104140 oops  
Nov 27 00:15:26 s35 [43281574.240000] (#ca12,*0):a04d40b2 clr_vx_info f6e48000[#830,189.71] @db739b68 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0f,*1):a011c57c clr_vx_info f6e1e000[#831,151.39] @f6f1bad0 
Nov 27 00:15:26 s35 [43281574.240000] (#ca11,*0):a04d40b2 clr_vx_info f6e48000[#830,190.71] @ad3a16e8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0e,*1):a011c45c set_vx_info f6e1e000[#831,150.39] @f6f1a210 
Nov 27 00:15:26 s35 [43281574.240000] (#ca10,*0):a04d40b2 clr_vx_info f6e48000[#830,191.71] @ed1a7b68 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0d,*1):a04d40b2 clr_vx_info f6e48000[#830,192.71] @b6b98ae8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca03,*0):a04d4556 set_vx_info f6e48000[#830,190.71] @ba389268 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0c,*1):a011ddac claim_vx_info f6e1e000[#831,150.38] @e24e7550 
Nov 27 00:15:26 s35 [43281574.240000] (#ca02,*0):a04d40b2 clr_vx_info f6e48000[#830,191.71] @b6b98d68 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0b,*1):a011c45c set_vx_info f6e1e000[#831,149.38] @f6f1bad0 
Nov 27 00:15:26 s35 [43281574.240000] (#ca01,*0):a04d546e set_vx_info f6e48000[#830,190.71] @b6b98d68 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0a,*1):a011d38c init_vx_info f6e1e000[#831,148.38] @e24e79f8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca00,*0):a04d40b2 clr_vx_info f6e48000[#830,191.71] @f4fcf4e8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca09,*1):a011bf10 clr_vx_info f6e1e000[#831,149.38] @e24e79f8 
Nov 27 00:15:26 s35 [43281574.240000] (#c9ff,*0):a04d40b2 clr_vx_info f6e48000[#830,192.71] @f56bbde8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca08,*1):a04d546e set_vx_info f6e48000[#830,191.71] @b6b98ae8 
Nov 27 00:15:26 s35 [43281574.240000] (#c9fe,*0):a04d546e set_vx_info f6e48000[#830,191.71] @f56bbde8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca07,*1):a04d40b2 clr_vx_info f6e48000[#830,192.71] @cbec5068 
Nov 27 00:15:26 s35 [43281574.240000] (#c9fd,*0):a04d546e set_vx_info f6e48000[#830,190.71] @f4fcf4e8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca06,*1):a04d40b2 clr_vx_info f6e48000[#830,193.71] @cbec5ba8 
Nov 27 00:15:26 s35 [43281574.240000] (#c9fc,*0):a04d4556 set_vx_info f6e48000[#830,189.71] @ad3a16e8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca05,*1):a04d546e set_vx_info f6e48000[#830,192.71] @cbec5ba8 
Nov 27 00:15:26 s35 [43281574.240000] (#c9fb,*0):a04d40b2 clr_vx_info f6e48000[#830,190.71] @ad3a16e8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca04,*1):a04d546e set_vx_info f6e48000[#830,191.71] @cbec5068 
Nov 27 00:15:26 s35 [43281574.240000] (#c9f7,*0):a011c57c clr_vx_info f6e1e000[#831,148.37] @f6f1a790 
Nov 27 00:15:26 s35 [43281574.240000] (#c9fa,*1):a011ddac claim_vx_info f6e1e000[#831,149.37] @e4856550 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-29  8:38   ` Grzegorz Nosek
@ 2005-11-29 13:25     ` Grzegorz Nosek
  2005-11-29 14:04       ` Grzegorz Nosek
  0 siblings, 1 reply; 22+ messages in thread
From: Grzegorz Nosek @ 2005-11-29 13:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, vserver

> >
> > Do you know what the race is?
>
> Apparently it's a race between deleting a process and accessing its
> /proc/pid entries. It came out in pidof while it was accessing
> /proc/pid/stat (fs/proc/array.c:do_task_stat crashed on first
> instruction - it was an inline function accessing task->state,
> get_task_state IIRC). oops (with vserver history data - I'm using a
> patch mentioned below) is attached.
>
> >
> > How does one reproduce it?
>
> I managed to reproduce it (although not reliably) during high CPU load
> and I/O (parallel kernel compiles) on SMP systems with the vserver
> patch (http://linux-vserver.org, the exact patch is
> http://vserver.13thfloor.at/Experimental/patch-2.6.14.2-vs2.1.0-rc8.diff),
> but the vserver maintainer pointed out that it probably is a mainline
> issue. We're not using 2.6 systems too much except for the vserver
> test beds so I cannot tell if it happens on vanilla kernels.
>
> >
> > > The following micro-patch seems to fix it.
> >
> > It might be right, or it might be a workaround..
> >
>
> I'm not a kernel guru so it's just my proposal. Can it break anything?
> An alternative _might_ be somewhat coarser task_struct locking
> (do_task_stat grabs a spinlock but then it's already too late).
> However, if no "right" solution appears, I'll keep using my two-liner
> because it seems to help, at least in my setup.
>

Oh well, I got another oops in the very same place with the patch
applied. So now I surrounded the check with
read_[un]lock(&tasklist_lock) and added a check to do_task_stat (both
now have a printk). If it builds, boots and doesn't crash, I'll post
the patch.

Best regards,
 Grzegorz Nosek

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-29 13:25     ` Grzegorz Nosek
@ 2005-11-29 14:04       ` Grzegorz Nosek
  2005-11-29 14:28         ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Grzegorz Nosek @ 2005-11-29 14:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, vserver

2005/11/29, Grzegorz Nosek <grzegorz.nosek@gmail.com>:
>
> Oh well, I got another oops in the very same place with the patch
> applied. So now I surrounded the check with
> read_[un]lock(&tasklist_lock) and added a check to do_task_stat (both
> now have a printk). If it builds, boots and doesn't crash, I'll post
> the patch.

Froze solid a minute after booting :( netconsole didn't log anything
either. Back to the drawing board.

Best regards,
 Grzegorz Nosek

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-29 14:04       ` Grzegorz Nosek
@ 2005-11-29 14:28         ` Steven Rostedt
  2005-11-29 14:39           ` Grzegorz Nosek
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2005-11-29 14:28 UTC (permalink / raw)
  To: Grzegorz Nosek; +Cc: vserver, linux-kernel, Andrew Morton

On Tue, 2005-11-29 at 15:04 +0100, Grzegorz Nosek wrote:
> 2005/11/29, Grzegorz Nosek <grzegorz.nosek@gmail.com>:
> >
> > Oh well, I got another oops in the very same place with the patch
> > applied. So now I surrounded the check with
> > read_[un]lock(&tasklist_lock) and added a check to do_task_stat (both
> > now have a printk). If it builds, boots and doesn't crash, I'll post
> > the patch.
> 
> Froze solid a minute after booting :( netconsole didn't log anything
> either. Back to the drawing board.

Have you seen this crash the vanilla kernel?  What exactly are you doing
to see the crash? If you have a script or something, could you post it.
I could spend some time helping you debug it too on one of my SMP boxes.

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-29 14:28         ` Steven Rostedt
@ 2005-11-29 14:39           ` Grzegorz Nosek
  2005-11-29 14:49             ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Grzegorz Nosek @ 2005-11-29 14:39 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: vserver, linux-kernel, Andrew Morton

2005/11/29, Steven Rostedt <rostedt@goodmis.org>:
> Have you seen this crash the vanilla kernel?  What exactly are you doing
> to see the crash? If you have a script or something, could you post it.
> I could spend some time helping you debug it too on one of my SMP boxes.
>

I'm not really using vanilla 2.6 kernels and my setup would be quite
hard to run on a vanilla kernel.

The reproduceability of this bug varies. Sometimes it'll go for a few
days without happening, sometimes it's a matter of a few minutes. I'm
beginning to feel it's a vserver issue after all, somehow related to
pid virtualisation (it maps some vxi->vx_initpid to 1).

Thus I cannot provide a simple script to trigger the bug (I wish I
could) but often doing a -j8 kernel compile in a vserver is enough.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-29 14:39           ` Grzegorz Nosek
@ 2005-11-29 14:49             ` Steven Rostedt
  2005-11-30 14:41               ` Grzegorz Nosek
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2005-11-29 14:49 UTC (permalink / raw)
  To: Grzegorz Nosek; +Cc: vserver, linux-kernel, Andrew Morton


On Tue, 29 Nov 2005, Grzegorz Nosek wrote:

>
> I'm not really using vanilla 2.6 kernels and my setup would be quite
> hard to run on a vanilla kernel.
>
> The reproduceability of this bug varies. Sometimes it'll go for a few
> days without happening, sometimes it's a matter of a few minutes. I'm
> beginning to feel it's a vserver issue after all, somehow related to
> pid virtualisation (it maps some vxi->vx_initpid to 1).
>
> Thus I cannot provide a simple script to trigger the bug (I wish I
> could) but often doing a -j8 kernel compile in a vserver is enough.
>

What you are showing, would probably show up by others if this were a
vanilla kernel issue.  I don't have an 8 way machine, just 2 way, but the
vanilla kernel is being used on many 8 ways out there, so I think you are
right that this _is_ a vserver issue.

Unless, of course, that the vserver is producing an obscure race in the
vanilla kernel that normal operations would seldom have.  Just like the
PREEMPT_RT patch has discovered many race conditions that were in the
vanilla kernel but were not often a problem.

-- Steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] shrinks dentry struct
  2005-11-29  8:09 ` Andrew Morton
  2005-11-29  8:38   ` Grzegorz Nosek
@ 2005-11-29 15:22   ` Eric Dumazet
  2005-11-30  2:06     ` Paul Jackson
                       ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Eric Dumazet @ 2005-11-29 15:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1106 bytes --]

Hi Andrew

Could you add this patch to mm ?

Thank you

[PATCH] shrinks dentry struct

Some long time ago, dentry struct was carefully tuned so that on 32 bits UP, 
sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple of 
memory cache lines.

Then RCU was added and dentry struct enlarged by two pointers, with nice 
results for SMP, but not so good on UP, because breaking the above tuning (128 
+ 8 = 136 bytes)

This patch reverts this unwanted side effect, by using an union (d_u), where 
d_rcu and d_child are placed so that these two fields can share their memory 
needs.

At the time d_free() is called (and d_rcu is really used), d_child is known to 
be empty and not touched by the dentry freeing.

Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so the 
previous content of d_child is not needed if said dentry was unhashed but 
still accessed by a CPU because of RCU constraints)

As dentry cache easily contains millions of entries, a size reduction is worth 
the extra complexity of the ugly C union.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


[-- Attachment #2: shrink_dentry_struct.patch --]
[-- Type: text/plain, Size: 14773 bytes --]

--- linux-2.6.15-rc3/include/linux/dcache.h	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/include/linux/dcache.h	2005-11-29 12:04:54.000000000 +0100
@@ -95,14 +95,19 @@
 	struct qstr d_name;
 
 	struct list_head d_lru;		/* LRU list */
-	struct list_head d_child;	/* child of parent list */
+	/*
+	 * d_child and d_rcu can share memory
+	 */
+	union {
+		struct list_head d_child;	/* child of parent list */
+	 	struct rcu_head d_rcu;
+	} d_u;
 	struct list_head d_subdirs;	/* our children */
 	struct list_head d_alias;	/* inode alias list */
 	unsigned long d_time;		/* used by d_revalidate */
 	struct dentry_operations *d_op;
 	struct super_block *d_sb;	/* The root of the dentry tree */
 	void *d_fsdata;			/* fs-specific data */
- 	struct rcu_head d_rcu;
 	struct dcookie_struct *d_cookie; /* cookie, if any */
 	int d_mounted;
 	unsigned char d_iname[DNAME_INLINE_LEN_MIN];	/* small names */
--- linux-2.6.15-rc3/fs/dcache.c	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/dcache.c	2005-11-29 15:07:10.000000000 +0100
@@ -71,7 +71,7 @@
 
 static void d_callback(struct rcu_head *head)
 {
-	struct dentry * dentry = container_of(head, struct dentry, d_rcu);
+	struct dentry * dentry = container_of(head, struct dentry, d_u.d_rcu);
 
 	if (dname_external(dentry))
 		kfree(dentry->d_name.name);
@@ -86,7 +86,7 @@
 {
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
- 	call_rcu(&dentry->d_rcu, d_callback);
+ 	call_rcu(&dentry->d_u.d_rcu, d_callback);
 }
 
 /*
@@ -193,7 +193,7 @@
   			list_del(&dentry->d_lru);
   			dentry_stat.nr_unused--;
   		}
-  		list_del(&dentry->d_child);
+  		list_del(&dentry->d_u.d_child);
 		dentry_stat.nr_dentry--;	/* For d_free, below */
 		/*drops the locks, at that point nobody can reach this dentry */
 		dentry_iput(dentry);
@@ -367,7 +367,7 @@
 	struct dentry * parent;
 
 	__d_drop(dentry);
-	list_del(&dentry->d_child);
+	list_del(&dentry->d_u.d_child);
 	dentry_stat.nr_dentry--;	/* For d_free, below */
 	dentry_iput(dentry);
 	parent = dentry->d_parent;
@@ -518,7 +518,7 @@
 resume:
 	while (next != &this_parent->d_subdirs) {
 		struct list_head *tmp = next;
-		struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
+		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
 		/* Have we found a mount point ? */
 		if (d_mountpoint(dentry))
@@ -532,7 +532,7 @@
 	 * All done at this level ... ascend and resume the search.
 	 */
 	if (this_parent != parent) {
-		next = this_parent->d_child.next; 
+		next = this_parent->d_u.d_child.next; 
 		this_parent = this_parent->d_parent;
 		goto resume;
 	}
@@ -569,7 +569,7 @@
 resume:
 	while (next != &this_parent->d_subdirs) {
 		struct list_head *tmp = next;
-		struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
+		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
 
 		if (!list_empty(&dentry->d_lru)) {
@@ -610,7 +610,7 @@
 	 * All done at this level ... ascend and resume the search.
 	 */
 	if (this_parent != parent) {
-		next = this_parent->d_child.next; 
+		next = this_parent->d_u.d_child.next; 
 		this_parent = this_parent->d_parent;
 #ifdef DCACHE_DEBUG
 printk(KERN_DEBUG "select_parent: ascending to %s/%s, found=%d\n",
@@ -753,12 +753,12 @@
 		dentry->d_parent = dget(parent);
 		dentry->d_sb = parent->d_sb;
 	} else {
-		INIT_LIST_HEAD(&dentry->d_child);
+		INIT_LIST_HEAD(&dentry->d_u.d_child);
 	}
 
 	spin_lock(&dcache_lock);
 	if (parent)
-		list_add(&dentry->d_child, &parent->d_subdirs);
+		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
 	dentry_stat.nr_dentry++;
 	spin_unlock(&dcache_lock);
 
@@ -1310,8 +1310,8 @@
 	/* Unhash the target: dput() will then get rid of it */
 	__d_drop(target);
 
-	list_del(&dentry->d_child);
-	list_del(&target->d_child);
+	list_del(&dentry->d_u.d_child);
+	list_del(&target->d_u.d_child);
 
 	/* Switch the names.. */
 	switch_names(dentry, target);
@@ -1322,15 +1322,15 @@
 	if (IS_ROOT(dentry)) {
 		dentry->d_parent = target->d_parent;
 		target->d_parent = target;
-		INIT_LIST_HEAD(&target->d_child);
+		INIT_LIST_HEAD(&target->d_u.d_child);
 	} else {
 		do_switch(dentry->d_parent, target->d_parent);
 
 		/* And add them back to the (new) parent lists */
-		list_add(&target->d_child, &target->d_parent->d_subdirs);
+		list_add(&target->d_u.d_child, &target->d_parent->d_subdirs);
 	}
 
-	list_add(&dentry->d_child, &dentry->d_parent->d_subdirs);
+	list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
 	spin_unlock(&target->d_lock);
 	spin_unlock(&dentry->d_lock);
 	write_sequnlock(&rename_lock);
@@ -1568,7 +1568,7 @@
 resume:
 	while (next != &this_parent->d_subdirs) {
 		struct list_head *tmp = next;
-		struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
+		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
 		if (d_unhashed(dentry)||!dentry->d_inode)
 			continue;
@@ -1579,7 +1579,7 @@
 		atomic_dec(&dentry->d_count);
 	}
 	if (this_parent != root) {
-		next = this_parent->d_child.next; 
+		next = this_parent->d_u.d_child.next; 
 		atomic_dec(&this_parent->d_count);
 		this_parent = this_parent->d_parent;
 		goto resume;
--- linux-2.6.15-rc3/fs/autofs4/autofs_i.h	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/autofs4/autofs_i.h	2005-11-29 12:04:54.000000000 +0100
@@ -209,7 +209,7 @@
 	struct dentry *child;
 	int ret = 0;
 
-	list_for_each_entry(child, &dentry->d_subdirs, d_child)
+	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
 		if (simple_positive(child))
 			goto out;
 	ret = 1;
--- linux-2.6.15-rc3/fs/autofs4/expire.c	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/autofs4/expire.c	2005-11-29 12:04:54.000000000 +0100
@@ -105,7 +105,7 @@
 	next = this_parent->d_subdirs.next;
 resume:
 	while (next != &this_parent->d_subdirs) {
-		struct dentry *dentry = list_entry(next, struct dentry, d_child);
+		struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
 
 		/* Negative dentry - give up */
 		if (!simple_positive(dentry)) {
@@ -138,7 +138,7 @@
 	}
 
 	if (this_parent != top) {
-		next = this_parent->d_child.next;
+		next = this_parent->d_u.d_child.next;
 		this_parent = this_parent->d_parent;
 		goto resume;
 	}
@@ -163,7 +163,7 @@
 	next = this_parent->d_subdirs.next;
 resume:
 	while (next != &this_parent->d_subdirs) {
-		struct dentry *dentry = list_entry(next, struct dentry, d_child);
+		struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
 
 		/* Negative dentry - give up */
 		if (!simple_positive(dentry)) {
@@ -199,7 +199,7 @@
 	}
 
 	if (this_parent != parent) {
-		next = this_parent->d_child.next;
+		next = this_parent->d_u.d_child.next;
 		this_parent = this_parent->d_parent;
 		goto resume;
 	}
@@ -238,7 +238,7 @@
 	/* On exit from the loop expire is set to a dgot dentry
 	 * to expire or it's NULL */
 	while ( next != &root->d_subdirs ) {
-		struct dentry *dentry = list_entry(next, struct dentry, d_child);
+		struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
 
 		/* Negative dentry - give up */
 		if ( !simple_positive(dentry) ) {
@@ -302,7 +302,7 @@
 			expired, (int)expired->d_name.len, expired->d_name.name);
 		spin_lock(&dcache_lock);
 		list_del(&expired->d_parent->d_subdirs);
-		list_add(&expired->d_parent->d_subdirs, &expired->d_child);
+		list_add(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
 		spin_unlock(&dcache_lock);
 		return expired;
 	}
--- linux-2.6.15-rc3/fs/autofs4/inode.c	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/autofs4/inode.c	2005-11-29 12:04:54.000000000 +0100
@@ -91,7 +91,7 @@
 	next = this_parent->d_subdirs.next;
 resume:
 	while (next != &this_parent->d_subdirs) {
-		struct dentry *dentry = list_entry(next, struct dentry, d_child);
+		struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
 
 		/* Negative dentry - don`t care */
 		if (!simple_positive(dentry)) {
@@ -117,7 +117,7 @@
 	if (this_parent != sbi->root) {
 		struct dentry *dentry = this_parent;
 
-		next = this_parent->d_child.next;
+		next = this_parent->d_u.d_child.next;
 		this_parent = this_parent->d_parent;
 		spin_unlock(&dcache_lock);
 		DPRINTK("parent dentry %p %.*s",
--- linux-2.6.15-rc3/fs/coda/cache.c	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/coda/cache.c	2005-11-29 12:05:21.000000000 +0100
@@ -93,7 +93,7 @@
 	spin_lock(&dcache_lock);
 	list_for_each(child, &parent->d_subdirs)
 	{
-		de = list_entry(child, struct dentry, d_child);
+		de = list_entry(child, struct dentry, d_u.d_child);
 		/* don't know what to do with negative dentries */
 		if ( ! de->d_inode ) 
 			continue;
--- linux-2.6.15-rc3/fs/libfs.c	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/libfs.c	2005-11-29 14:58:51.000000000 +0100
@@ -93,16 +93,16 @@
 			loff_t n = file->f_pos - 2;
 
 			spin_lock(&dcache_lock);
-			list_del(&cursor->d_child);
+			list_del(&cursor->d_u.d_child);
 			p = file->f_dentry->d_subdirs.next;
 			while (n && p != &file->f_dentry->d_subdirs) {
 				struct dentry *next;
-				next = list_entry(p, struct dentry, d_child);
+				next = list_entry(p, struct dentry, d_u.d_child);
 				if (!d_unhashed(next) && next->d_inode)
 					n--;
 				p = p->next;
 			}
-			list_add_tail(&cursor->d_child, p);
+			list_add_tail(&cursor->d_u.d_child, p);
 			spin_unlock(&dcache_lock);
 		}
 	}
@@ -126,7 +126,7 @@
 {
 	struct dentry *dentry = filp->f_dentry;
 	struct dentry *cursor = filp->private_data;
-	struct list_head *p, *q = &cursor->d_child;
+	struct list_head *p, *q = &cursor->d_u.d_child;
 	ino_t ino;
 	int i = filp->f_pos;
 
@@ -153,7 +153,7 @@
 			}
 			for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
 				struct dentry *next;
-				next = list_entry(p, struct dentry, d_child);
+				next = list_entry(p, struct dentry, d_u.d_child);
 				if (d_unhashed(next) || !next->d_inode)
 					continue;
 
@@ -261,7 +261,7 @@
 	int ret = 0;
 
 	spin_lock(&dcache_lock);
-	list_for_each_entry(child, &dentry->d_subdirs, d_child)
+	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
 		if (simple_positive(child))
 			goto out;
 	ret = 1;
--- linux-2.6.15-rc3/fs/ncpfs/dir.c	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/ncpfs/dir.c	2005-11-29 14:59:31.000000000 +0100
@@ -365,7 +365,7 @@
 	spin_lock(&dcache_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
-		dent = list_entry(next, struct dentry, d_child);
+		dent = list_entry(next, struct dentry, d_u.d_child);
 		if ((unsigned long)dent->d_fsdata == fpos) {
 			if (dent->d_inode)
 				dget_locked(dent);
--- linux-2.6.15-rc3/fs/ncpfs/ncplib_kernel.h	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/ncpfs/ncplib_kernel.h	2005-11-29 14:59:31.000000000 +0100
@@ -196,7 +196,7 @@
 	spin_lock(&dcache_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
-		dentry = list_entry(next, struct dentry, d_child);
+		dentry = list_entry(next, struct dentry, d_u.d_child);
 
 		if (dentry->d_fsdata == NULL)
 			ncp_age_dentry(server, dentry);
@@ -218,7 +218,7 @@
 	spin_lock(&dcache_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
-		dentry = list_entry(next, struct dentry, d_child);
+		dentry = list_entry(next, struct dentry, d_u.d_child);
 		dentry->d_fsdata = NULL;
 		ncp_age_dentry(server, dentry);
 		next = next->next;
--- linux-2.6.15-rc3/fs/smbfs/cache.c	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/smbfs/cache.c	2005-11-29 14:59:47.000000000 +0100
@@ -66,7 +66,7 @@
 	spin_lock(&dcache_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
-		dentry = list_entry(next, struct dentry, d_child);
+		dentry = list_entry(next, struct dentry, d_u.d_child);
 		dentry->d_fsdata = NULL;
 		smb_age_dentry(server, dentry);
 		next = next->next;
@@ -100,7 +100,7 @@
 	spin_lock(&dcache_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
-		dent = list_entry(next, struct dentry, d_child);
+		dent = list_entry(next, struct dentry, d_u.d_child);
 		if ((unsigned long)dent->d_fsdata == fpos) {
 			if (dent->d_inode)
 				dget_locked(dent);
--- linux-2.6.15-rc3/kernel/cpuset.c	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/kernel/cpuset.c	2005-11-29 15:01:01.000000000 +0100
@@ -304,7 +304,7 @@
 	spin_lock(&dcache_lock);
 	node = dentry->d_subdirs.next;
 	while (node != &dentry->d_subdirs) {
-		struct dentry *d = list_entry(node, struct dentry, d_child);
+		struct dentry *d = list_entry(node, struct dentry, d_u.d_child);
 		list_del_init(node);
 		if (d->d_inode) {
 			d = dget_locked(d);
@@ -316,7 +316,7 @@
 		}
 		node = dentry->d_subdirs.next;
 	}
-	list_del_init(&dentry->d_child);
+	list_del_init(&dentry->d_u.d_child);
 	spin_unlock(&dcache_lock);
 	remove_dir(dentry);
 }
--- linux-2.6.15-rc3/drivers/usb/core/inode.c	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/drivers/usb/core/inode.c	2005-11-29 12:04:54.000000000 +0100
@@ -186,7 +186,7 @@
 
 	down(&bus->d_inode->i_sem);
 
-	list_for_each_entry(dev, &bus->d_subdirs, d_child)
+	list_for_each_entry(dev, &bus->d_subdirs, d_u.d_child)
 		if (dev->d_inode)
 			update_dev(dev);
 
@@ -203,7 +203,7 @@
 
 	down(&root->d_inode->i_sem);
 
-	list_for_each_entry(bus, &root->d_subdirs, d_child) {
+	list_for_each_entry(bus, &root->d_subdirs, d_u.d_child) {
 		if (bus->d_inode) {
 			switch (S_IFMT & bus->d_inode->i_mode) {
 			case S_IFDIR:
@@ -319,7 +319,7 @@
 	spin_lock(&dcache_lock);
 
 	list_for_each(list, &dentry->d_subdirs) {
-		struct dentry *de = list_entry(list, struct dentry, d_child);
+		struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
 		if (usbfs_positive(de)) {
 			spin_unlock(&dcache_lock);
 			return 0;
--- linux-2.6.15-rc3/net/sunrpc/rpc_pipe.c	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/net/sunrpc/rpc_pipe.c	2005-11-29 15:01:01.000000000 +0100
@@ -492,7 +492,7 @@
 repeat:
 	spin_lock(&dcache_lock);
 	list_for_each_safe(pos, next, &parent->d_subdirs) {
-		dentry = list_entry(pos, struct dentry, d_child);
+		dentry = list_entry(pos, struct dentry, d_u.d_child);
 		spin_lock(&dentry->d_lock);
 		if (!d_unhashed(dentry)) {
 			dget_locked(dentry);
--- linux-2.6.15-rc3/security/selinux/selinuxfs.c	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/security/selinux/selinuxfs.c	2005-11-29 15:01:09.000000000 +0100
@@ -889,7 +889,7 @@
 	spin_lock(&dcache_lock);
 	node = de->d_subdirs.next;
 	while (node != &de->d_subdirs) {
-		struct dentry *d = list_entry(node, struct dentry, d_child);
+		struct dentry *d = list_entry(node, struct dentry, d_u.d_child);
 		list_del_init(node);
 
 		if (d->d_inode) {

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] shrinks dentry struct
  2005-11-29 15:22   ` [PATCH] shrinks dentry struct Eric Dumazet
@ 2005-11-30  2:06     ` Paul Jackson
  2005-11-30  2:14       ` Andrew Morton
  2005-12-03  1:15     ` [PATCH] remove unused blkp field in percpu_data Eric Dumazet
  2005-12-13 18:03     ` [PATCH] shrinks dentry struct Paul E. McKenney
  2 siblings, 1 reply; 22+ messages in thread
From: Paul Jackson @ 2005-11-30  2:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akpm, linux-kernel

Eric,

Would the following accomplish the same thing as your patch, to shrink
UP dentry structs back to 128 bytes, with a smaller and less intrusive
patch?

---

 include/linux/dcache.h |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

--- 2.6.15-rc2-mm1.orig/include/linux/dcache.h	2005-11-29 17:45:51.977352268 -0800
+++ 2.6.15-rc2-mm1/include/linux/dcache.h	2005-11-29 18:04:59.151307979 -0800
@@ -95,19 +95,24 @@ struct dentry {
 	struct qstr d_name;
 
 	struct list_head d_lru;		/* LRU list */
-	struct list_head d_child;	/* child of parent list */
+	union {				/* Fit 32 bit UP dentry in 128 bytes */
+		struct list_head du_child;	/* child of parent list */
+ 		struct rcu_head du_rcu;
+	} d_du;
 	struct list_head d_subdirs;	/* our children */
 	struct list_head d_alias;	/* inode alias list */
 	unsigned long d_time;		/* used by d_revalidate */
 	struct dentry_operations *d_op;
 	struct super_block *d_sb;	/* The root of the dentry tree */
 	void *d_fsdata;			/* fs-specific data */
- 	struct rcu_head d_rcu;
 	struct dcookie_struct *d_cookie; /* cookie, if any */
 	int d_mounted;
 	unsigned char d_iname[DNAME_INLINE_LEN_MIN];	/* small names */
 };
 
+#define d_child d_du.du_child
+#define d_rcu d_du.du_rcu
+
 struct dentry_operations {
 	int (*d_revalidate)(struct dentry *, struct nameidata *);
 	int (*d_hash) (struct dentry *, struct qstr *);


-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] shrinks dentry struct
  2005-11-30  2:06     ` Paul Jackson
@ 2005-11-30  2:14       ` Andrew Morton
  2005-11-30  2:43         ` Paul Jackson
  2005-11-30  6:56         ` Hugh Dickins
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2005-11-30  2:14 UTC (permalink / raw)
  To: Paul Jackson; +Cc: dada1, linux-kernel

Paul Jackson <pj@sgi.com> wrote:
>
>  Would the following accomplish the same thing as your patch, to shrink
>  UP dentry structs back to 128 bytes, with a smaller and less intrusive
>  patch?
> 
> ...
>  -	struct list_head d_child;	/* child of parent list */
>  +	union {				/* Fit 32 bit UP dentry in 128 bytes */
>  +		struct list_head du_child;	/* child of parent list */
>  + 		struct rcu_head du_rcu;
>  +	} d_du;
> ...
>   
>  +#define d_child d_du.du_child
>  +#define d_rcu d_du.du_rcu

Yes, but it's better to just do the big edit, rather than letting these
little namespace crufties accumulate over time.

Even better would be to ditch gcc-2.95.x and use an anonymous union, but
Hugh won't let me ;)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] shrinks dentry struct
  2005-11-30  2:14       ` Andrew Morton
@ 2005-11-30  2:43         ` Paul Jackson
  2005-11-30  6:56         ` Hugh Dickins
  1 sibling, 0 replies; 22+ messages in thread
From: Paul Jackson @ 2005-11-30  2:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dada1, linux-kernel

Andrew wrote:
> Yes, but it's better to just do the big edit, rather than letting these
> little namespace crufties accumulate over time.

Ok.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] shrinks dentry struct
  2005-11-30  2:14       ` Andrew Morton
  2005-11-30  2:43         ` Paul Jackson
@ 2005-11-30  6:56         ` Hugh Dickins
  1 sibling, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2005-11-30  6:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Jackson, dada1, linux-kernel

On Tue, 29 Nov 2005, Andrew Morton wrote:
> Yes, but it's better to just do the big edit, rather than letting these
> little namespace crufties accumulate over time.
> 
> Even better would be to ditch gcc-2.95.x and use an anonymous union, but
> Hugh won't let me ;)

Oh, I let you, but not for me (I don't want the eternal blame).

Hugh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-29 14:49             ` Steven Rostedt
@ 2005-11-30 14:41               ` Grzegorz Nosek
  2005-11-30 15:14                 ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Grzegorz Nosek @ 2005-11-30 14:41 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: vserver, linux-kernel, Andrew Morton

2005/11/29, Steven Rostedt <rostedt@goodmis.org>:
>
> What you are showing, would probably show up by others if this were a
> vanilla kernel issue.  I don't have an 8 way machine, just 2 way, but the
> vanilla kernel is being used on many 8 ways out there, so I think you are
> right that this _is_ a vserver issue.

Yeah, I guess so. I also noticed that running an older build (w/o ACPI
so it sees only 2 CPUs due to lack of HT - it's a dual Xeon HT machine
so there are 4 logical CPUs) seems a bit more stable, but it happens
there too.

>
> Unless, of course, that the vserver is producing an obscure race in the
> vanilla kernel that normal operations would seldom have.  Just like the
> PREEMPT_RT patch has discovered many race conditions that were in the
> vanilla kernel but were not often a problem.
>

I'm not using preemption. What made me just stare in wonder was when I
added a check in do_task_stat at the very beginning to the effect of:

if (!task) {
 printk(...);
 return -ENOENT;
}

/* dereference task as usual */

I *still* got the oops (and no message got logged). So either it is
used before the entry point (there is an occurrence of
sizeof(task->comm) but that should be statically determined by the
compiler, right?) or it is set to NULL in some magical way between the
check and usage (yep, it's still a race but the window should be
smaller I think).

The only place I can find a proc_inode.task field set to NULL is in
proc_pid_make_inode(). However, it is set to the value of task
parameter just a few instructions later. Am I right? Or can
proc_pid_make_inode get passed a NULL pointer?

I'm lost. Any assistance will be invaluable.

Best regards,
 Grzegorz Nosek

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-30 14:41               ` Grzegorz Nosek
@ 2005-11-30 15:14                 ` Steven Rostedt
  2005-11-30 15:29                   ` Grzegorz Nosek
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2005-11-30 15:14 UTC (permalink / raw)
  To: Grzegorz Nosek; +Cc: vserver, linux-kernel, Andrew Morton

On Wed, 2005-11-30 at 15:41 +0100, Grzegorz Nosek wrote:
> 
> I'm lost. Any assistance will be invaluable.

OK, Remove your patches, run the system where you can capture the log,
and provide a full output of the oops.  Make sure you have
CONFIG_KALLSYMS set.

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-30 15:14                 ` Steven Rostedt
@ 2005-11-30 15:29                   ` Grzegorz Nosek
  2005-11-30 16:25                     ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Grzegorz Nosek @ 2005-11-30 15:29 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: vserver, linux-kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 283 bytes --]

2005/11/30, Steven Rostedt <rostedt@goodmis.org>:
>
> OK, Remove your patches, run the system where you can capture the log,
> and provide a full output of the oops.  Make sure you have
> CONFIG_KALLSYMS set.
>

OK, attached an oops from netconsole.

TIA,
 Grzegorz Nosek

[-- Attachment #2: oops.s35 --]
[-- Type: application/octet-stream, Size: 8426 bytes --]

Nov 27 00:15:26 s35 [43281574.240000] Unable to handle kernel NULL pointer dereference
Nov 27 00:15:26 s35 at virtual address 00000000 
Nov 27 00:15:26 s35 [43281574.240000]  printing eip: 
Nov 27 00:15:26 s35 [43281574.240000] a01b50eb 
Nov 27 00:15:26 s35 [43281574.240000] *pde = 00000000 
Nov 27 00:15:26 s35 [43281574.240000] Oops: 0000 [#1] 
Nov 27 00:15:26 s35 [43281574.240000] SMP 
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] Modules linked in:
Nov 27 00:15:26 s35 ipt_owner
Nov 27 00:15:26 s35 ipt_state
Nov 27 00:15:26 s35 iptable_filter
Nov 27 00:15:26 s35 netconsole
Nov 27 00:15:26 s35 uhci_hcd
Nov 27 00:15:26 s35 ohci_hcd
Nov 27 00:15:26 s35 ehci_hcd
Nov 27 00:15:26 s35 usbcore
Nov 27 00:15:26 s35 ip_conntrack_ftp
Nov 27 00:15:26 s35 ip_conntrack
Nov 27 00:15:26 s35 forcedeth
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] CPU:    1 
Nov 27 00:15:26 s35 [43281574.240000] EIP:    0060:[<a01b50eb>]    Not tainted VLI 
Nov 27 00:15:26 s35 [43281574.240000] EFLAGS: 00010257   (2.6.14.2amd64smp.17)  
Nov 27 00:15:26 s35 [43281574.240000] EIP is at do_task_stat+0x8b/0x890 
Nov 27 00:15:26 s35 [43281574.240000] eax: 00000000   ebx: 00000000   ecx: a0601700   edx: c804ad48 
Nov 27 00:15:26 s35 [43281574.240000] esi: b3fbe000   edi: f666aa70   ebp: d7e65f20   esp: d7e65da0 
Nov 27 00:15:26 s35 [43281574.240000] ds: 007b   es: 007b   ss: 0068 
Nov 27 00:15:26 s35 [43281574.240000] Process pidof (pid: 4723, threadinfo=d7e64000 task=e24e7550)
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] Stack: 
Nov 27 00:15:26 s35 a01b1e2e 
Nov 27 00:15:26 s35 f666aa70 
Nov 27 00:15:26 s35 d7e65f28 
Nov 27 00:15:26 s35 a8cab11c 
Nov 27 00:15:26 s35 d7e65e24 
Nov 27 00:15:26 s35 d7e65de8 
Nov 27 00:15:26 s35 a0184934 
Nov 27 00:15:26 s35 d7e65e24 
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]        
Nov 27 00:15:26 s35 a8cab544 
Nov 27 00:15:26 s35 d7e65de8 
Nov 27 00:15:26 s35 a019090d 
Nov 27 00:15:26 s35 a8cab544 
Nov 27 00:15:26 s35 a0720a00 
Nov 27 00:15:26 s35 d7e65df8 
Nov 27 00:15:26 s35 a2227140 
Nov 27 00:15:26 s35 00000000 
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]        
Nov 27 00:15:26 s35 00000000 
Nov 27 00:15:26 s35 d7e65e2c 
Nov 27 00:15:26 s35 d7e65e48 
Nov 27 00:15:26 s35 a0185664 
Nov 27 00:15:26 s35 a8cab544 
Nov 27 00:15:26 s35 d7e65e2c 
Nov 27 00:15:26 s35 d7e65e24 
Nov 27 00:15:26 s35 c94ff00b 
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] Call Trace: 
Nov 27 00:15:26 s35 [43281574.240000]  [<a0103e9f>] 
Nov 27 00:15:26 s35 show_stack+0x7f/0xa0
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a010403d>] 
Nov 27 00:15:26 s35 show_registers+0x15d/0x1d0
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a0104252>] 
Nov 27 00:15:26 s35 die+0x112/0x1c0
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a055c2b9>] 
Nov 27 00:15:26 s35 do_page_fault+0x3d9/0x650
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a0103b53>] 
Nov 27 00:15:26 s35 error_code+0x4f/0x54
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a01b5940>] 
Nov 27 00:15:26 s35 proc_tgid_stat+0x20/0x30
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a01b0f75>] 
Nov 27 00:15:26 s35 proc_info_read+0x55/0xa0
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a0174d68>] 
Nov 27 00:15:26 s35 vfs_read+0x198/0x1a0
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a017506b>] 
Nov 27 00:15:26 s35 sys_read+0x4b/0x80
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000]  [<a010302d>] 
Nov 27 00:15:26 s35 syscall_call+0x7/0xb
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] Code: 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 c7 
Nov 27 00:15:26 s35 85 
Nov 27 00:15:26 s35 6c 
Nov 27 00:15:26 s35 ff 
Nov 27 00:15:26 s35 ff 
Nov 27 00:15:26 s35 ff 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 8b 
Nov 27 00:15:26 s35 07 
Nov 27 00:15:26 s35 8b 
Nov 27 00:15:26 s35 9f 
Nov 27 00:15:26 s35 84 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 25 
Nov 27 00:15:26 s35 8f 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 83 
Nov 27 00:15:26 s35 e3 
Nov 27 00:15:26 s35 30 
Nov 27 00:15:26 s35 09 
Nov 27 00:15:26 s35 d8 
Nov 27 00:15:26 s35 eb 
Nov 27 00:15:26 s35 05 
Nov 27 00:15:26 s35 83 
Nov 27 00:15:26 s35 c1 
Nov 27 00:15:26 s35 04 
Nov 27 00:15:26 s35 d1 
Nov 27 00:15:26 s35 e8 
Nov 27 00:15:26 s35 75 
Nov 27 00:15:26 s35 f9 
Nov 27 00:15:26 s35 8b 
Nov 27 00:15:26 s35 01 
Nov 27 00:15:26 s35 unparseable log message: "<0f> "
Nov 27 00:15:26 s35 b6 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 c7 
Nov 27 00:15:26 s35 45 
Nov 27 00:15:26 s35 c8 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 c7 
Nov 27 00:15:26 s35 45 
Nov 27 00:15:26 s35 cc 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35 c7 
Nov 27 00:15:26 s35 45 
Nov 27 00:15:26 s35 d0 
Nov 27 00:15:26 s35 00 
Nov 27 00:15:26 s35  
Nov 27 00:15:26 s35 [43281574.240000] History:	SEQ:  3ddca14	NR_CPUS: 8 
Nov 27 00:15:26 s35 [43281574.240000] (#c9ae,*0):a04d546e set_vx_info f6e48000[#830,190.71] @f4fcf4e8 
Nov 27 00:15:26 s35 [43281574.240000] (#c964,*1):a013ac82 release_vx_info f6e48000[#830,190.74] @c5cdb030 
Nov 27 00:15:26 s35 [43281574.240000] (#ca13,*0):a04d40b2 clr_vx_info f6e48000[#830,188.71] @db738068 
Nov 27 00:15:26 s35 [43281574.240000] (#ca14,*1):a0104140 oops  
Nov 27 00:15:26 s35 [43281574.240000] (#ca12,*0):a04d40b2 clr_vx_info f6e48000[#830,189.71] @db739b68 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0f,*1):a011c57c clr_vx_info f6e1e000[#831,151.39] @f6f1bad0 
Nov 27 00:15:26 s35 [43281574.240000] (#ca11,*0):a04d40b2 clr_vx_info f6e48000[#830,190.71] @ad3a16e8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0e,*1):a011c45c set_vx_info f6e1e000[#831,150.39] @f6f1a210 
Nov 27 00:15:26 s35 [43281574.240000] (#ca10,*0):a04d40b2 clr_vx_info f6e48000[#830,191.71] @ed1a7b68 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0d,*1):a04d40b2 clr_vx_info f6e48000[#830,192.71] @b6b98ae8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca03,*0):a04d4556 set_vx_info f6e48000[#830,190.71] @ba389268 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0c,*1):a011ddac claim_vx_info f6e1e000[#831,150.38] @e24e7550 
Nov 27 00:15:26 s35 [43281574.240000] (#ca02,*0):a04d40b2 clr_vx_info f6e48000[#830,191.71] @b6b98d68 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0b,*1):a011c45c set_vx_info f6e1e000[#831,149.38] @f6f1bad0 
Nov 27 00:15:26 s35 [43281574.240000] (#ca01,*0):a04d546e set_vx_info f6e48000[#830,190.71] @b6b98d68 
Nov 27 00:15:26 s35 [43281574.240000] (#ca0a,*1):a011d38c init_vx_info f6e1e000[#831,148.38] @e24e79f8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca00,*0):a04d40b2 clr_vx_info f6e48000[#830,191.71] @f4fcf4e8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca09,*1):a011bf10 clr_vx_info f6e1e000[#831,149.38] @e24e79f8 
Nov 27 00:15:26 s35 [43281574.240000] (#c9ff,*0):a04d40b2 clr_vx_info f6e48000[#830,192.71] @f56bbde8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca08,*1):a04d546e set_vx_info f6e48000[#830,191.71] @b6b98ae8 
Nov 27 00:15:26 s35 [43281574.240000] (#c9fe,*0):a04d546e set_vx_info f6e48000[#830,191.71] @f56bbde8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca07,*1):a04d40b2 clr_vx_info f6e48000[#830,192.71] @cbec5068 
Nov 27 00:15:26 s35 [43281574.240000] (#c9fd,*0):a04d546e set_vx_info f6e48000[#830,190.71] @f4fcf4e8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca06,*1):a04d40b2 clr_vx_info f6e48000[#830,193.71] @cbec5ba8 
Nov 27 00:15:26 s35 [43281574.240000] (#c9fc,*0):a04d4556 set_vx_info f6e48000[#830,189.71] @ad3a16e8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca05,*1):a04d546e set_vx_info f6e48000[#830,192.71] @cbec5ba8 
Nov 27 00:15:26 s35 [43281574.240000] (#c9fb,*0):a04d40b2 clr_vx_info f6e48000[#830,190.71] @ad3a16e8 
Nov 27 00:15:26 s35 [43281574.240000] (#ca04,*1):a04d546e set_vx_info f6e48000[#830,191.71] @cbec5068 
Nov 27 00:15:26 s35 [43281574.240000] (#c9f7,*0):a011c57c clr_vx_info f6e1e000[#831,148.37] @f6f1a790 
Nov 27 00:15:26 s35 [43281574.240000] (#c9fa,*1):a011ddac claim_vx_info f6e1e000[#831,149.37] @e4856550 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-30 15:29                   ` Grzegorz Nosek
@ 2005-11-30 16:25                     ` Steven Rostedt
  2005-11-30 17:23                       ` Grzegorz Nosek
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2005-11-30 16:25 UTC (permalink / raw)
  To: Grzegorz Nosek; +Cc: vserver, linux-kernel, Andrew Morton

(Andrew, this will be the last email that I include you on.  I'm taking
you off unless you want to stay on this thread, and say so.  I figure
that you get enough spam without having to read through this. I'll
obviously add you back if this results in a patch.)

On Wed, 2005-11-30 at 16:29 +0100, Grzegorz Nosek wrote:
> 2005/11/30, Steven Rostedt <rostedt@goodmis.org>:
> >
> > OK, Remove your patches, run the system where you can capture the log,
> > and provide a full output of the oops.  Make sure you have
> > CONFIG_KALLSYMS set.
> >
> 
> OK, attached an oops from netconsole.
> 

The oops happened at address a01b50eb.  Could you go into the compiled
directory run gdb on vmlinux and type li *0xa01b50eb and show what you
get.

For example:

~/work/ernie/linux-2.6.15-rc2-git5$ gdb vmlinux
GNU gdb 6.3-debian
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-linux"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) li *0xc01019e0
0xc01019e0 is in sys_clone (arch/i386/kernel/process.c:768).
763     {
764             return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
765     }
766
767     asmlinkage int sys_clone(struct pt_regs regs)
768     {
769             unsigned long clone_flags;
770             unsigned long newsp;
771             int __user *parent_tidptr, *child_tidptr;
772
(gdb)

Obviously, use "li *0xa01b50eb" instead of 0xc01019e0.

Now if you get an error in starting gdb like:

$ gdb vmlinux
GNU gdb 6.3.90_20051119-debian
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) li *0xa01b50eb
No symbol table is loaded.  Use the "file" command.
(gdb)

Notice the (no debugging symbols found).  Then start up make menuconfig,
goto "Kernel Hacking", set "Kernel Debugging" and "Compile the kernel
with debug info".  And try again.  It may also be helpful to have
"Compile the kernel with frame pointers" also set.  If you do this, you
will probably need to use something other than the 0xa01b50eb.  Look at
the output of the oops and see the following:

Nov 27 00:15:26 s35 [43281574.240000] CPU:    1
Nov 27 00:15:26 s35 [43281574.240000] EIP:    0060:[<a01b50eb>]    Not tainted VLI
Nov 27 00:15:26 s35 [43281574.240000] EFLAGS: 00010257   (2.6.14.2amd64smp.17)

That EIP is what I want.

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-30 16:25                     ` Steven Rostedt
@ 2005-11-30 17:23                       ` Grzegorz Nosek
  2005-12-01 20:38                         ` Grzegorz Nosek
  0 siblings, 1 reply; 22+ messages in thread
From: Grzegorz Nosek @ 2005-11-30 17:23 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: vserver, linux-kernel

2005/11/30, Steven Rostedt <rostedt@goodmis.org>:
> (Andrew, this will be the last email that I include you on.  I'm taking
> you off unless you want to stay on this thread, and say so.  I figure
> that you get enough spam without having to read through this. I'll
> obviously add you back if this results in a patch.)

(removed Andrew from the CC as well)

>
> On Wed, 2005-11-30 at 16:29 +0100, Grzegorz Nosek wrote:
> > 2005/11/30, Steven Rostedt <rostedt@goodmis.org>:
> > >
> > > OK, Remove your patches, run the system where you can capture the log,
> > > and provide a full output of the oops.  Make sure you have
> > > CONFIG_KALLSYMS set.
> > >
> >
> > OK, attached an oops from netconsole.
> >
>
> The oops happened at address a01b50eb.  Could you go into the compiled
> directory run gdb on vmlinux and type li *0xa01b50eb and show what you
> get.
>

OK, will send it as soon as I get my hands on it (I'm building a new
kernel at the moment with full debug info). In the meantime, if you
have a copy of fs/proc/array.o handy, have a look at do_task_stat
dissassembly and search for movzbl (%eax), %eax. Regardless of my
kernel config, architecture or whatever, the oops is in that
instruction (clearly a NULL pointer dereference). From some previous
debug build I found out (via objdump -dl) that it's apparently at the
entry point of the get_task_stat inline function.

Best regards,
 Grzegorz Nosek

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] race condition in procfs
  2005-11-30 17:23                       ` Grzegorz Nosek
@ 2005-12-01 20:38                         ` Grzegorz Nosek
  0 siblings, 0 replies; 22+ messages in thread
From: Grzegorz Nosek @ 2005-12-01 20:38 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: vserver, linux-kernel

> > 2005/11/30, Steven Rostedt <rostedt@goodmis.org>:
> >
> > The oops happened at address a01b50eb.  Could you go into the compiled
> > directory run gdb on vmlinux and type li *0xa01b50eb and show what you
> > get.

It turned out to be a bug in the vserver patches. Sent to maintainer.
As it's not a mainline issue, I'm not bothering you any more. Thanks
for debugging tips :)

Best regards,
 Grzegorz Nosek

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] remove unused blkp field in percpu_data
  2005-11-29 15:22   ` [PATCH] shrinks dentry struct Eric Dumazet
  2005-11-30  2:06     ` Paul Jackson
@ 2005-12-03  1:15     ` Eric Dumazet
  2005-12-13 18:03     ` [PATCH] shrinks dentry struct Paul E. McKenney
  2 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2005-12-03  1:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 477 bytes --]

[PATCH] remove unused blkp field in percpu_data

I found that blkp field was not used in kernel tree.

As most of the times NR_CPUS is a power of two and kmalloc() memory blocks 
too, this extra field basically doubles the memory space allocated in 
__alloc_percpu() to store the 'struct percpu_data'

(for example, if NR_CPUS=8 on i386, kmalloc(4*8+4) returns a 64 bytes block 
instead of a 32 bytes block after this patch)


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: percpu_data.patch --]
[-- Type: text/plain, Size: 231 bytes --]

--- linux-2.6/include/linux/percpu.h	2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6-ed/linux/percpu.h	2005-12-03 01:57:23.000000000 +0100
@@ -19,7 +19,6 @@
 
 struct percpu_data {
 	void *ptrs[NR_CPUS];
-	void *blkp;
 };
 
 /* 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] shrinks dentry struct
  2005-11-29 15:22   ` [PATCH] shrinks dentry struct Eric Dumazet
  2005-11-30  2:06     ` Paul Jackson
  2005-12-03  1:15     ` [PATCH] remove unused blkp field in percpu_data Eric Dumazet
@ 2005-12-13 18:03     ` Paul E. McKenney
  2005-12-13 18:24       ` Eric Dumazet
  2 siblings, 1 reply; 22+ messages in thread
From: Paul E. McKenney @ 2005-12-13 18:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrew Morton, linux-kernel

On Tue, Nov 29, 2005 at 04:22:00PM +0100, Eric Dumazet wrote:
> Hi Andrew
> 
> Could you add this patch to mm ?
> 
> Thank you
> 
> [PATCH] shrinks dentry struct
> 
> Some long time ago, dentry struct was carefully tuned so that on 32 bits 
> UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple 
> of memory cache lines.
> 
> Then RCU was added and dentry struct enlarged by two pointers, with nice 
> results for SMP, but not so good on UP, because breaking the above tuning 
> (128 + 8 = 136 bytes)
> 
> This patch reverts this unwanted side effect, by using an union (d_u), 
> where d_rcu and d_child are placed so that these two fields can share their 
> memory needs.
> 
> At the time d_free() is called (and d_rcu is really used), d_child is known 
> to be empty and not touched by the dentry freeing.
> 
> Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so 
> the previous content of d_child is not needed if said dentry was unhashed 
> but still accessed by a CPU because of RCU constraints)
> 
> As dentry cache easily contains millions of entries, a size reduction is 
> worth the extra complexity of the ugly C union.

Looks sound to me!  Some opportunities for simplification below.

(Please accept my apologies for the delay -- some diversions turned out
to be more consuming than I had expected.)

							Thanx, Paul

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 

> --- linux-2.6.15-rc3/include/linux/dcache.h	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/include/linux/dcache.h	2005-11-29 12:04:54.000000000 +0100
> @@ -95,14 +95,19 @@
>  	struct qstr d_name;
>  
>  	struct list_head d_lru;		/* LRU list */
> -	struct list_head d_child;	/* child of parent list */
> +	/*
> +	 * d_child and d_rcu can share memory
> +	 */
> +	union {
> +		struct list_head d_child;	/* child of parent list */
> +	 	struct rcu_head d_rcu;
> +	} d_u;
>  	struct list_head d_subdirs;	/* our children */
>  	struct list_head d_alias;	/* inode alias list */
>  	unsigned long d_time;		/* used by d_revalidate */
>  	struct dentry_operations *d_op;
>  	struct super_block *d_sb;	/* The root of the dentry tree */
>  	void *d_fsdata;			/* fs-specific data */
> - 	struct rcu_head d_rcu;
>  	struct dcookie_struct *d_cookie; /* cookie, if any */
>  	int d_mounted;
>  	unsigned char d_iname[DNAME_INLINE_LEN_MIN];	/* small names */
> --- linux-2.6.15-rc3/fs/dcache.c	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/dcache.c	2005-11-29 15:07:10.000000000 +0100
> @@ -71,7 +71,7 @@
>  
>  static void d_callback(struct rcu_head *head)
>  {
> -	struct dentry * dentry = container_of(head, struct dentry, d_rcu);
> +	struct dentry * dentry = container_of(head, struct dentry, d_u.d_rcu);
>  
>  	if (dname_external(dentry))
>  		kfree(dentry->d_name.name);
> @@ -86,7 +86,7 @@
>  {
>  	if (dentry->d_op && dentry->d_op->d_release)
>  		dentry->d_op->d_release(dentry);
> - 	call_rcu(&dentry->d_rcu, d_callback);
> + 	call_rcu(&dentry->d_u.d_rcu, d_callback);
>  }
>  
>  /*
> @@ -193,7 +193,7 @@
>    			list_del(&dentry->d_lru);
>    			dentry_stat.nr_unused--;
>    		}
> -  		list_del(&dentry->d_child);
> +  		list_del(&dentry->d_u.d_child);
>  		dentry_stat.nr_dentry--;	/* For d_free, below */
>  		/*drops the locks, at that point nobody can reach this dentry */
>  		dentry_iput(dentry);
> @@ -367,7 +367,7 @@
>  	struct dentry * parent;
>  
>  	__d_drop(dentry);
> -	list_del(&dentry->d_child);
> +	list_del(&dentry->d_u.d_child);
>  	dentry_stat.nr_dentry--;	/* For d_free, below */
>  	dentry_iput(dentry);
>  	parent = dentry->d_parent;
> @@ -518,7 +518,7 @@
>  resume:
>  	while (next != &this_parent->d_subdirs) {
>  		struct list_head *tmp = next;
> -		struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
> +		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
>  		next = tmp->next;
>  		/* Have we found a mount point ? */
>  		if (d_mountpoint(dentry))
> @@ -532,7 +532,7 @@
>  	 * All done at this level ... ascend and resume the search.
>  	 */
>  	if (this_parent != parent) {
> -		next = this_parent->d_child.next; 
> +		next = this_parent->d_u.d_child.next; 
>  		this_parent = this_parent->d_parent;
>  		goto resume;
>  	}
> @@ -569,7 +569,7 @@
>  resume:
>  	while (next != &this_parent->d_subdirs) {
>  		struct list_head *tmp = next;
> -		struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
> +		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
>  		next = tmp->next;
>  
>  		if (!list_empty(&dentry->d_lru)) {
> @@ -610,7 +610,7 @@
>  	 * All done at this level ... ascend and resume the search.
>  	 */
>  	if (this_parent != parent) {
> -		next = this_parent->d_child.next; 
> +		next = this_parent->d_u.d_child.next; 
>  		this_parent = this_parent->d_parent;
>  #ifdef DCACHE_DEBUG
>  printk(KERN_DEBUG "select_parent: ascending to %s/%s, found=%d\n",
> @@ -753,12 +753,12 @@
>  		dentry->d_parent = dget(parent);
>  		dentry->d_sb = parent->d_sb;
>  	} else {
> -		INIT_LIST_HEAD(&dentry->d_child);
> +		INIT_LIST_HEAD(&dentry->d_u.d_child);
>  	}
>  
>  	spin_lock(&dcache_lock);
>  	if (parent)
> -		list_add(&dentry->d_child, &parent->d_subdirs);
> +		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
>  	dentry_stat.nr_dentry++;
>  	spin_unlock(&dcache_lock);
>  
> @@ -1310,8 +1310,8 @@
>  	/* Unhash the target: dput() will then get rid of it */
>  	__d_drop(target);
>  
> -	list_del(&dentry->d_child);
> -	list_del(&target->d_child);
> +	list_del(&dentry->d_u.d_child);
> +	list_del(&target->d_u.d_child);
>  
>  	/* Switch the names.. */
>  	switch_names(dentry, target);
> @@ -1322,15 +1322,15 @@
>  	if (IS_ROOT(dentry)) {
>  		dentry->d_parent = target->d_parent;
>  		target->d_parent = target;
> -		INIT_LIST_HEAD(&target->d_child);
> +		INIT_LIST_HEAD(&target->d_u.d_child);
>  	} else {
>  		do_switch(dentry->d_parent, target->d_parent);
>  
>  		/* And add them back to the (new) parent lists */
> -		list_add(&target->d_child, &target->d_parent->d_subdirs);
> +		list_add(&target->d_u.d_child, &target->d_parent->d_subdirs);
>  	}
>  
> -	list_add(&dentry->d_child, &dentry->d_parent->d_subdirs);
> +	list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
>  	spin_unlock(&target->d_lock);
>  	spin_unlock(&dentry->d_lock);
>  	write_sequnlock(&rename_lock);
> @@ -1568,7 +1568,7 @@
>  resume:
>  	while (next != &this_parent->d_subdirs) {
>  		struct list_head *tmp = next;
> -		struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
> +		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
>  		next = tmp->next;
>  		if (d_unhashed(dentry)||!dentry->d_inode)
>  			continue;
> @@ -1579,7 +1579,7 @@
>  		atomic_dec(&dentry->d_count);
>  	}
>  	if (this_parent != root) {
> -		next = this_parent->d_child.next; 
> +		next = this_parent->d_u.d_child.next; 
>  		atomic_dec(&this_parent->d_count);
>  		this_parent = this_parent->d_parent;
>  		goto resume;
> --- linux-2.6.15-rc3/fs/autofs4/autofs_i.h	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/autofs4/autofs_i.h	2005-11-29 12:04:54.000000000 +0100
> @@ -209,7 +209,7 @@
>  	struct dentry *child;
>  	int ret = 0;
>  
> -	list_for_each_entry(child, &dentry->d_subdirs, d_child)
> +	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
>  		if (simple_positive(child))
>  			goto out;
>  	ret = 1;
> --- linux-2.6.15-rc3/fs/autofs4/expire.c	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/autofs4/expire.c	2005-11-29 12:04:54.000000000 +0100
> @@ -105,7 +105,7 @@
>  	next = this_parent->d_subdirs.next;
>  resume:
>  	while (next != &this_parent->d_subdirs) {
> -		struct dentry *dentry = list_entry(next, struct dentry, d_child);
> +		struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
>  
>  		/* Negative dentry - give up */
>  		if (!simple_positive(dentry)) {
> @@ -138,7 +138,7 @@
>  	}
>  
>  	if (this_parent != top) {
> -		next = this_parent->d_child.next;
> +		next = this_parent->d_u.d_child.next;
>  		this_parent = this_parent->d_parent;
>  		goto resume;
>  	}
> @@ -163,7 +163,7 @@
>  	next = this_parent->d_subdirs.next;
>  resume:
>  	while (next != &this_parent->d_subdirs) {
> -		struct dentry *dentry = list_entry(next, struct dentry, d_child);
> +		struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
>  
>  		/* Negative dentry - give up */
>  		if (!simple_positive(dentry)) {
> @@ -199,7 +199,7 @@
>  	}
>  
>  	if (this_parent != parent) {
> -		next = this_parent->d_child.next;
> +		next = this_parent->d_u.d_child.next;
>  		this_parent = this_parent->d_parent;
>  		goto resume;
>  	}
> @@ -238,7 +238,7 @@
>  	/* On exit from the loop expire is set to a dgot dentry
>  	 * to expire or it's NULL */
>  	while ( next != &root->d_subdirs ) {
> -		struct dentry *dentry = list_entry(next, struct dentry, d_child);
> +		struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
>  
>  		/* Negative dentry - give up */
>  		if ( !simple_positive(dentry) ) {
> @@ -302,7 +302,7 @@
>  			expired, (int)expired->d_name.len, expired->d_name.name);
>  		spin_lock(&dcache_lock);
>  		list_del(&expired->d_parent->d_subdirs);
> -		list_add(&expired->d_parent->d_subdirs, &expired->d_child);
> +		list_add(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
>  		spin_unlock(&dcache_lock);
>  		return expired;
>  	}
> --- linux-2.6.15-rc3/fs/autofs4/inode.c	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/autofs4/inode.c	2005-11-29 12:04:54.000000000 +0100
> @@ -91,7 +91,7 @@
>  	next = this_parent->d_subdirs.next;
>  resume:
>  	while (next != &this_parent->d_subdirs) {
> -		struct dentry *dentry = list_entry(next, struct dentry, d_child);
> +		struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
>  
>  		/* Negative dentry - don`t care */
>  		if (!simple_positive(dentry)) {
> @@ -117,7 +117,7 @@
>  	if (this_parent != sbi->root) {
>  		struct dentry *dentry = this_parent;
>  
> -		next = this_parent->d_child.next;
> +		next = this_parent->d_u.d_child.next;
>  		this_parent = this_parent->d_parent;
>  		spin_unlock(&dcache_lock);
>  		DPRINTK("parent dentry %p %.*s",
> --- linux-2.6.15-rc3/fs/coda/cache.c	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/coda/cache.c	2005-11-29 12:05:21.000000000 +0100
> @@ -93,7 +93,7 @@
>  	spin_lock(&dcache_lock);
>  	list_for_each(child, &parent->d_subdirs)
>  	{
> -		de = list_entry(child, struct dentry, d_child);
> +		de = list_entry(child, struct dentry, d_u.d_child);

The above list_entry() could be combined with the earlier list_for_each()
using list_for_each_entry().

>  		/* don't know what to do with negative dentries */
>  		if ( ! de->d_inode ) 
>  			continue;
> --- linux-2.6.15-rc3/fs/libfs.c	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/libfs.c	2005-11-29 14:58:51.000000000 +0100
> @@ -93,16 +93,16 @@
>  			loff_t n = file->f_pos - 2;
>  
>  			spin_lock(&dcache_lock);
> -			list_del(&cursor->d_child);
> +			list_del(&cursor->d_u.d_child);
>  			p = file->f_dentry->d_subdirs.next;
>  			while (n && p != &file->f_dentry->d_subdirs) {
>  				struct dentry *next;
> -				next = list_entry(p, struct dentry, d_child);
> +				next = list_entry(p, struct dentry, d_u.d_child);

Should be possible to combine the list_entry() and the while() into
list_for_each_entry().

>  				if (!d_unhashed(next) && next->d_inode)
>  					n--;
>  				p = p->next;
>  			}
> -			list_add_tail(&cursor->d_child, p);
> +			list_add_tail(&cursor->d_u.d_child, p);
>  			spin_unlock(&dcache_lock);
>  		}
>  	}
> @@ -126,7 +126,7 @@
>  {
>  	struct dentry *dentry = filp->f_dentry;
>  	struct dentry *cursor = filp->private_data;
> -	struct list_head *p, *q = &cursor->d_child;
> +	struct list_head *p, *q = &cursor->d_u.d_child;
>  	ino_t ino;
>  	int i = filp->f_pos;
>  
> @@ -153,7 +153,7 @@
>  			}
>  			for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
>  				struct dentry *next;
> -				next = list_entry(p, struct dentry, d_child);
> +				next = list_entry(p, struct dentry, d_u.d_child);

Ditto...

>  				if (d_unhashed(next) || !next->d_inode)
>  					continue;
>  
> @@ -261,7 +261,7 @@
>  	int ret = 0;
>  
>  	spin_lock(&dcache_lock);
> -	list_for_each_entry(child, &dentry->d_subdirs, d_child)
> +	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
>  		if (simple_positive(child))
>  			goto out;
>  	ret = 1;
> --- linux-2.6.15-rc3/fs/ncpfs/dir.c	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/ncpfs/dir.c	2005-11-29 14:59:31.000000000 +0100
> @@ -365,7 +365,7 @@
>  	spin_lock(&dcache_lock);
>  	next = parent->d_subdirs.next;
>  	while (next != &parent->d_subdirs) {
> -		dent = list_entry(next, struct dentry, d_child);
> +		dent = list_entry(next, struct dentry, d_u.d_child);

Ditto...

>  		if ((unsigned long)dent->d_fsdata == fpos) {
>  			if (dent->d_inode)
>  				dget_locked(dent);
> --- linux-2.6.15-rc3/fs/ncpfs/ncplib_kernel.h	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/ncpfs/ncplib_kernel.h	2005-11-29 14:59:31.000000000 +0100
> @@ -196,7 +196,7 @@
>  	spin_lock(&dcache_lock);
>  	next = parent->d_subdirs.next;
>  	while (next != &parent->d_subdirs) {
> -		dentry = list_entry(next, struct dentry, d_child);
> +		dentry = list_entry(next, struct dentry, d_u.d_child);

Ditto...

>  
>  		if (dentry->d_fsdata == NULL)
>  			ncp_age_dentry(server, dentry);
> @@ -218,7 +218,7 @@
>  	spin_lock(&dcache_lock);
>  	next = parent->d_subdirs.next;
>  	while (next != &parent->d_subdirs) {
> -		dentry = list_entry(next, struct dentry, d_child);
> +		dentry = list_entry(next, struct dentry, d_u.d_child);

Ditto...

>  		dentry->d_fsdata = NULL;
>  		ncp_age_dentry(server, dentry);
>  		next = next->next;
> --- linux-2.6.15-rc3/fs/smbfs/cache.c	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/smbfs/cache.c	2005-11-29 14:59:47.000000000 +0100
> @@ -66,7 +66,7 @@
>  	spin_lock(&dcache_lock);
>  	next = parent->d_subdirs.next;
>  	while (next != &parent->d_subdirs) {
> -		dentry = list_entry(next, struct dentry, d_child);
> +		dentry = list_entry(next, struct dentry, d_u.d_child);

Ditto...

>  		dentry->d_fsdata = NULL;
>  		smb_age_dentry(server, dentry);
>  		next = next->next;
> @@ -100,7 +100,7 @@
>  	spin_lock(&dcache_lock);
>  	next = parent->d_subdirs.next;
>  	while (next != &parent->d_subdirs) {
> -		dent = list_entry(next, struct dentry, d_child);
> +		dent = list_entry(next, struct dentry, d_u.d_child);

Ditto...

>  		if ((unsigned long)dent->d_fsdata == fpos) {
>  			if (dent->d_inode)
>  				dget_locked(dent);
> --- linux-2.6.15-rc3/kernel/cpuset.c	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/kernel/cpuset.c	2005-11-29 15:01:01.000000000 +0100
> @@ -304,7 +304,7 @@
>  	spin_lock(&dcache_lock);
>  	node = dentry->d_subdirs.next;
>  	while (node != &dentry->d_subdirs) {
> -		struct dentry *d = list_entry(node, struct dentry, d_child);
> +		struct dentry *d = list_entry(node, struct dentry, d_u.d_child);

Ditto...

>  		list_del_init(node);
>  		if (d->d_inode) {
>  			d = dget_locked(d);
> @@ -316,7 +316,7 @@
>  		}
>  		node = dentry->d_subdirs.next;
>  	}
> -	list_del_init(&dentry->d_child);
> +	list_del_init(&dentry->d_u.d_child);
>  	spin_unlock(&dcache_lock);
>  	remove_dir(dentry);
>  }
> --- linux-2.6.15-rc3/drivers/usb/core/inode.c	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/drivers/usb/core/inode.c	2005-11-29 12:04:54.000000000 +0100
> @@ -186,7 +186,7 @@
>  
>  	down(&bus->d_inode->i_sem);
>  
> -	list_for_each_entry(dev, &bus->d_subdirs, d_child)
> +	list_for_each_entry(dev, &bus->d_subdirs, d_u.d_child)
>  		if (dev->d_inode)
>  			update_dev(dev);
>  
> @@ -203,7 +203,7 @@
>  
>  	down(&root->d_inode->i_sem);
>  
> -	list_for_each_entry(bus, &root->d_subdirs, d_child) {
> +	list_for_each_entry(bus, &root->d_subdirs, d_u.d_child) {
>  		if (bus->d_inode) {
>  			switch (S_IFMT & bus->d_inode->i_mode) {
>  			case S_IFDIR:
> @@ -319,7 +319,7 @@
>  	spin_lock(&dcache_lock);
>  
>  	list_for_each(list, &dentry->d_subdirs) {
> -		struct dentry *de = list_entry(list, struct dentry, d_child);
> +		struct dentry *de = list_entry(list, struct dentry, d_u.d_child);

The list_entry() and list_for_each() could be combined.

>  		if (usbfs_positive(de)) {
>  			spin_unlock(&dcache_lock);
>  			return 0;
> --- linux-2.6.15-rc3/net/sunrpc/rpc_pipe.c	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/net/sunrpc/rpc_pipe.c	2005-11-29 15:01:01.000000000 +0100
> @@ -492,7 +492,7 @@
>  repeat:
>  	spin_lock(&dcache_lock);
>  	list_for_each_safe(pos, next, &parent->d_subdirs) {
> -		dentry = list_entry(pos, struct dentry, d_child);
> +		dentry = list_entry(pos, struct dentry, d_u.d_child);
>  		spin_lock(&dentry->d_lock);
>  		if (!d_unhashed(dentry)) {
>  			dget_locked(dentry);
> --- linux-2.6.15-rc3/security/selinux/selinuxfs.c	2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/security/selinux/selinuxfs.c	2005-11-29 15:01:09.000000000 +0100
> @@ -889,7 +889,7 @@
>  	spin_lock(&dcache_lock);
>  	node = de->d_subdirs.next;
>  	while (node != &de->d_subdirs) {
> -		struct dentry *d = list_entry(node, struct dentry, d_child);
> +		struct dentry *d = list_entry(node, struct dentry, d_u.d_child);

The list_entry() and while() could be combined.

>  		list_del_init(node);
>  
>  		if (d->d_inode) {


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] shrinks dentry struct
  2005-12-13 18:03     ` [PATCH] shrinks dentry struct Paul E. McKenney
@ 2005-12-13 18:24       ` Eric Dumazet
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2005-12-13 18:24 UTC (permalink / raw)
  To: paulmck; +Cc: Andrew Morton, linux-kernel

Paul E. McKenney a écrit :
> On Tue, Nov 29, 2005 at 04:22:00PM +0100, Eric Dumazet wrote:
> 
>>Hi Andrew
>>
>>Could you add this patch to mm ?
>>
>>Thank you
>>
>>[PATCH] shrinks dentry struct
>>
>>Some long time ago, dentry struct was carefully tuned so that on 32 bits 
>>UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple 
>>of memory cache lines.
>>
>>Then RCU was added and dentry struct enlarged by two pointers, with nice 
>>results for SMP, but not so good on UP, because breaking the above tuning 
>>(128 + 8 = 136 bytes)
>>
>>This patch reverts this unwanted side effect, by using an union (d_u), 
>>where d_rcu and d_child are placed so that these two fields can share their 
>>memory needs.
>>
>>At the time d_free() is called (and d_rcu is really used), d_child is known 
>>to be empty and not touched by the dentry freeing.
>>
>>Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so 
>>the previous content of d_child is not needed if said dentry was unhashed 
>>but still accessed by a CPU because of RCU constraints)
>>
>>As dentry cache easily contains millions of entries, a size reduction is 
>>worth the extra complexity of the ugly C union.
> 
> 
> Looks sound to me!  Some opportunities for simplification below.
> 
> (Please accept my apologies for the delay -- some diversions turned out
> to be more consuming than I had expected.)
> 
> 							Thanx, Paul
> 

Hi Paul

My patch only address the layout of dentry structure, basically a 'global 
substitute' on various places.

Adding some 'optimizations' or simplifications was not the goal, so please 
submit a patch if you feel the need for it :)

Thank you

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2005-12-13 18:25 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-11-29  7:17 [PATCH] race condition in procfs Grzegorz Nosek
2005-11-29  8:09 ` Andrew Morton
2005-11-29  8:38   ` Grzegorz Nosek
2005-11-29 13:25     ` Grzegorz Nosek
2005-11-29 14:04       ` Grzegorz Nosek
2005-11-29 14:28         ` Steven Rostedt
2005-11-29 14:39           ` Grzegorz Nosek
2005-11-29 14:49             ` Steven Rostedt
2005-11-30 14:41               ` Grzegorz Nosek
2005-11-30 15:14                 ` Steven Rostedt
2005-11-30 15:29                   ` Grzegorz Nosek
2005-11-30 16:25                     ` Steven Rostedt
2005-11-30 17:23                       ` Grzegorz Nosek
2005-12-01 20:38                         ` Grzegorz Nosek
2005-11-29 15:22   ` [PATCH] shrinks dentry struct Eric Dumazet
2005-11-30  2:06     ` Paul Jackson
2005-11-30  2:14       ` Andrew Morton
2005-11-30  2:43         ` Paul Jackson
2005-11-30  6:56         ` Hugh Dickins
2005-12-03  1:15     ` [PATCH] remove unused blkp field in percpu_data Eric Dumazet
2005-12-13 18:03     ` [PATCH] shrinks dentry struct Paul E. McKenney
2005-12-13 18:24       ` Eric Dumazet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.