* Mainline kernel OLTP performance update
@ 2009-01-13 21:10 Ma, Chinang
  2009-01-13 22:44 ` Wilcox, Matthew R
  0 siblings, 1 reply; 105+ messages in thread
From: Ma, Chinang @ 2009-01-13 21:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Tripathi, Sharad C, arjan, Wilcox, Matthew R, Kleen, Andi,
	Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang,
	Peter Xihong, Nueckel, Hubert, Chris Mason

This is the latest 2.6.29-rc1 kernel OLTP performance result. Compared to 2.6.24.2, the regression is around 3.5%.

Linux OLTP Performance summary
Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%  iowait%
2.6.24.2                1.000   21969   43425   76   24     0      0
2.6.27.2                0.973   30402   43523   74   25     0      1
2.6.29-rc1              0.965   30331   41970   74   26     0      0

Server configurations:
Intel Xeon Quad-core 2.0GHz  2 cpus/8 cores/8 threads
64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)

======oprofile CPU_CLK_UNHALTED for top 30 functions
Cycles% 2.6.24.2                   Cycles% 2.6.27.2
1.0500 qla24xx_start_scsi          1.2125 qla24xx_start_scsi
0.8089 schedule                    0.6962 kmem_cache_alloc
0.5864 kmem_cache_alloc            0.6209 qla24xx_intr_handler
0.4989 __blockdev_direct_IO        0.4895 copy_user_generic_string
0.4152 copy_user_generic_string    0.4591 __blockdev_direct_IO
0.3953 qla24xx_intr_handler        0.4409 __end_that_request_first
0.3596 scsi_request_fn             0.3729 __switch_to
0.3188 __switch_to                 0.3716 try_to_wake_up
0.2889 lock_timer_base             0.3531 lock_timer_base
0.2519 task_rq_lock                0.3393 scsi_request_fn
0.2474 aio_complete                0.3038 aio_complete
0.2460 scsi_alloc_sgtable          0.2989 memset_c
0.2445 generic_make_request        0.2633 qla2x00_process_completed_re
0.2263 qla2x00_process_completed_re 0.2583 pick_next_highest_task_rt
0.2118 blk_queue_end_tag           0.2578 generic_make_request
0.2085 dio_bio_complete            0.2510 __list_add
0.2021 e1000_xmit_frame            0.2459 task_rq_lock
0.2006 __end_that_request_first    0.2322 kmem_cache_free
0.1954 generic_file_aio_read       0.2206 blk_queue_end_tag
0.1949 kfree                       0.2205 __mod_timer
0.1915 tcp_sendmsg                 0.2179 update_curr_rt
0.1901 try_to_wake_up              0.2164 sd_prep_fn
0.1895 kref_get                    0.2130 kref_get
0.1864 __mod_timer                 0.2075 dio_bio_complete
0.1863 thread_return               0.2066 push_rt_task
0.1854 math_state_restore          0.1974 qla24xx_msix_default
0.1775 __list_add                  0.1935 generic_file_aio_read
0.1721 memset_c                    0.1870 scsi_device_unbusy
0.1706 find_vma                    0.1861 tcp_sendmsg
0.1688 read_tsc                    0.1843 e1000_xmit_frame

======oprofile CPU_CLK_UNHALTED for top 30 functions
Cycles% 2.6.24.2                   Cycles% 2.6.29-rc1
1.0500 qla24xx_start_scsi          1.0691 qla24xx_intr_handler
0.8089 schedule                    0.7701 copy_user_generic_string
0.5864 kmem_cache_alloc            0.7339 qla24xx_wrt_req_reg
0.4989 __blockdev_direct_IO        0.6458 kmem_cache_alloc
0.4152 copy_user_generic_string    0.5794 qla24xx_start_scsi
0.3953 qla24xx_intr_handler        0.5505 unmap_vmas
0.3596 scsi_request_fn             0.4869 __blockdev_direct_IO
0.3188 __switch_to                 0.4493 try_to_wake_up
0.2889 lock_timer_base             0.4291 scsi_request_fn
0.2519 task_rq_lock                0.4118 clear_page_c
0.2474 aio_complete                0.4002 __switch_to
0.2460 scsi_alloc_sgtable          0.3381 ring_buffer_consume
0.2445 generic_make_request        0.3366 rb_get_reader_page
0.2263 qla2x00_process_completed_re 0.3222 aio_complete
0.2118 blk_queue_end_tag           0.3135 memset_c
0.2085 dio_bio_complete            0.2875 __list_add
0.2021 e1000_xmit_frame            0.2673 task_rq_lock
0.2006 __end_that_request_first    0.2658 __end_that_request_first
0.1954 generic_file_aio_read       0.2615 qla2x00_process_completed_re
0.1949 kfree                       0.2615 lock_timer_base
0.1915 tcp_sendmsg                 0.2456 disk_map_sector_rcu
0.1901 try_to_wake_up              0.2427 tcp_sendmsg
0.1895 kref_get                    0.2413 e1000_xmit_frame
0.1864 __mod_timer                 0.2398 kmem_cache_free
0.1863 thread_return               0.2384 pick_next_highest_task_rt
0.1854 math_state_restore          0.2225 blk_queue_end_tag
0.1775 __list_add                  0.2211 sd_prep_fn
0.1721 memset_c                    0.2167 qla24xx_queuecommand
0.1706 find_vma                    0.2109 scsi_device_unbusy
0.1688 read_tsc                    0.2095 kref_get



* RE: Mainline kernel OLTP performance update
  2009-01-13 21:10 Mainline kernel OLTP performance update Ma, Chinang
@ 2009-01-13 22:44 ` Wilcox, Matthew R
  2009-01-15  0:35   ` Andrew Morton
  0 siblings, 1 reply; 105+ messages in thread
From: Wilcox, Matthew R @ 2009-01-13 22:44 UTC (permalink / raw)
  To: Ma, Chinang, linux-kernel
  Cc: Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B,
	Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong,
	Nueckel, Hubert, Chris Mason, Steven Rostedt


One encouraging thing is that we don't see a significant drop-off between 2.6.28 and 2.6.29-rc1, which I think is the first time we've not seen a big problem with -rc1.

To compare the top 30 functions between 2.6.28 and 2.6.29-rc1:

1.4257 qla24xx_start_scsi		1.0691 qla24xx_intr_handler
0.8784 kmem_cache_alloc			0.7701 copy_user_generic_string
0.6876 qla24xx_intr_handler		0.7339 qla24xx_wrt_req_reg
0.5834 copy_user_generic_string	0.6458 kmem_cache_alloc
0.4945 scsi_request_fn			0.5794 qla24xx_start_scsi
0.4846 __blockdev_direct_IO		0.5505 unmap_vmas
0.4187 try_to_wake_up			0.4869 __blockdev_direct_IO
0.3518 aio_complete			0.4493 try_to_wake_up
0.3513 __end_that_request_first	0.4291 scsi_request_fn
0.3483 __switch_to			0.4118 clear_page_c
0.3271 memset_c				0.4002 __switch_to
0.2976 qla2x00_process_completed_re	0.3381 ring_buffer_consume
0.2905 __list_add				0.3366 rb_get_reader_page
0.2901 generic_make_request		0.3222 aio_complete
0.2755 lock_timer_base			0.3135 memset_c
0.2741 blk_queue_end_tag		0.2875 __list_add
0.2593 kmem_cache_free			0.2673 task_rq_lock
0.2445 disk_map_sector_rcu		0.2658 __end_that_request_first
0.2370 pick_next_highest_task_rt	0.2615 qla2x00_process_completed_re
0.2323 scsi_device_unbusy		0.2615 lock_timer_base
0.2321 task_rq_lock			0.2456 disk_map_sector_rcu
0.2316 scsi_dispatch_cmd		0.2427 tcp_sendmsg
0.2239 kref_get				0.2413 e1000_xmit_frame
0.2237 dio_bio_complete			0.2398 kmem_cache_free
0.2194 push_rt_task			0.2384 pick_next_highest_task_rt
0.2145 __aio_get_req			0.2225 blk_queue_end_tag
0.2143 kfree				0.2211 sd_prep_fn
0.2138 __mod_timer			0.2167 qla24xx_queuecommand
0.2131 e1000_irq_enable			0.2109 scsi_device_unbusy
0.2091 scsi_softirq_done		0.2095 kref_get

It looks like a number of functions in the qla2x00 driver were split up, so it's probably best to ignore all the changes in qla* functions.

unmap_vmas is a new hot function.  It's been around since before git history started, and hasn't changed substantially between 2.6.28 and 2.6.29-rc1, so I suspect we're calling it more often.  I don't know why we'd be doing that.

clear_page_c is also new to the hot list.  I haven't tried to understand why this might be so.

The ring_buffer_consume() and rb_get_reader_page() functions are part of the oprofile code.  This seems to indicate a bug -- they should not be the #12 and #13 hottest functions in the kernel when monitoring a database run!

That seems to be about it for regressions.

> -----Original Message-----
> From: Ma, Chinang
> Sent: Tuesday, January 13, 2009 1:11 PM
> To: linux-kernel@vger.kernel.org
> Cc: Tripathi, Sharad C; arjan@linux.intel.com; Wilcox, Matthew R; Kleen,
> Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
> Xihong; Nueckel, Hubert; Chris Mason
> Subject: Mainline kernel OLTP performance update
> 
> This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
> 2.6.24.2 the regression is around 3.5%.
> 
> Linux OLTP Performance summary
> Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%  iowait%
> 2.6.24.2                1.000   21969   43425   76   24     0      0
> 2.6.27.2                0.973   30402   43523   74   25     0      1
> 2.6.29-rc1              0.965   30331   41970   74   26     0      0
> 
> Server configurations:
> Intel Xeon Quad-core 2.0GHz  2 cpus/8 cores/8 threads
> 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)
> 
> ======oprofile CPU_CLK_UNHALTED for top 30 functions
> Cycles% 2.6.24.2                   Cycles% 2.6.27.2
> 1.0500 qla24xx_start_scsi          1.2125 qla24xx_start_scsi
> 0.8089 schedule                    0.6962 kmem_cache_alloc
> 0.5864 kmem_cache_alloc            0.6209 qla24xx_intr_handler
> 0.4989 __blockdev_direct_IO        0.4895 copy_user_generic_string
> 0.4152 copy_user_generic_string    0.4591 __blockdev_direct_IO
> 0.3953 qla24xx_intr_handler        0.4409 __end_that_request_first
> 0.3596 scsi_request_fn             0.3729 __switch_to
> 0.3188 __switch_to                 0.3716 try_to_wake_up
> 0.2889 lock_timer_base             0.3531 lock_timer_base
> 0.2519 task_rq_lock                0.3393 scsi_request_fn
> 0.2474 aio_complete                0.3038 aio_complete
> 0.2460 scsi_alloc_sgtable          0.2989 memset_c
> 0.2445 generic_make_request        0.2633 qla2x00_process_completed_re
> 0.2263 qla2x00_process_completed_re 0.2583 pick_next_highest_task_rt
> 0.2118 blk_queue_end_tag           0.2578 generic_make_request
> 0.2085 dio_bio_complete            0.2510 __list_add
> 0.2021 e1000_xmit_frame            0.2459 task_rq_lock
> 0.2006 __end_that_request_first    0.2322 kmem_cache_free
> 0.1954 generic_file_aio_read       0.2206 blk_queue_end_tag
> 0.1949 kfree                       0.2205 __mod_timer
> 0.1915 tcp_sendmsg                 0.2179 update_curr_rt
> 0.1901 try_to_wake_up              0.2164 sd_prep_fn
> 0.1895 kref_get                    0.2130 kref_get
> 0.1864 __mod_timer                 0.2075 dio_bio_complete
> 0.1863 thread_return               0.2066 push_rt_task
> 0.1854 math_state_restore          0.1974 qla24xx_msix_default
> 0.1775 __list_add                  0.1935 generic_file_aio_read
> 0.1721 memset_c                    0.1870 scsi_device_unbusy
> 0.1706 find_vma                    0.1861 tcp_sendmsg
> 0.1688 read_tsc                    0.1843 e1000_xmit_frame
> 
> ======oprofile CPU_CLK_UNHALTED for top 30 functions
> Cycles% 2.6.24.2                   Cycles% 2.6.29-rc1
> 1.0500 qla24xx_start_scsi          1.0691 qla24xx_intr_handler
> 0.8089 schedule                    0.7701 copy_user_generic_string
> 0.5864 kmem_cache_alloc            0.7339 qla24xx_wrt_req_reg
> 0.4989 __blockdev_direct_IO        0.6458 kmem_cache_alloc
> 0.4152 copy_user_generic_string    0.5794 qla24xx_start_scsi
> 0.3953 qla24xx_intr_handler        0.5505 unmap_vmas
> 0.3596 scsi_request_fn             0.4869 __blockdev_direct_IO
> 0.3188 __switch_to                 0.4493 try_to_wake_up
> 0.2889 lock_timer_base             0.4291 scsi_request_fn
> 0.2519 task_rq_lock                0.4118 clear_page_c
> 0.2474 aio_complete                0.4002 __switch_to
> 0.2460 scsi_alloc_sgtable          0.3381 ring_buffer_consume
> 0.2445 generic_make_request        0.3366 rb_get_reader_page
> 0.2263 qla2x00_process_completed_re 0.3222 aio_complete
> 0.2118 blk_queue_end_tag           0.3135 memset_c
> 0.2085 dio_bio_complete            0.2875 __list_add
> 0.2021 e1000_xmit_frame            0.2673 task_rq_lock
> 0.2006 __end_that_request_first    0.2658 __end_that_request_first
> 0.1954 generic_file_aio_read       0.2615 qla2x00_process_completed_re
> 0.1949 kfree                       0.2615 lock_timer_base
> 0.1915 tcp_sendmsg                 0.2456 disk_map_sector_rcu
> 0.1901 try_to_wake_up              0.2427 tcp_sendmsg
> 0.1895 kref_get                    0.2413 e1000_xmit_frame
> 0.1864 __mod_timer                 0.2398 kmem_cache_free
> 0.1863 thread_return               0.2384 pick_next_highest_task_rt
> 0.1854 math_state_restore          0.2225 blk_queue_end_tag
> 0.1775 __list_add                  0.2211 sd_prep_fn
> 0.1721 memset_c                    0.2167 qla24xx_queuecommand
> 0.1706 find_vma                    0.2109 scsi_device_unbusy
> 0.1688 read_tsc                    0.2095 kref_get



* Re: Mainline kernel OLTP performance update
  2009-01-13 22:44 ` Wilcox, Matthew R
@ 2009-01-15  0:35   ` Andrew Morton
  2009-01-15  1:21     ` Matthew Wilcox
  0 siblings, 1 reply; 105+ messages in thread
From: Andrew Morton @ 2009-01-15  0:35 UTC (permalink / raw)
  To: Wilcox, Matthew R
  Cc: chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty

On Tue, 13 Jan 2009 15:44:17 -0700
"Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote:
>

(top-posting repaired.  That @intel.com address is a bad influence ;))

(cc linux-scsi)

> > -----Original Message-----
> > From: Ma, Chinang
> > Sent: Tuesday, January 13, 2009 1:11 PM
> > To: linux-kernel@vger.kernel.org
> > Cc: Tripathi, Sharad C; arjan@linux.intel.com; Wilcox, Matthew R; Kleen,
> > Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
> > Xihong; Nueckel, Hubert; Chris Mason
> > Subject: Mainline kernel OLTP performance update
> > 
> > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
> > 2.6.24.2 the regression is around 3.5%.
> > 
> > Linux OLTP Performance summary
> > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%  iowait%
> > 2.6.24.2                1.000   21969   43425   76   24     0      0
> > 2.6.27.2                0.973   30402   43523   74   25     0      1
> > 2.6.29-rc1              0.965   30331   41970   74   26     0      0
> > 
> > Server configurations:
> > Intel Xeon Quad-core 2.0GHz  2 cpus/8 cores/8 threads
> > 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)
>
> 
> One encouraging thing is that we don't see a significant drop-off between 2.6.28 and 2.6.29-rc1, which I think is the first time we've not seen a big problem with -rc1.
> 
> To compare the top 30 functions between 2.6.28 and 2.6.29-rc1:
> 
> 1.4257 qla24xx_start_scsi		1.0691 qla24xx_intr_handler
> 0.8784 kmem_cache_alloc			0.7701 copy_user_generic_string
> 0.6876 qla24xx_intr_handler		0.7339 qla24xx_wrt_req_reg
> 0.5834 copy_user_generic_string	0.6458 kmem_cache_alloc
> 0.4945 scsi_request_fn			0.5794 qla24xx_start_scsi
> 0.4846 __blockdev_direct_IO		0.5505 unmap_vmas
> 0.4187 try_to_wake_up			0.4869 __blockdev_direct_IO
> 0.3518 aio_complete			0.4493 try_to_wake_up
> 0.3513 __end_that_request_first	0.4291 scsi_request_fn
> 0.3483 __switch_to			0.4118 clear_page_c
> 0.3271 memset_c				0.4002 __switch_to
> 0.2976 qla2x00_process_completed_re	0.3381 ring_buffer_consume
> 0.2905 __list_add				0.3366 rb_get_reader_page
> 0.2901 generic_make_request		0.3222 aio_complete
> 0.2755 lock_timer_base			0.3135 memset_c
> 0.2741 blk_queue_end_tag		0.2875 __list_add
> 0.2593 kmem_cache_free			0.2673 task_rq_lock
> 0.2445 disk_map_sector_rcu		0.2658 __end_that_request_first
> 0.2370 pick_next_highest_task_rt	0.2615 qla2x00_process_completed_re
> 0.2323 scsi_device_unbusy		0.2615 lock_timer_base
> 0.2321 task_rq_lock			0.2456 disk_map_sector_rcu
> 0.2316 scsi_dispatch_cmd		0.2427 tcp_sendmsg
> 0.2239 kref_get				0.2413 e1000_xmit_frame
> 0.2237 dio_bio_complete			0.2398 kmem_cache_free
> 0.2194 push_rt_task			0.2384 pick_next_highest_task_rt
> 0.2145 __aio_get_req			0.2225 blk_queue_end_tag
> 0.2143 kfree				0.2211 sd_prep_fn
> 0.2138 __mod_timer			0.2167 qla24xx_queuecommand
> 0.2131 e1000_irq_enable			0.2109 scsi_device_unbusy
> 0.2091 scsi_softirq_done		0.2095 kref_get
> 
> It looks like a number of functions in the qla2x00 driver were split up, so it's probably best to ignore all the changes in qla* functions.
> 
> unmap_vmas is a new hot function.  It's been around since before git history started, and hasn't changed substantially between 2.6.28 and 2.6.29-rc1, so I suspect we're calling it more often.  I don't know why we'd be doing that.
> 
> clear_page_c is also new to the hot list.  I haven't tried to understand why this might be so.
> 
> The ring_buffer_consume() and rb_get_reader_page() functions are part of the oprofile code.  This seems to indicate a bug -- they should not be the #12 and #13 hottest functions in the kernel when monitoring a database run!
> 
> That seems to be about it for regressions.
> 

But the interrupt rate went through the roof.

A 3.5% slowdown in this workload is considered pretty serious, isn't it?


* Re: Mainline kernel OLTP performance update
  2009-01-15  0:35   ` Andrew Morton
@ 2009-01-15  1:21     ` Matthew Wilcox
  2009-01-15  2:04       ` Andrew Morton
  2009-01-15 16:48       ` Ma, Chinang
  0 siblings, 2 replies; 105+ messages in thread
From: Matthew Wilcox @ 2009-01-15  1:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi,
	arjan, andi.kleen, suresh.b.siddha, harita.chilukuri,
	douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason,
	srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty

On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
> On Tue, 13 Jan 2009 15:44:17 -0700
> "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote:
> >
> 
> (top-posting repaired.  That @intel.com address is a bad influence ;))

Alas, that email address goes to an Outlook client.  Not much to be done
about that.

> (cc linux-scsi)
> 
> > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
> > > 2.6.24.2 the regression is around 3.5%.
> > > 
> > > Linux OLTP Performance summary
> > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%  iowait%
> > > 2.6.24.2                1.000   21969   43425   76   24     0      0
> > > 2.6.27.2                0.973   30402   43523   74   25     0      1
> > > 2.6.29-rc1              0.965   30331   41970   74   26     0      0

> But the interrupt rate went through the roof.

Yes.  I forget why that was; I'll have to dig through my archives for
that.

> A 3.5% slowdown in this workload is considered pretty serious, isn't it?

Yes.  Anything above 0.3% is statistically significant.  1% is a big
deal.  The fact that we've lost 3.5% in the last year doesn't make
people happy.  There are a few things we've identified that have a big
effect:

 - Per-partition statistics.  Putting in a sysctl to stop doing them gets
   some of that back, but not as much as taking them out (even when
   the sysctl'd variable is in a __read_mostly section).  We tried a
   patch from Jens to speed up the search for a new partition, but it
   had no effect.  (A rough sketch of the sysctl-gate idea follows after
   this list.)

 - The RT scheduler changes.  They're better for some RT tasks, but not
   the database benchmark workload.  Chinang has posted about
   this before, but the thread didn't really go anywhere.
   http://marc.info/?t=122903815000001&r=1&w=2
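
As a rough illustration of the sysctl-gate shape, here is a toy userspace
model with made-up names -- it is not the real block-layer code.  Even with
the flag in a __read_mostly section, the load and branch stay in every
request's path, and the per-partition counters themselves keep the working
set large:

#include <stdio.h>

#define NPART 450

static int part_stats_enabled = 1;	/* stands in for the sysctl'd flag */
static unsigned long part_sectors[NPART];

static void account_io(int part, unsigned long sectors)
{
	if (!part_stats_enabled)
		return;			/* the check itself stays in the hot path */
	/* per-partition counters grow the working set with the partition count */
	part_sectors[part] += sectors;
}

int main(void)
{
	for (int i = 0; i < 1000000; i++)
		account_io(i % NPART, 8);
	printf("partition 0: %lu sectors\n", part_sectors[0]);
	return 0;
}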

SLUB would have had a huge negative effect if we were using it -- on the
order of 7% iirc.  SLQB is at least performance-neutral with SLAB.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


* Re: Mainline kernel OLTP performance update
  2009-01-15  1:21     ` Matthew Wilcox
@ 2009-01-15  2:04       ` Andrew Morton
  2009-01-15  2:27         ` Steven Rostedt
                           ` (3 more replies)
  2009-01-15 16:48       ` Ma, Chinang
  1 sibling, 4 replies; 105+ messages in thread
From: Andrew Morton @ 2009-01-15  2:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi,
	arjan, andi.kleen, suresh.b.siddha, harita.chilukuri,
	douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason,
	srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty

On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote:

> On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
> > On Tue, 13 Jan 2009 15:44:17 -0700
> > "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote:
> > >
> > 
> > (top-posting repaired.  That @intel.com address is a bad influence ;))
> 
> Alas, that email address goes to an Outlook client.  Not much to be done
> about that.

aspirin?

> > (cc linux-scsi)
> > 
> > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
> > > > 2.6.24.2 the regression is around 3.5%.
> > > > 
> > > > Linux OLTP Performance summary
> > > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%  iowait%
> > > > 2.6.24.2                1.000   21969   43425   76   24     0      0
> > > > 2.6.27.2                0.973   30402   43523   74   25     0      1
> > > > 2.6.29-rc1              0.965   30331   41970   74   26     0      0
> 
> > But the interrupt rate went through the roof.
> 
> Yes.  I forget why that was; I'll have to dig through my archives for
> that.

Oh.  I'd have thought that this alone could account for 3.5%.

> > A 3.5% slowdown in this workload is considered pretty serious, isn't it?
> 
> Yes.  Anything above 0.3% is statistically significant.  1% is a big
> deal.  The fact that we've lost 3.5% in the last year doesn't make
> people happy.  There's a few things we've identified that have a big
> effect:
> 
>  - Per-partition statistics.  Putting in a sysctl to stop doing them gets
>    some of that back, but not as much as taking them out (even when
>    the sysctl'd variable is in a __read_mostly section).  We tried a
>    patch from Jens to speed up the search for a new partition, but it
>    had no effect.

I find this surprising.

>  - The RT scheduler changes.  They're better for some RT tasks, but not
>    the database benchmark workload.  Chinang has posted about
>    this before, but the thread didn't really go anywhere.
>    http://marc.info/?t=122903815000001&r=1&w=2

Well.  It's more a case that it wasn't taken anywhere.  I appear to
have recently been informed that there have never been any
CPU-scheduler-caused regressions.  Please persist!

> SLUB would have had a huge negative effect if we were using it -- on the
> order of 7% iirc.  SLQB is at least performance-neutral with SLAB.

We really need to unblock that problem somehow.  I assume that
enterprise distros are shipping slab?



* Re: Mainline kernel OLTP performance update
  2009-01-15  2:04       ` Andrew Morton
@ 2009-01-15  2:27         ` Steven Rostedt
  2009-01-15  7:11             ` Ma, Chinang
  2009-01-15  2:39         ` Andi Kleen
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 105+ messages in thread
From: Steven Rostedt @ 2009-01-15  2:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Wilcox, Matthew R, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, linux-scsi, Andrew Vasquez,
	Anirban Chakraborty, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Gregory Haskins

(added Ingo, Thomas, Peter and Gregory)

On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote:
> On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote:
> 
> > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
> > > On Tue, 13 Jan 2009 15:44:17 -0700
> > > "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote:
> > > >
> > > 
> > > (top-posting repaired.  That @intel.com address is a bad influence ;))
> > 
> > Alas, that email address goes to an Outlook client.  Not much to be done
> > about that.
> 
> aspirin?
> 
> > > (cc linux-scsi)
> > > 
> > > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
> > > > > 2.6.24.2 the regression is around 3.5%.
> > > > > 
> > > > > Linux OLTP Performance summary
> > > > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%  iowait%
> > > > > 2.6.24.2                1.000   21969   43425   76   24     0      0
> > > > > 2.6.27.2                0.973   30402   43523   74   25     0      1
> > > > > 2.6.29-rc1              0.965   30331   41970   74   26     0      0
> > 
> > > But the interrupt rate went through the roof.
> > 
> > Yes.  I forget why that was; I'll have to dig through my archives for
> > that.
> 
> Oh.  I'd have thought that this alone could account for 3.5%.
> 
> > > A 3.5% slowdown in this workload is considered pretty serious, isn't it?
> > 
> > Yes.  Anything above 0.3% is statistically significant.  1% is a big
> > deal.  The fact that we've lost 3.5% in the last year doesn't make
> > people happy.  There's a few things we've identified that have a big
> > effect:
> > 
> >  - Per-partition statistics.  Putting in a sysctl to stop doing them gets
> >    some of that back, but not as much as taking them out (even when
> >    the sysctl'd variable is in a __read_mostly section).  We tried a
> >    patch from Jens to speed up the search for a new partition, but it
> >    had no effect.
> 
> I find this surprising.
> 
> >  - The RT scheduler changes.  They're better for some RT tasks, but not
> >    the database benchmark workload.  Chinang has posted about
> >    this before, but the thread didn't really go anywhere.
> >    http://marc.info/?t=122903815000001&r=1&w=2

I read the whole thread before I found what you were talking about here:

http://marc.info/?l=linux-kernel&m=122937424114658&w=2

With this comment:

"When setting foreground and log writer to rt-prio, the log latency reduced to 4.8ms. \
Performance is about 1.5% higher than the CFS result.  
On a side note, we had been using rt-prio on all DBMS processes and log writer ( in \
higher priority) for the best OLTP performance. That has worked pretty well until \
2.6.25 when the new rt scheduler introduced the pull/push task for lower scheduling \
latency for rt-task. That has negative impact on this workload, probably due to the \
more elaborated load calculation/balancing for hundred of foreground rt-prio \
processes. Also, there is that question of no production environment would run DBMS \
with rt-prio. That is why I am going back to explore CFS and see whether I can drop \
rt-prio for good."

A couple of questions:

1) how does the latest rt scheduler compare?  There have been a lot of improvements.
2) how many rt tasks?
3) what were the prios, producer compared to consumers, not actual numbers
4) have you tried pinning tasks?

RT is more about determinism than performance.  The old scheduler
migrated rt tasks the same way as other tasks.  That helped performance,
because it kept several rt tasks on the same CPU and cache hot even when
an rt task could have migrated, but it killed determinism (I was seeing
10 ms wake up times from the next-highest-prio task on a cpu, even when
another CPU was available).

If you pin a task to a cpu, then it skips over the push and pull logic,
which helps with performance too.
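
Something like this (an untested userspace sketch; the CPU number and
priority are just example values) is all the setup that pinning needs:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Untested sketch: pin the calling task to CPU 2 and make it SCHED_FIFO
 * priority 48.  The CPU number and priority are example values only. */
int main(void)
{
	cpu_set_t mask;
	struct sched_param sp;

	CPU_ZERO(&mask);
	CPU_SET(2, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
		perror("sched_setaffinity");
		return 1;
	}

	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = 48;
	if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
		perror("sched_setscheduler");
		return 1;
	}

	/* ... the pinned RT work runs here ... */
	return 0;
}

sched_setscheduler() needs root (or CAP_SYS_NICE), and once the affinity
mask is a single CPU the push/pull logic leaves the task alone.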

-- Steve



> 
> Well.  It's more a case that it wasn't taken anywhere.  I appear to
> have recently been informed that there have never been any
> CPU-scheduler-caused regressions.  Please persist!
> 
> > SLUB would have had a huge negative effect if we were using it -- on the
> > order of 7% iirc.  SLQB is at least performance-neutral with SLAB.
> 
> We really need to unblock that problem somehow.  I assume that
> enterprise distros are shipping slab?
> 



* Re: Mainline kernel OLTP performance update
  2009-01-15  2:04       ` Andrew Morton
  2009-01-15  2:27         ` Steven Rostedt
@ 2009-01-15  2:39         ` Andi Kleen
  2009-01-15  2:47           ` Matthew Wilcox
  2009-01-15  7:24         ` Nick Piggin
  2009-01-15 14:12         ` James Bottomley
  3 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2009-01-15  2:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Wilcox, Matthew R, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri,
	douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason,
	srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty

Andrew Morton <akpm@linux-foundation.org> writes:


>>    some of that back, but not as much as taking them out (even when
>>    the sysctl'd variable is in a __read_mostly section).  We tried a
>>    patch from Jens to speed up the search for a new partition, but it
>>    had no effect.
>
> I find this surprising.

The test system has thousands of disks/LUNs which it writes to
all the time, in addition to a workload which is a real cache pig.
So any increase in the per-LUN overhead directly leads to a lot
more cache misses in the kernel, because it increases the working set
there significantly.

>
>>  - The RT scheduler changes.  They're better for some RT tasks, but not
>>    the database benchmark workload.  Chinang has posted about
>>    this before, but the thread didn't really go anywhere.
>>    http://marc.info/?t=122903815000001&r=1&w=2
>
> Well.  It's more a case that it wasn't taken anywhere.  I appear to
> have recently been informed that there have never been any
> CPU-scheduler-caused regressions.  Please persist!

Just to clarify: the non-RT scheduler has never performed well on this
workload (although it seems to be getting slightly worse too), mostly
because of log writer starvation.

RT at some point performed significantly better, but then as the RT
behaviour was improved to be more fair on MP there were significant
regressions when running under RT.
I wouldn't really advocate making RT less fair again; it would
be better to just fix the non-RT scheduler to perform reasonably.
Unfortunately the thread above which was supposed to do that
didn't go anywhere.

>> SLUB would have had a huge negative effect if we were using it -- on the
>> order of 7% iirc.  SLQB is at least performance-neutral with SLAB.
>
> We really need to unblock that problem somehow.  I assume that
> enterprise distros are shipping slab?

The released ones all do.

-Andi
-- 
ak@linux.intel.com


* Re: Mainline kernel OLTP performance update
  2009-01-15  2:39         ` Andi Kleen
@ 2009-01-15  2:47           ` Matthew Wilcox
  2009-01-15  3:36             ` Andi Kleen
  2009-01-20 13:27             ` Jens Axboe
  0 siblings, 2 replies; 105+ messages in thread
From: Matthew Wilcox @ 2009-01-15  2:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri,
	douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason,
	srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty

On Thu, Jan 15, 2009 at 03:39:05AM +0100, Andi Kleen wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> >>    some of that back, but not as much as taking them out (even when
> >>    the sysctl'd variable is in a __read_mostly section).  We tried a
> >>    patch from Jens to speed up the search for a new partition, but it
> >>    had no effect.
> >
> > I find this surprising.
> 
> The test system has thousands of disks/LUNs which it writes to
> all the time, in addition to a workload which is a real cache pig. 
> So any increase in the per LUN overhead directly leads to a lot
> more cache misses in the kernel because it increases the working set
> there sigificantly.

This particular system has 450 spindles, but they're amalgamated into
30 logical volumes by the hardware or firmware.  Linux sees 30 LUNs.
Each one, though, has fifteen partitions on it, so that brings us back
up to 450 partitions.

This system, btw, is a scale model of the full system that would be used
to get published results.  If I remember correctly, a 1% performance
regression on this system is likely to translate to a 2% regression on
the full-scale system.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


* Re: Mainline kernel OLTP performance update
  2009-01-15  2:47           ` Matthew Wilcox
@ 2009-01-15  3:36             ` Andi Kleen
  2009-01-20 13:27             ` Jens Axboe
  1 sibling, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2009-01-15  3:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andi Kleen, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty

> This particular system has 450 spindles, but they're amalgamated into
> 30 logical volumes by the hardware or firmware.  Linux sees 30 LUNs.
> Each one, though, has fifteen partitions on it, so that brings us back
> up to 450 partitions.

Thanks for the correction.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* RE: Mainline kernel OLTP performance update
  2009-01-15  2:27         ` Steven Rostedt
@ 2009-01-15  7:11             ` Ma, Chinang
  0 siblings, 0 replies; 105+ messages in thread
From: Ma, Chinang @ 2009-01-15  7:11 UTC (permalink / raw)
  To: Steven Rostedt, Andrew Morton
  Cc: Matthew Wilcox, Wilcox, Matthew R, linux-kernel, Tripathi,
	Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri,
	Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert,
	chris.mason, linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Gregory Haskins

Trying to answer some of the questions below:

-Chinang

>-----Original Message-----
>From: Steven Rostedt [mailto:srostedt@redhat.com]
>Sent: Wednesday, January 14, 2009 6:27 PM
>To: Andrew Morton
>Cc: Matthew Wilcox; Wilcox, Matthew R; Ma, Chinang; linux-
>kernel@vger.kernel.org; Tripathi, Sharad C; arjan@linux.intel.com; Kleen,
>Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
>Xihong; Nueckel, Hubert; chris.mason@oracle.com; linux-scsi@vger.kernel.org;
>Andrew Vasquez; Anirban Chakraborty; Ingo Molnar; Thomas Gleixner; Peter
>Zijlstra; Gregory Haskins
>Subject: Re: Mainline kernel OLTP performance update
>
>(added Ingo, Thomas, Peter and Gregory)
>
>On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote:
>> On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote:
>>
>> > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
>> > > On Tue, 13 Jan 2009 15:44:17 -0700
>> > > "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote:
>> > > >
>> > >
>> > > (top-posting repaired.  That @intel.com address is a bad influence ;))
>> >
>> > Alas, that email address goes to an Outlook client.  Not much to be
>done
>> > about that.
>>
>> aspirin?
>>
>> > > (cc linux-scsi)
>> > >
>> > > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare
>to
>> > > > > 2.6.24.2 the regression is around 3.5%.
>> > > > >
>> > > > > Linux OLTP Performance summary
>> > > > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%
>iowait%
>> > > > > 2.6.24.2                1.000   21969   43425   76   24     0
>0
>> > > > > 2.6.27.2                0.973   30402   43523   74   25     0
>1
>> > > > > 2.6.29-rc1              0.965   30331   41970   74   26     0
>0
>> >
>> > > But the interrupt rate went through the roof.
>> >
>> > Yes.  I forget why that was; I'll have to dig through my archives for
>> > that.
>>
>> Oh.  I'd have thought that this alone could account for 3.5%.
>>
>> > > A 3.5% slowdown in this workload is considered pretty serious, isn't
>it?
>> >
>> > Yes.  Anything above 0.3% is statistically significant.  1% is a big
>> > deal.  The fact that we've lost 3.5% in the last year doesn't make
>> > people happy.  There's a few things we've identified that have a big
>> > effect:
>> >
>> >  - Per-partition statistics.  Putting in a sysctl to stop doing them
>gets
>> >    some of that back, but not as much as taking them out (even when
>> >    the sysctl'd variable is in a __read_mostly section).  We tried a
>> >    patch from Jens to speed up the search for a new partition, but it
>> >    had no effect.
>>
>> I find this surprising.
>>
>> >  - The RT scheduler changes.  They're better for some RT tasks, but not
>> >    the database benchmark workload.  Chinang has posted about
>> >    this before, but the thread didn't really go anywhere.
>> >    http://marc.info/?t=122903815000001&r=1&w=2
>
>I read the whole thread before I found what you were talking about here:
>
>http://marc.info/?l=linux-kernel&m=122937424114658&w=2
>
>With this comment:
>
>"When setting foreground and log writer to rt-prio, the log latency reduced
>to 4.8ms. \
>Performance is about 1.5% higher than the CFS result.
>On a side note, we had been using rt-prio on all DBMS processes and log
>writer ( in \
>higher priority) for the best OLTP performance. That has worked pretty well
>until \
>2.6.25 when the new rt scheduler introduced the pull/push task for lower
>scheduling \
>latency for rt-task. That has negative impact on this workload, probably
>due to the \
>more elaborated load calculation/balancing for hundred of foreground rt-
>prio \
>processes. Also, there is that question of no production environment would
>run DBMS \
>with rt-prio. That is why I am going back to explore CFS and see whether I
>can drop \
>rt-prio for good."
>

>A couple of questions:
>
>1) how does the latest rt scheduler compare?  There has been a lot of
>improvements.

It is difficult for me to isolate the recent rt scheduler improvements, as so many other changes were introduced to the kernel at the same time. A more accurate comparison would be to revert just the rt scheduler to the previous version and test the delta. I am not sure how to get that done.

>2) how many rt tasks?   
	Around 250 rt tasks.

>3) what were the prios, producer compared to consumers, not actual numbers
	I suppose the single log writer is the main producer (rt-prio 49, the highest rt-prio in this workload); it wakes up all foreground processes when the log write is done. The 240 foreground processes are the consumers (rt-prio 48). At any given time some of the 240 foreground processes are waiting for the log writer to finish flushing out the log data.

>4) have you tried pinning tasks?
>
We did try pinning foreground rt-processes to CPUs. That recovered about 1% performance but introduced idle time on some CPUs. Without load balancing, my solution is to pin more processes to the idle CPUs. I don't think this is a practical solution for the idle-time problem, as the process distribution needs to be adjusted again when upgrading to a different server.

>RT is more about determinism than performance. The old scheduler
>migrated rt tasks the same as other tasks. This helps with performance
>because it will keep several rt tasks on the same CPU and cache hot even
>when a rt task can migrate. This helps performance, but kills
>determinism (I was seeing 10 ms wake up times from the next-highest-prio
>task on a cpu, even when another CPU was available).
>
>If you pin a task to a cpu, then it skips over the push and pull logic
>and will help with performance too.
>
>-- Steve
>
>
>
>>
>> Well.  It's more a case that it wasn't taken anywhere.  I appear to
>> have recently been informed that there have never been any
>> CPU-scheduler-caused regressions.  Please persist!
>>
>> > SLUB would have had a huge negative effect if we were using it -- on
>the
>> > order of 7% iirc.  SLQB is at least performance-neutral with SLAB.
>>
>> We really need to unblock that problem somehow.  I assume that
>> enterprise distros are shipping slab?
>>



* Re: Mainline kernel OLTP performance update
  2009-01-15  2:04       ` Andrew Morton
  2009-01-15  2:27         ` Steven Rostedt
  2009-01-15  2:39         ` Andi Kleen
@ 2009-01-15  7:24         ` Nick Piggin
  2009-01-15  9:46           ` Pekka Enberg
  2009-01-16  0:27           ` Andrew Morton
  2009-01-15 14:12         ` James Bottomley
  3 siblings, 2 replies; 105+ messages in thread
From: Nick Piggin @ 2009-01-15  7:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Wilcox, Matthew R, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty

On Thursday 15 January 2009 13:04:31 Andrew Morton wrote:
> On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote:

> > SLUB would have had a huge negative effect if we were using it -- on the
> > order of 7% iirc.  SLQB is at least performance-neutral with SLAB.
>
> We really need to unblock that problem somehow.  I assume that
> enterprise distros are shipping slab?

SLES11 will ship with SLAB, FWIW. As I said in the SLQB thread, this was
not due to my input. But I think it was probably the right choice to make
in that situation.

The biggest problem with SLAB for SGI I think is alien caches bloating the
kmem cache footprint to many GB each on their huge systems, but SLAB has a
parameter to turn off alien caches anyway so I think that is a reasonable
workaround.

Given the OLTP regression, and also I'd hate to have to deal with even
more reports of people's order-N allocations failing... basically with the
regression potential there, I don't think there was a compelling case
found to use SLUB (ie. where does it actually help?).

I'm going to propose to try to unblock the problem by asking to merge SLQB
with a plan to end up picking just one general allocator (and SLOB).

Given that SLAB and SLUB are fairly mature, I wonder what you'd think of
taking SLQB into -mm and making it the default there for a while, to see
if anybody reports a problem?



* Re: Mainline kernel OLTP performance update
  2009-01-15  7:24         ` Nick Piggin
@ 2009-01-15  9:46           ` Pekka Enberg
  2009-01-15 13:52             ` Matthew Wilcox
  2009-01-16  0:27           ` Andrew Morton
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-01-15  9:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Matthew Wilcox, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty

On Thu, Jan 15, 2009 at 9:24 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> SLES11 will ship with SLAB, FWIW. As I said in the SLQB thread, this was
> not due to my input. But I think it was probably the right choice to make
> in that situation.
>
> The biggest problem with SLAB for SGI I think is alien caches bloating the
> kmem cache footprint to many GB each on their huge systems, but SLAB has a
> parameter to turn off alien caches anyway so I think that is a reasonable
> workaround.
>
> Given the OLTP regression, and also I'd hate to have to deal with even
> more reports of people's order-N allocations failing... basically with the
> regression potential there, I don't think there was a compelling case
> found to use SLUB (ie. where does it actually help?).
>
> I'm going to propose to try to unblock the problem by asking to merge SLQB
> with a plan to end up picking just one general allocator (and SLOB).

It would also be nice if someone could do the performance analysis on
the SLUB bug. I ran sysbench in oltp mode here and the results look
like this:

  [ number of transactions per second from 10 runs. ]

                   min      max      avg      sd
  2.6.29-rc1-slab  833.77   852.32   845.10   4.72
  2.6.29-rc1-slub  823.61   851.94   836.74   8.57
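
(avg and sd here are just the mean and sample standard deviation over the
ten runs; a minimal sketch of that computation, with placeholder values
rather than the real per-run numbers:)

#include <math.h>
#include <stdio.h>

/* Sketch of the avg/sd computation over per-run transactions-per-second
 * figures.  The ten values below are placeholders, not the measurements
 * behind the table above. */
int main(void)
{
	double tps[] = { 840, 845, 848, 833, 852, 846, 841, 850, 844, 852 };
	int n = sizeof(tps) / sizeof(tps[0]);
	double sum = 0.0, var = 0.0;

	for (int i = 0; i < n; i++)
		sum += tps[i];
	double avg = sum / n;

	for (int i = 0; i < n; i++)
		var += (tps[i] - avg) * (tps[i] - avg);
	double sd = sqrt(var / (n - 1));	/* sample standard deviation */

	printf("avg %.2f  sd %.2f\n", avg, sd);
	return 0;
}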

I used the following sysbench parameters:

  sysbench --test=oltp \
         --oltp-table-size=1000000 \
         --mysql-socket=/var/run/mysqld/mysqld.sock \
         prepare

  sysbench --num-threads=16 \
         --max-requests=100000 \
         --test=oltp --oltp-table-size=1000000 \
         --mysql-socket=/var/run/mysqld/mysqld.sock \
         --oltp-read-only run

And no, the numbers are not flipped, SLUB beats SLAB here. :(

		Pekka

$ mysql --version
mysql  Ver 14.12 Distrib 5.0.51a, for debian-linux-gnu (x86_64) using
readline 5.2

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips	: 3989.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips	: 3990.04
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

$ lspci
00:00.0 Host bridge: Intel Corporation Mobile 945GM/PM/GMS, 943/940GML
and 945GT Express Memory Controller Hub (rev 03)
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS,
943/940GML Express Integrated Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME,
943/940GML Express Integrated Graphics Controller (rev 03)
00:07.0 Performance counters: Intel Corporation Unknown device 27a3 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High
Definition Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express
Port 1 (rev 02)
00:1c.1 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express
Port 2 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB
UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB
UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB
UHCI Controller #3 (rev 02)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB
UHCI Controller #4 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2
EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev e2)
00:1f.0 ISA bridge: Intel Corporation 82801GBM (ICH7-M) LPC Interface
Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE
Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801GBM/GHM (ICH7 Family)
SATA IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 02)
01:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053
PCI-E Gigabit Ethernet Controller (rev 22)
02:00.0 Network controller: Atheros Communications Inc. AR5418
802.11abgn Wireless PCI Express Adapter (rev 01)
03:03.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 61)


* Re: Mainline kernel OLTP performance update
  2009-01-15  9:46           ` Pekka Enberg
@ 2009-01-15 13:52             ` Matthew Wilcox
  2009-01-15 14:42               ` Pekka Enberg
  2009-01-16 10:16               ` Pekka Enberg
  0 siblings, 2 replies; 105+ messages in thread
From: Matthew Wilcox @ 2009-01-15 13:52 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty

On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote:
> It would also be nice if someone could do the performance analysis on
> the SLUB bug. I ran sysbench in oltp mode here and the results look
> like this:
> 
>   [ number of transactions per second from 10 runs. ]
> 
>                    min      max      avg      sd
>   2.6.29-rc1-slab  833.77   852.32   845.10   4.72
>   2.6.29-rc1-slub  823.61   851.94   836.74   8.57
> 
> And no, the numbers are not flipped, SLUB beats SLAB here. :(

Um.  More transactions per second is good.  Your numbers show SLAB
beating SLUB (even on your dual-CPU system).  And SLAB shows a lower
standard deviation, which is also good.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-15  2:04       ` Andrew Morton
                           ` (2 preceding siblings ...)
  2009-01-15  7:24         ` Nick Piggin
@ 2009-01-15 14:12         ` James Bottomley
  2009-01-15 17:44           ` Andrew Morton
  3 siblings, 1 reply; 105+ messages in thread
From: James Bottomley @ 2009-01-15 14:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Wilcox, Matthew R, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty

On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote:
> On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote:
> > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
> > > > > Linux OLTP Performance summary
> > > > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%  iowait%
> > > > > 2.6.24.2                1.000   21969   43425   76   24     0      0
> > > > > 2.6.27.2                0.973   30402   43523   74   25     0      1
> > > > > 2.6.29-rc1              0.965   30331   41970   74   26     0      0
> > 
> > > But the interrupt rate went through the roof.
> > 
> > Yes.  I forget why that was; I'll have to dig through my archives for
> > that.
> 
> Oh.  I'd have thought that this alone could account for 3.5%.

Me too.  Anecdotally, I haven't noticed this in my lab machines, but
what I have noticed is on someone else's laptop (a hyperthreaded atom)
that I was trying to demo powertop on was that IPI reschedule interrupts
seem to be out of control ... they were ticking over at a really high
rate and preventing the CPU from spending much time in the low C and P
states.  To me this implicates some scheduler problem since that's the
primary producer of IPI reschedules ... I think it wouldn't be a
significant extrapolation to predict that the scheduler might be the
cause of the above problem as well.

James



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-15 13:52             ` Matthew Wilcox
@ 2009-01-15 14:42               ` Pekka Enberg
  2009-01-16 10:16               ` Pekka Enberg
  1 sibling, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-01-15 14:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nick Piggin, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty

Matthew Wilcox wrote:
> On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote:
>> It would also be nice if someone could do the performance analysis on
>> the SLUB bug. I ran sysbench in oltp mode here and the results look
>> like this:
>>
>>   [ number of transactions per second from 10 runs. ]
>>
>>                    min      max      avg      sd
>>   2.6.29-rc1-slab  833.77   852.32   845.10   4.72
>>   2.6.29-rc1-slub  823.61   851.94   836.74   8.57
>>
>> And no, the numbers are not flipped, SLUB beats SLAB here. :(
> 
> Um.  More transactions per second is good.  Your numbers show SLAB
> beating SLUB (even on your dual-CPU system).  And SLAB shows a lower
> standard deviation, which is also good.

*blush*

Will do oprofile tomorrow. Thanks Matthew.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* RE: Mainline kernel OLTP performance update
  2009-01-15  1:21     ` Matthew Wilcox
  2009-01-15  2:04       ` Andrew Morton
@ 2009-01-15 16:48       ` Ma, Chinang
  1 sibling, 0 replies; 105+ messages in thread
From: Ma, Chinang @ 2009-01-15 16:48 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: Wilcox, Matthew R, linux-kernel, Tripathi, Sharad C, arjan,
	Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner,
	Douglas W, Wang, Peter Xihong, Nueckel, Hubert, chris.mason,
	srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty



>-----Original Message-----
>From: Matthew Wilcox [mailto:matthew@wil.cx]
>Sent: Wednesday, January 14, 2009 5:22 PM
>To: Andrew Morton
>Cc: Wilcox, Matthew R; Ma, Chinang; linux-kernel@vger.kernel.org; Tripathi,
>Sharad C; arjan@linux.intel.com; Kleen, Andi; Siddha, Suresh B; Chilukuri,
>Harita; Styner, Douglas W; Wang, Peter Xihong; Nueckel, Hubert;
>chris.mason@oracle.com; srostedt@redhat.com; linux-scsi@vger.kernel.org;
>Andrew Vasquez; Anirban Chakraborty
>Subject: Re: Mainline kernel OLTP performance update
>
>On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
>> On Tue, 13 Jan 2009 15:44:17 -0700
>> "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote:
>> >
>>
>> (top-posting repaired.  That @intel.com address is a bad influence ;))
>
>Alas, that email address goes to an Outlook client.  Not much to be done
>about that.
>
>> (cc linux-scsi)
>>
>> > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
>> > > 2.6.24.2 the regression is around 3.5%.
>> > >
>> > > Linux OLTP Performance summary
>> > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%
>iowait%
>> > > 2.6.24.2                1.000   21969   43425   76   24     0      0
>> > > 2.6.27.2                0.973   30402   43523   74   25     0      1
>> > > 2.6.29-rc1              0.965   30331   41970   74   26     0      0
>
>> But the interrupt rate went through the roof.
>
>Yes.  I forget why that was; I'll have to dig through my archives for
>that.

I took a quick look at the interrupt figures between 2.6.24 and 2.6.27. I/O interrupts are slightly down in 2.6.27 (due to reduced throughput), but both NMI and reschedule interrupts increased. Reschedule interrupts are at 2x the 2.6.24 rate.

>
>> A 3.5% slowdown in this workload is considered pretty serious, isn't it?
>
>Yes.  Anything above 0.3% is statistically significant.  1% is a big
>deal.  The fact that we've lost 3.5% in the last year doesn't make
>people happy.  There's a few things we've identified that have a big
>effect:
>
> - Per-partition statistics.  Putting in a sysctl to stop doing them gets
>   some of that back, but not as much as taking them out (even when
>   the sysctl'd variable is in a __read_mostly section).  We tried a
>   patch from Jens to speed up the search for a new partition, but it
>   had no effect.
>
> - The RT scheduler changes.  They're better for some RT tasks, but not
>   the database benchmark workload.  Chinang has posted about
>   this before, but the thread didn't really go anywhere.
>   http://marc.info/?t=122903815000001&r=1&w=2
>
>SLUB would have had a huge negative effect if we were using it -- on the
>order of 7% iirc.  SLQB is at least performance-neutral with SLAB.
>
>--
>Matthew Wilcox				Intel Open Source Technology Centre
>"Bill, look, we understand that you're interested in selling us this
>operating system, but compare it to ours.  We can't possibly take such
>a retrograde step."

-Chinang

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-15 14:12         ` James Bottomley
@ 2009-01-15 17:44           ` Andrew Morton
  2009-01-15 18:00             ` Matthew Wilcox
  0 siblings, 1 reply; 105+ messages in thread
From: Andrew Morton @ 2009-01-15 17:44 UTC (permalink / raw)
  To: James Bottomley
  Cc: Matthew Wilcox, Wilcox, Matthew R, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty

On Thu, 15 Jan 2009 09:12:46 -0500 James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote:
> > On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote:
> > > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
> > > > > > Linux OLTP Performance summary
> > > > > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%  iowait%
> > > > > > 2.6.24.2                1.000   21969   43425   76   24     0      0
> > > > > > 2.6.27.2                0.973   30402   43523   74   25     0      1
> > > > > > 2.6.29-rc1              0.965   30331   41970   74   26     0      0
> > > 
> > > > But the interrupt rate went through the roof.
> > > 
> > > Yes.  I forget why that was; I'll have to dig through my archives for
> > > that.
> > 
> > Oh.  I'd have thought that this alone could account for 3.5%.
> 
> Me too.  Anecdotally, I haven't noticed this in my lab machines, but
> what I have noticed is on someone else's laptop (a hyperthreaded atom)
> that I was trying to demo powertop on was that IPI reschedule interrupts
> seem to be out of control ... they were ticking over at a really high
> rate and preventing the CPU from spending much time in the low C and P
> states.  To me this implicates some scheduler problem since that's the
> primary producer of IPI reschedules ... I think it wouldn't be a
> significant extrapolation to predict that the scheduler might be the
> cause of the above problem as well.
> 

Good point.

The context switch rate actually went down a bit.

I wonder if the Intel test people have records of /proc/interrupts for
the various kernel versions.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-15 17:44           ` Andrew Morton
@ 2009-01-15 18:00             ` Matthew Wilcox
  2009-01-15 18:14               ` Steven Rostedt
  0 siblings, 1 reply; 105+ messages in thread
From: Matthew Wilcox @ 2009-01-15 18:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Wilcox, Matthew R, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty

On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote:
> > Me too.  Anecdotally, I haven't noticed this in my lab machines, but
> > what I have noticed is on someone else's laptop (a hyperthreaded atom)
> > that I was trying to demo powertop on was that IPI reschedule interrupts
> > seem to be out of control ... they were ticking over at a really high
> > rate and preventing the CPU from spending much time in the low C and P
> > states.  To me this implicates some scheduler problem since that's the
> > primary producer of IPI reschedules ... I think it wouldn't be a
> > significant extrapolation to predict that the scheduler might be the
> > cause of the above problem as well.
> > 
> 
> Good point.
> 
> The context switch rate actually went down a bit.
> 
> I wonder if the Intel test people have records of /proc/interrupts for
> the various kernel versions.

I think Chinang does, but he's out of office today.  He did say in an
earlier reply:

> I took a quick look at the interrupt figures between 2.6.24 and 2.6.27.
> I/O interrupts are slightly down in 2.6.27 (due to reduced throughput),
> but both NMI and reschedule interrupts increased.  Reschedule interrupts
> are at 2x the 2.6.24 rate.

So if the reschedule interrupt is happening twice as often, and the
context switch rate is basically unchanged, I guess that means the
scheduler is doing a lot more work to get approximately the same
results.  And that seems like a bad thing.

Again, it's worth bearing in mind that these are all RT tasks, so the
underlying problem may be very different from the one that both James and
I have observed with an Atom laptop running predominantly non-RT tasks.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-15 18:00             ` Matthew Wilcox
@ 2009-01-15 18:14               ` Steven Rostedt
  2009-01-15 18:44                 ` Gregory Haskins
  2009-01-15 19:28                 ` Ma, Chinang
  0 siblings, 2 replies; 105+ messages in thread
From: Steven Rostedt @ 2009-01-15 18:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, James Bottomley, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty, Gregory Haskins


On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote:
> On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote:
> > > Me too.  Anecdotally, I haven't noticed this in my lab machines, but
> > > what I have noticed is on someone else's laptop (a hyperthreaded atom)
> > > that I was trying to demo powertop on was that IPI reschedule interrupts
> > > seem to be out of control ... they were ticking over at a really high
> > > rate and preventing the CPU from spending much time in the low C and P
> > > states.  To me this implicates some scheduler problem since that's the
> > > primary producer of IPI reschedules ... I think it wouldn't be a
> > > significant extrapolation to predict that the scheduler might be the
> > > cause of the above problem as well.
> > > 
> > 
> > Good point.
> > 
> > The context switch rate actually went down a bit.
> > 
> > I wonder if the Intel test people have records of /proc/interrupts for
> > the various kernel versions.
> 
> I think Chinang does, but he's out of office today.  He did say in an
> earlier reply:
> 
> > I took a quick look at the interrupt figures between 2.6.24 and 2.6.27.
> > I/O interrupts are slightly down in 2.6.27 (due to reduced throughput),
> > but both NMI and reschedule interrupts increased.  Reschedule interrupts
> > are at 2x the 2.6.24 rate.
> 
> So if the reschedule interrupt is happening twice as often, and the
> context switch rate is basically unchanged, I guess that means the
> scheduler is doing a lot more work to get approximately the same
> results.  And that seems like a bad thing.
> 
> Again, it's worth bearing in mind that these are all RT tasks, so the
> underlying problem may be very different from the one that both James and
> I have observed with an Atom laptop running predominantly non-RT tasks.
> 

The RT scheduler is a bit more aggressive than it used to be. It used to
just migrate RT tasks when the migration thread woke up, and did that in
"bulk".  Now, when an individual RT task wakes up and it cannot run on
the current CPU but can on another CPU, it is scheduled immediately, and
an IPI is sent out.
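
Roughly, as an illustration only (the two helper names below are made up,
not the real sched_rt.c functions):

static void rt_task_woken(struct rq *rq, struct task_struct *p)
{
	int cpu;

	if (p->prio < rq->curr->prio) {
		resched_task(rq->curr);		/* preempt this CPU directly */
		return;
	}

	cpu = find_lowest_rt_cpu(p);		/* made-up helper: cpupri search */
	if (cpu >= 0) {
		migrate_rt_task_to(p, cpu);	/* made-up helper */
		smp_send_reschedule(cpu);	/* the extra resched IPIs */
	}
}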

As for context switching, it would be the same amount as before, but the
difference is that the RT task will try to wake up as soon as possible.
This also causes RT tasks to bounce around CPUs more often.

If there are many threads, they should not be RT, unless there is some
design behind it.

Forgive me if you already did this and said so, but what is the result
of just making the writer an RT task and keeping all the readers as
SCHED_OTHER?

-- Steve



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-15 18:14               ` Steven Rostedt
@ 2009-01-15 18:44                 ` Gregory Haskins
  2009-01-15 18:46                     ` Wilcox, Matthew R
  2009-01-15 19:28                 ` Ma, Chinang
  1 sibling, 1 reply; 105+ messages in thread
From: Gregory Haskins @ 2009-01-15 18:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Matthew Wilcox, Andrew Morton, James Bottomley, Wilcox,
	Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty


Steven Rostedt wrote:
> On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote:
>   
>> On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote:
>>     
>>>> Me too.  Anecdotally, I haven't noticed this in my lab machines, but
>>>> what I have noticed is on someone else's laptop (a hyperthreaded atom)
>>>> that I was trying to demo powertop on was that IPI reschedule interrupts
>>>> seem to be out of control ... they were ticking over at a really high
>>>> rate and preventing the CPU from spending much time in the low C and P
>>>> states.  To me this implicates some scheduler problem since that's the
>>>> primary producer of IPI reschedules ... I think it wouldn't be a
>>>> significant extrapolation to predict that the scheduler might be the
>>>> cause of the above problem as well.
>>>>
>>>>         
>>> Good point.
>>>
>>> The context switch rate actually went down a bit.
>>>
>>> I wonder if the Intel test people have records of /proc/interrupts for
>>> the various kernel versions.
>>>       
>> I think Chinang does, but he's out of office today.  He did say in an
>> earlier reply:
>>
>>     
>>> I took a quick look at the interrupt figures between 2.6.24 and 2.6.27.
>>> I/O interrupts are slightly down in 2.6.27 (due to reduced throughput),
>>> but both NMI and reschedule interrupts increased.  Reschedule interrupts
>>> are at 2x the 2.6.24 rate.
>>>       
>> So if the reschedule interrupt is happening twice as often, and the
>> context switch rate is basically unchanged, I guess that means the
>> scheduler is doing a lot more work to get approximately the same
>> results.  And that seems like a bad thing.
>>     

I would be very interested in gathering some data in this area.  One
thing that pops to mind is to instrument the resched-ipi with
ftrace_printk() and gather a trace of this system in action.  I assume
that I wouldn't have access to this OLTP suite, so I may need a
volunteer to try this for me.  I could put together an instrumentation
patch for the testers' convenience if they prefer.
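
Something along these lines, for instance (untested sketch only, not an
actual patch; the real handler keeps its existing accounting):

void smp_reschedule_interrupt(struct pt_regs *regs)
{
	ack_APIC_irq();
	/* ... existing irq_resched_count accounting ... */
	ftrace_printk("resched IPI: cpu=%d curr=%s/%d\n",
		      smp_processor_id(), current->comm, current->pid);
}

With something like that in place, a simple ftrace dump would show which
CPUs are being kicked and what they were running at the time.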

Another data-point I wouldn't mind seeing is  looking at the scheduler
statistics, particularly with my sched-top utility, which you can find here:

http://rt.wiki.kernel.org/index.php/Schedtop_utility

(Note you may want to exclude the sched_info stats, as they are
inherently noisy and make it hard to see the real trends.  To do this
run it with: 'schedtop -x "sched_info"'.)

In the meantime, I will try similar approaches here on other non-OLTP
based workloads to see if I spy anything that looks amiss.
 
-Greg




^ permalink raw reply	[flat|nested] 105+ messages in thread

* RE: Mainline kernel OLTP performance update
  2009-01-15 18:44                 ` Gregory Haskins
@ 2009-01-15 18:46                     ` Wilcox, Matthew R
  0 siblings, 0 replies; 105+ messages in thread
From: Wilcox, Matthew R @ 2009-01-15 18:46 UTC (permalink / raw)
  To: Gregory Haskins, Steven Rostedt
  Cc: Matthew Wilcox, Andrew Morton, James Bottomley, Ma, Chinang,
	linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha,
	Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang,
	Peter Xihong, Nueckel, Hubert, chris.mason, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty


Gregory Haskins [mailto:ghaskins@novell.com] wrote:
> > On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote:
> >> So if the reschedule interrupt is happening twice as often, and the
> >> context switch rate is basically unchanged, I guess that means the
> >> scheduler is doing a lot more work to get approximately the same
> >> results.  And that seems like a bad thing.
> 
> I would be very interested in gathering some data in this area.  One
> thing that pops to mind is to instrument the resched-ipi with
> ftrace_printk() and gather a trace of this system in action.  I assume
> that I wouldn't have access to this OLTP suite, so I may need a
> volunteer to try this for me.  I could put together an instrumentation
> patch for the testers' convenience if they prefer.

I don't know whether Novell have an arrangement with the Well-Known Commercial Database and the Well-Known OLTP Benchmark to do runs like this.  Chinang is normally only too happy to build his own kernels with patches from people who are interested in helping, so that's probably the best way to do it.

I'm leaving for LCA in an hour or so, so further responses from me to this thread are unlikely ;-)

^ permalink raw reply	[flat|nested] 105+ messages in thread

* RE: Mainline kernel OLTP performance update
  2009-01-15 18:14               ` Steven Rostedt
  2009-01-15 18:44                 ` Gregory Haskins
@ 2009-01-15 19:28                 ` Ma, Chinang
  1 sibling, 0 replies; 105+ messages in thread
From: Ma, Chinang @ 2009-01-15 19:28 UTC (permalink / raw)
  To: Steven Rostedt, Matthew Wilcox
  Cc: Andrew Morton, James Bottomley, Wilcox, Matthew R, linux-kernel,
	Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B,
	Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong,
	Nueckel, Hubert, chris.mason, linux-scsi, Andrew Vasquez,
	Anirban Chakraborty, Gregory Haskins



>-----Original Message-----
>From: Steven Rostedt [mailto:srostedt@redhat.com]
>Sent: Thursday, January 15, 2009 10:15 AM
>To: Matthew Wilcox
>Cc: Andrew Morton; James Bottomley; Wilcox, Matthew R; Ma, Chinang; linux-
>kernel@vger.kernel.org; Tripathi, Sharad C; arjan@linux.intel.com; Kleen,
>Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
>Xihong; Nueckel, Hubert; chris.mason@oracle.com; linux-scsi@vger.kernel.org;
>Andrew Vasquez; Anirban Chakraborty; Gregory Haskins
>Subject: Re: Mainline kernel OLTP performance update
>
>
>On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote:
>> On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote:
>> > > Me too.  Anecdotally, I haven't noticed this in my lab machines, but
>> > > what I have noticed is on someone else's laptop (a hyperthreaded atom)
>> > > that I was trying to demo powertop on was that IPI reschedule
>interrupts
>> > > seem to be out of control ... they were ticking over at a really high
>> > > rate and preventing the CPU from spending much time in the low C and
>P
>> > > states.  To me this implicates some scheduler problem since that's
>the
>> > > primary producer of IPI reschedules ... I think it wouldn't be a
>> > > significant extrapolation to predict that the scheduler might be the
>> > > cause of the above problem as well.
>> > >
>> >
>> > Good point.
>> >
>> > The context switch rate actually went down a bit.
>> >
>> > I wonder if the Intel test people have records of /proc/interrupts for
>> > the various kernel versions.
>>
>> I think Chinang does, but he's out of office today.  He did say in an
>> earlier reply:
>>
>> > I took a quick look at the interrupt figures between 2.6.24 and 2.6.27.
>> > I/O interrupts are slightly down in 2.6.27 (due to reduced throughput),
>> > but both NMI and reschedule interrupts increased.  Reschedule interrupts
>> > are at 2x the 2.6.24 rate.
>>
>> So if the reschedule interrupt is happening twice as often, and the
>> context switch rate is basically unchanged, I guess that means the
>> scheduler is doing a lot more work to get approximately the same
>> results.  And that seems like a bad thing.
>>
>> Again, it's worth bearing in mind that these are all RT tasks, so the
>> underlying problem may be very different from the one that both James and
>> I have observed with an Atom laptop running predominantly non-RT tasks.
>>
>
>The RT scheduler is a bit more aggressive than it used to be. It used to
>just migrate RT tasks when the migration thread woke up, and did that in
>"bulk".  Now, when an individual RT task wakes up and it cannot run on
>the current CPU but can on another CPU, it is scheduled immediately, and
>an IPI is sent out.
>
>As for context switching, it would be the same amount as before, but the
>difference is that the RT task will try to wake up as soon as possible.
>This also causes RT tasks to bounce around CPUs more often.
>
>If there are many threads, they should not be RT, unless there is some
>design behind it.
>
>Forgive me if you already did this and said so, but what is the result
>of just making the writer an RT task and keeping all the readers as
>SCHED_OTHER?
>
>-- Steve
>

I think the higher OLTP throughput with rt-prio is due to the fixed time-slice. It is better to give a DBMS process a timeslice big enough to take a data buffer lock, process the data, release the lock and switch out while waiting on I/O, instead of being forced to switch out while still holding a data lock.

I suppose SCHED_OTHER is the default policy for user processes. We tried setting only the log writer to RT and leaving all other DBMS processes in the default sched policy, and the performance was ~1.5% lower than the all-rt-prio result.
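
(For reference, moving a single process into the RT class is just a
sched_setscheduler() call, roughly like the sketch below; the priority
value is only an example, not what the benchmark actually uses.)

#include <sched.h>
#include <sys/types.h>

static int make_log_writer_rt(pid_t log_writer_pid)
{
	struct sched_param sp = { .sched_priority = 50 };	/* example value */

	/* the log writer becomes SCHED_FIFO; everything else stays SCHED_OTHER */
	return sched_setscheduler(log_writer_pid, SCHED_FIFO, &sp);
}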


^ permalink raw reply	[flat|nested] 105+ messages in thread

* RE: Mainline kernel OLTP performance update
  2009-01-15 18:46                     ` Wilcox, Matthew R
  (?)
@ 2009-01-15 19:44                     ` Ma, Chinang
  2009-01-16 18:14                       ` Gregory Haskins
  -1 siblings, 1 reply; 105+ messages in thread
From: Ma, Chinang @ 2009-01-15 19:44 UTC (permalink / raw)
  To: Wilcox, Matthew R, Gregory Haskins, Steven Rostedt
  Cc: Matthew Wilcox, Andrew Morton, James Bottomley, linux-kernel,
	Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B,
	Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong,
	Nueckel, Hubert, chris.mason, linux-scsi, Andrew Vasquez,
	Anirban Chakraborty

Gregory. 
I will test the resched-ipi instrumentation patch with our OLTP if you can post the patch and some instructions.
Thanks,
-Chinang

>-----Original Message-----
>From: Wilcox, Matthew R
>Sent: Thursday, January 15, 2009 10:47 AM
>To: Gregory Haskins; Steven Rostedt
>Cc: Matthew Wilcox; Andrew Morton; James Bottomley; Ma, Chinang; linux-
>kernel@vger.kernel.org; Tripathi, Sharad C; arjan@linux.intel.com; Kleen,
>Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
>Xihong; Nueckel, Hubert; chris.mason@oracle.com; linux-scsi@vger.kernel.org;
>Andrew Vasquez; Anirban Chakraborty
>Subject: RE: Mainline kernel OLTP performance update
>
>Gregory Haskins [mailto:ghaskins@novell.com] wrote:
>> > On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote:
>> >> So if the reschedule interrupt is happening twice as often, and the
>> >> context switch rate is basically unchanged, I guess that means the
>> >> scheduler is doing a lot more work to get approximately the same
>> >> results.  And that seems like a bad thing.
>>
>> I would be very interested in gathering some data in this area.  One
>> thing that pops to mind is to instrument the resched-ipi with
>> ftrace_printk() and gather a trace of this system in action.  I assume
>> that I wouldn't have access to this OLTP suite, so I may need a
>> volunteer to try this for me.  I could put together an instrumentation
>> patch for the testers' convenience if they prefer.
>
>I don't know whether Novell have an arrangement with the Well-Known
>Commercial Database and the Well-Known OLTP Benchmark to do runs like this.
>Chinang is normally only too happy to build his own kernels with patches
>from people who are interested in helping, so that's probably the best way
>to do it.
>
>I'm leaving for LCA in an hour or so, so further responses from me to this
>thread are unlikely ;-)

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-15  7:24         ` Nick Piggin
  2009-01-15  9:46           ` Pekka Enberg
@ 2009-01-16  0:27           ` Andrew Morton
  2009-01-16  4:03             ` Nick Piggin
  1 sibling, 1 reply; 105+ messages in thread
From: Andrew Morton @ 2009-01-16  0:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: matthew, matthew.r.wilcox, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Thu, 15 Jan 2009 18:24:36 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Given that SLAB and SLUB are fairly mature, I wonder what you'd think of
> taking SLQB into -mm and making it the default there for a while, to see
> if anybody reports a problem?

Nobody would test it in interesting ways.

We'd get more testing in linux-next, but still not enough, and not of
the right type.

It would be better to just make the decision, merge it and forge ahead.

Me, I'd be 100% behind the idea if it had a credible prospect of a net
reduction in the number of slab allocator implementations.

I guess the naming convention will limit us to 26 of them.  Fortunate
indeed that the kernel isn't written in Cyrillic!



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16  0:27           ` Andrew Morton
@ 2009-01-16  4:03             ` Nick Piggin
  2009-01-16  4:12               ` Andrew Morton
  0 siblings, 1 reply; 105+ messages in thread
From: Nick Piggin @ 2009-01-16  4:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: matthew, matthew.r.wilcox, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Friday 16 January 2009 11:27:35 Andrew Morton wrote:
> On Thu, 15 Jan 2009 18:24:36 +1100
>
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > Given that SLAB and SLUB are fairly mature, I wonder what you'd think of
> > taking SLQB into -mm and making it the default there for a while, to see
> > if anybody reports a problem?
>
> Nobody would test it in interesting ways.
>
> We'd get more testing in linux-next, but still not enough, and not of
> the right type.

It would be better than nothing, for SLQB, I guess.


> It would be better to just make the decision, merge it and forge ahead.
>
> Me, I'd be 100% behind the idea if it had a credible prospect of a net
> reduction in the number of slab allocator implementations.

From the data we have so far, I think SLQB is a "credible prospect" to
replace SLUB and SLAB. But then again, apparently SLUB was a credible
prospect to replace SLAB when it was merged.

Unfortunately I can't honestly say that some serious regression will not
be discovered in SLQB that cannot be fixed. I guess that's never stopped
us merging other rewrites before, though.

I would like to see SLQB merged in mainline, made default, and wait for
some number releases. Then we take what we know, and try to make an
informed decision about the best one to take. I guess that is problematic
in that the rest of the kernel is moving underneath us. Do you have
another idea?


> I guess the naming convention will limit us to 26 of them.  Fortunate
> indeed that the kernel isn't written in Cyrillic!

I could have called it SL4B. 4 would be somehow fitting...


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16  4:03             ` Nick Piggin
@ 2009-01-16  4:12               ` Andrew Morton
  2009-01-16  6:46                 ` Nick Piggin
  0 siblings, 1 reply; 105+ messages in thread
From: Andrew Morton @ 2009-01-16  4:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: matthew, matthew.r.wilcox, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> I would like to see SLQB merged in mainline, made default, and wait for
> some number releases. Then we take what we know, and try to make an
> informed decision about the best one to take. I guess that is problematic
> in that the rest of the kernel is moving underneath us. Do you have
> another idea?

Nope.  If it doesn't work out, we can remove it again I guess.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16  4:12               ` Andrew Morton
@ 2009-01-16  6:46                 ` Nick Piggin
  2009-01-16  6:55                   ` Matthew Wilcox
                                     ` (2 more replies)
  0 siblings, 3 replies; 105+ messages in thread
From: Nick Piggin @ 2009-01-16  6:46 UTC (permalink / raw)
  To: Andrew Morton, netdev, sfr
  Cc: matthew, matthew.r.wilcox, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Friday 16 January 2009 15:12:10 Andrew Morton wrote:
> On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > I would like to see SLQB merged in mainline, made default, and wait for
> > some number releases. Then we take what we know, and try to make an
> > informed decision about the best one to take. I guess that is problematic
> > in that the rest of the kernel is moving underneath us. Do you have
> > another idea?
>
> Nope.  If it doesn't work out, we can remove it again I guess.

OK, I have these numbers to show I'm not completely off my rocker to suggest
we merge SLQB :) Given these results, how about I ask to merge SLQB as default
in linux-next, then if nothing catastrophic happens, merge it upstream in the
next merge window, then a couple of releases after that, given some time to
test and tweak SLQB, then we plan to bite the bullet and emerge with just one
main slab allocator (plus SLOB).


The system is a 2-socket, 4-core AMD. All debug and stats options are turned off for
all the allocators; default parameters (ie. SLUB using higher order pages,
and the others tend to be using order-0). SLQB is the version I recently
posted, with some of the prefetching removed according to Pekka's review
(probably a good idea to only add things like that in if/when they prove to
be an improvement).

time fio examples/netio (10 runs, lower better):
SLAB AVG=13.19 STD=0.40
SLQB AVG=13.78 STD=0.24
SLUB AVG=14.47 STD=0.23

SLAB makes a good showing here. The allocation/freeing pattern seems to be
very regular and easy (fast allocs and frees). So it could be some "lucky"
caching behaviour, I'm not exactly sure. I'll have to run more tests and
profiles here.


hackbench (10 runs, lower better):
1 GROUP
SLAB AVG=1.34 STD=0.05
SLQB AVG=1.31 STD=0.06
SLUB AVG=1.46 STD=0.07

2 GROUPS
SLAB AVG=1.20 STD=0.09
SLQB AVG=1.22 STD=0.12
SLUB AVG=1.21 STD=0.06

4 GROUPS
SLAB AVG=0.84 STD=0.05
SLQB AVG=0.81 STD=0.10
SLUB AVG=0.98 STD=0.07

8 GROUPS
SLAB AVG=0.79 STD=0.10
SLQB AVG=0.76 STD=0.15
SLUB AVG=0.89 STD=0.08

16 GROUPS
SLAB AVG=0.78 STD=0.08
SLQB AVG=0.79 STD=0.10
SLUB AVG=0.86 STD=0.05

32 GROUPS
SLAB AVG=0.86 STD=0.05
SLQB AVG=0.78 STD=0.06
SLUB AVG=0.88 STD=0.06

64 GROUPS
SLAB AVG=1.03 STD=0.05
SLQB AVG=0.90 STD=0.04
SLUB AVG=1.05 STD=0.06

128 GROUPS
SLAB AVG=1.31 STD=0.19
SLQB AVG=1.16 STD=0.36
SLUB AVG=1.29 STD=0.11

SLQB tends to be the winner here. SLAB is close at lower numbers of
groups, but drops behind a bit more as they increase.


tbench (10 runs, higher better):
1 THREAD
SLAB AVG=239.25 STD=31.74
SLQB AVG=257.75 STD=33.89
SLUB AVG=223.02 STD=14.73

2 THREADS
SLAB AVG=649.56 STD=9.77
SLQB AVG=647.77 STD=7.48
SLUB AVG=634.50 STD=7.66

4 THREADS
SLAB AVG=1294.52 STD=13.19
SLQB AVG=1266.58 STD=35.71
SLUB AVG=1228.31 STD=48.08

8 THREADS
SLAB AVG=2750.78 STD=26.67
SLQB AVG=2758.90 STD=18.86
SLUB AVG=2685.59 STD=22.41

16 THREADS
SLAB AVG=2669.11 STD=58.34
SLQB AVG=2671.69 STD=31.84
SLUB AVG=2571.05 STD=45.39

SLAB and SLQB seem to be pretty close, winning some and losing some.
They're always within a standard deviation of one another, so we can't
make conclusions between them. SLUB seems to be a bit slower.


Netperf UDP unidirectional send test (10 runs, higher better):

Server and client bound to same CPU
SLAB AVG=60.111 STD=1.59382
SLQB AVG=60.167 STD=0.685347
SLUB AVG=58.277 STD=0.788328

Server and client bound to same socket, different CPUs
SLAB AVG=85.938 STD=0.875794
SLQB AVG=93.662 STD=2.07434
SLUB AVG=81.983 STD=0.864362

Server and client bound to different sockets
SLAB AVG=78.801 STD=1.44118
SLQB AVG=78.269 STD=1.10457
SLUB AVG=71.334 STD=1.16809

SLQB is up with SLAB for the first and last cases, and faster in
the second case. SLUB trails in each case. (Any ideas for better types
of netperf tests?)


Kbuild numbers don't seem to be significantly different. SLAB and SLQB
actually got exactly the same average over 10 runs. The user+sys times
tend to be almost identical between allocators, with elapsed time mainly
depending on how much time the CPU was not idle.


Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within
their measurement confidence interval. If it comes down to it, I think we
could get them to do more runs to narrow that down, but we're talking a
couple of tenths of a percent already.


I haven't done any non-local network tests. Networking is the one of the
subsystems most heavily dependent on slab performance, so if anybody
cares to run their favourite tests, that would be really helpful.

Disclaimer
----------
Now remember this is just one specific HW configuration, and some
allocators for some reason give significantly (and sometimes perplexingly)
different results between different CPU and system architectures.

The other frustrating thing is that sometimes you happen to get a lucky
or unlucky cache or NUMA layout depending on the compile, the boot, etc.
So sometimes results get a little "skewed" in a way that isn't reflected
in the STDDEV. But I've tried to minimise that. Dropping caches and
restarting services etc. between individual runs.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16  6:46                 ` Nick Piggin
@ 2009-01-16  6:55                   ` Matthew Wilcox
  2009-01-16  7:06                     ` Nick Piggin
  2009-01-16  7:53                     ` Zhang, Yanmin
  2009-01-16  7:00                   ` Mainline kernel OLTP performance update Andrew Morton
  2009-01-16 18:11                   ` Rick Jones
  2 siblings, 2 replies; 105+ messages in thread
From: Matthew Wilcox @ 2009-01-16  6:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Zhang, Yanmin

On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote:
> Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within
> their measurement confidence interval. If it comes down to it, I think we
> could get them to do more runs to narrow that down, but we're talking a
> couple of tenths of a percent already.

I think I can speak with some measure of confidence for at least the
OLTP-testing part of my company when I say that I have no objection to
Nick's planned merge scheme.

I believe the kernel benchmark group have also done some testing with
SLQB and have generally positive things to say about it (Yanmin added to
the gargantuan cc).

Did slabtop get fixed to work with SLQB?

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16  6:46                 ` Nick Piggin
  2009-01-16  6:55                   ` Matthew Wilcox
@ 2009-01-16  7:00                   ` Andrew Morton
  2009-01-16  7:25                     ` Nick Piggin
  2009-01-16  8:59                     ` Nick Piggin
  2009-01-16 18:11                   ` Rick Jones
  2 siblings, 2 replies; 105+ messages in thread
From: Andrew Morton @ 2009-01-16  7:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Friday 16 January 2009 15:12:10 Andrew Morton wrote:
> > On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <nickpiggin@yahoo.com.au> 
> wrote:
> > > I would like to see SLQB merged in mainline, made default, and wait for
> > > some number releases. Then we take what we know, and try to make an
> > > informed decision about the best one to take. I guess that is problematic
> > > in that the rest of the kernel is moving underneath us. Do you have
> > > another idea?
> >
> > Nope.  If it doesn't work out, we can remove it again I guess.
> 
> OK, I have these numbers to show I'm not completely off my rocker to suggest
> we merge SLQB :) Given these results, how about I ask to merge SLQB as default
> in linux-next, then if nothing catastrophic happens, merge it upstream in the
> next merge window, then a couple of releases after that, given some time to
> test and tweak SLQB, then we plan to bite the bullet and emerge with just one
> main slab allocator (plus SLOB).

That's a plan.

> SLQB tends to be the winner here.

Can you think of anything with which it will be the loser?


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16  6:55                   ` Matthew Wilcox
@ 2009-01-16  7:06                     ` Nick Piggin
  2009-01-16  7:53                     ` Zhang, Yanmin
  1 sibling, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2009-01-16  7:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Zhang, Yanmin

On Friday 16 January 2009 17:55:47 Matthew Wilcox wrote:
> On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote:
> > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within
> > their measurement confidence interval. If it comes down to it, I think we
> > could get them to do more runs to narrow that down, but we're talking a
> > couple of tenths of a percent already.
>
> I think I can speak with some measure of confidence for at least the
> OLTP-testing part of my company when I say that I have no objection to
> Nick's planned merge scheme.
>
> I believe the kernel benchmark group have also done some testing with
> SLQB and have generally positive things to say about it (Yanmin added to
> the gargantuan cc).
>
> Did slabtop get fixed to work with SLQB?

Yes the old slabtop that works on /proc/slabinfo works with SLQB (ie. SLQB
implements /proc/slabinfo).

Lin Ming recently also ported the SLUB /sys/kernel/slab/ specific slabinfo
tool to SLQB. Basically it reports in-depth internal event counts etc. and
can operate on individual caches, making it very useful for performance
"observability" and tuning.

It is hard to come up with a single set of statistics that apply usefully
to all the allocators. FWIW, it would be a useful tool to port over to
SLAB too, if we end up deciding to go with SLAB.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16  7:00                   ` Mainline kernel OLTP performance update Andrew Morton
@ 2009-01-16  7:25                     ` Nick Piggin
  2009-01-16  8:59                     ` Nick Piggin
  1 sibling, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2009-01-16  7:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Friday 16 January 2009 18:00:43 Andrew Morton wrote:
> On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > On Friday 16 January 2009 15:12:10 Andrew Morton wrote:
> > > On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin
> > > <nickpiggin@yahoo.com.au>
> >
> > wrote:
> > > > I would like to see SLQB merged in mainline, made default, and wait
> > > > for some number releases. Then we take what we know, and try to make
> > > > an informed decision about the best one to take. I guess that is
> > > > problematic in that the rest of the kernel is moving underneath us.
> > > > Do you have another idea?
> > >
> > > Nope.  If it doesn't work out, we can remove it again I guess.
> >
> > OK, I have these numbers to show I'm not completely off my rocker to
> > suggest we merge SLQB :) Given these results, how about I ask to merge
> > SLQB as default in linux-next, then if nothing catastrophic happens,
> > merge it upstream in the next merge window, then a couple of releases
> > after that, given some time to test and tweak SLQB, then we plan to bite
> > the bullet and emerge with just one main slab allocator (plus SLOB).
>
> That's a plan.
>
> > SLQB tends to be the winner here.
>
> Can you think of anything with which it will be the loser?

Well, that fio test showed it was behind SLAB. I just discovered that
yesterday while running these tests, so I'll take a look at that. The
Intel performance guys I think have one or two cases where it is slower.
They don't seem to be too serious, and tend to be specific to some
machines (eg. the same test with a different CPU architecture turns out
to be faster). So I'll be looking into these things, but I haven't seen
anything too serious yet. I'm mostly interested in macro benchmarks and
more real world workloads.

At a higher level, SLAB has some interesting features. It basically has
"crossbars" of queues, that basically provide queues for allocating and
freeing to and from different CPUs and nodes. This is what bloats up
the kmem_cache data structures to tens or hundreds of gigabytes each
on SGI size systems. But it is also has good properties. On smaller
multiprocessor and NUMA systems, it might be the case that SLAB does
better in workloads that involve objects being allocated on one CPU and
freed on another. I haven't actually observed problems here, but I don't
have a lot of good tests.

SLAB is also fundamentally different from SLUB and SLQB in that it uses
arrays to store pointers to objects in its queues, rather than having
a linked list using pointers embedded in the objects. This might in some
cases make it easier to prefetch objects in parallel with finding the
object itself. I haven't actually been able to attribute a particular
regression to this interesting difference, but it might turn up as an
issue.

These are two big differences between SLAB and SLQB.

The linked lists of objects were used in favour of arrays again because of
the memory overhead, and to have a better ability to tune the size of the
queues, and reduced overhead in copying around arrays of pointers (SLQB can
just copy the head of one list to the tail of another in order to move
objects around), and eliminated the need to have additional metadata beyond
the struct page for each slab.
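
To illustrate the difference (simplified sketch only, not the actual slab
or slqb structures; QUEUE_SIZE is just a placeholder):

#define QUEUE_SIZE 120			/* placeholder */

/* SLAB-style queue: an array of pointers to objects, i.e. separate
 * metadata that itself has to be sized and allocated. */
struct array_queue {
	unsigned int avail;
	void *entry[QUEUE_SIZE];
};

/* SLQB/SLUB-style queue: a free list threaded through the free objects
 * themselves; moving a whole list is just splicing head pointers. */
struct list_queue {
	void **head;			/* first free object */
	unsigned int nr;
};

static void *list_queue_alloc(struct list_queue *q)
{
	void **obj = q->head;

	if (!obj)
		return NULL;
	q->head = (void **)*obj;	/* next pointer lives inside the free object */
	q->nr--;
	return (void *)obj;
}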

The crossbars of queues were removed because of the bloating and memory
overhead issues. The fact that we now have linked lists helps a little bit
with this, because moving lists of objects around gets a bit easier.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16  6:55                   ` Matthew Wilcox
  2009-01-16  7:06                     ` Nick Piggin
@ 2009-01-16  7:53                     ` Zhang, Yanmin
  2009-01-16 10:20                       ` Andi Kleen
  1 sibling, 1 reply; 105+ messages in thread
From: Zhang, Yanmin @ 2009-01-16  7:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty

On Thu, 2009-01-15 at 23:55 -0700, Matthew Wilcox wrote:
> On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote:
> > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within
> > their measurement confidence interval. If it comes down to it, I think we
> > could get them to do more runs to narrow that down, but we're talking a
> > couple of tenths of a percent already.
> 
> I think I can speak with some measure of confidence for at least the
> OLTP-testing part of my company when I say that I have no objection to
> Nick's planned merge scheme.
> 
> I believe the kernel benchmark group have also done some testing with
> SLQB and have generally positive things to say about it (Yanmin added to
> the gargantuan cc).
We did run lots of benchmarks with SLQB. Comparing with SLUB, one highlight for
SLQB is netperf UDP-U-4k. On my x86-64 machines, if I start 1 client and 1 server
process and bind them to different physical cpus, the result of SLQB is about 20% better
than SLUB's. If I start CPU_NUM clients and the same number of servers without binding,
the results of SLQB are about 100% better than SLUB's. I think that's because SLQB
doesn't pass big object allocations straight through to the page allocator.
netperf UDP-U-1k shows less improvement with SLQB.
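
(The "passthrough" above means, roughly, the following; this is a
simplified sketch rather than the real mm/slub.c code, and the threshold
and helper name are illustrative:)

static void *kmalloc_with_passthrough(size_t size, gfp_t flags)
{
	if (size > PAGE_SIZE)		/* big objects skip the slab queues */
		return (void *)__get_free_pages(flags, get_order(size));

	return kmem_cache_alloc(kmalloc_cache_for(size), flags);	/* illustrative helper */
}

A ~4k buffer allocated that way never sees any per-CPU caching, which
would explain the UDP-U-4k difference.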

The results of other benchmarks vary. They are good on some machines,
but bad on others. However, the variation is small. For example, hackbench's result
with SLQB was about 1 second slower than with SLUB on the 8-core stoakley. After we worked
with Nick on a small code change, SLQB's result is a little better than SLUB's
with hackbench on stoakley.

We consider other variations as fluctuation.

All the testing use default SLUB and SLQB configuration.

> 
> Did slabtop get fixed to work with SLQB?
> 


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16  7:00                   ` Mainline kernel OLTP performance update Andrew Morton
  2009-01-16  7:25                     ` Nick Piggin
@ 2009-01-16  8:59                     ` Nick Piggin
  1 sibling, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2009-01-16  8:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Friday 16 January 2009 18:00:43 Andrew Morton wrote:
> On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au> 
> > SLQB tends to be the winner here.
>
> Can you think of anything with which it will be the loser?

Here are some more performance numbers with the "slub_test" kernel module.
It's basically a really tiny microbenchmark, so I don't consider its
results too useful, except that it does show up some problems in SLAB's
scalability that may start to bite as we continue to get more threads per
socket.

(I ran a few of these tests on one of Dave's 2 socket, 128 thread
systems, and slab gets really painful... these kinds of thread counts
may only be a couple of years away from x86).

All numbers are in CPU cycles.

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate 10000 objs then free them
obj size  SLAB       SLQB      SLUB
8           77+ 128   69+ 47   61+ 77
16          69+ 104  116+ 70   77+ 80
32          66+ 101   82+ 81   71+ 89
64          82+ 116   95+ 81   94+105
128        100+ 148  106+ 94  114+163
256        153+ 136  134+ 98  124+186
512        209+ 161  170+186  134+276
1024       331+ 249  236+245  134+283
2048       608+ 443  380+386  172+312
4096      1109+ 624  678+661  239+372
8192      1166+1077  767+683  535+433
16384     1213+1160  914+731  577+682

We can see SLAB has a fair bit more overhead in this case. SLUB starts
doing higher order allocations I think around size 256, which reduces
costs there. Don't know what the SLQB artifact at 16 is caused by...
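
(The two numbers in each row are cycles per object for the allocation
pass and the free pass; the test is basically a loop of this shape,
sketched from memory rather than pasted from the module:)

static void kmalloc_bulk_test(size_t size)
{
	static void *objs[10000];
	cycles_t t0, t1, t2;
	int i;

	t0 = get_cycles();
	for (i = 0; i < 10000; i++)
		objs[i] = kmalloc(size, GFP_KERNEL);
	t1 = get_cycles();
	for (i = 0; i < 10000; i++)
		kfree(objs[i]);
	t2 = get_cycles();

	printk(KERN_INFO "%zu: %llu + %llu cycles\n", size,
	       (unsigned long long)(t1 - t0) / 10000,
	       (unsigned long long)(t2 - t1) / 10000);
}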


2. Kmalloc: alloc/free test (repeatedly allocate and free)
       SLAB  SLQB  SLUB
8       98   90     94
16      98   90     93
32      98   90     93
64      99   90     94
128    100   92     93
256    104   93     95
512    105   94     97
1024   106   93     97
2048   107   95     95
4096   111   92     97
8192   111   94    631
16384  114   92    741

Here we see SLUB's allocator passthrough (or is it the lack of queueing?).
Straight line speed at small sizes is probably due to instructions in the
fastpaths. It's pretty meaningless though because it probably changes if
there is any actual load on the CPU, or another CPU architecture. Doesn't
look bad for SLQB though :)


Concurrent allocs
=================
1. Like the first single thread test, lots of allocs, then lots of frees.
But running on all CPUs. Average over all CPUs.
       SLAB        SLQB         SLUB
8        251+ 322    73+  47   65+  76
16       240+ 331    84+  53   67+  82
32       235+ 316    94+  57   77+  92
64       338+ 303   120+  66  105+ 136
128      549+ 355   139+ 166  127+ 344
256     1129+ 456   189+ 178  236+ 404
512     2085+ 872   240+ 217  244+ 419
1024    3895+1373   347+ 333  251+ 440
2048    7725+2579   616+ 695  373+ 588
4096   15320+4534  1245+1442  689+1002

A problem with SLAB scalability starts showing up on this system with only
4 threads per socket. Again, SLUB sees a benefit from higher order
allocations.


2. Same as 2nd single threaded test, alloc then free, on all CPUs.
      SLAB  SLQB  SLUB
8      99   90    93
16     99   90    93
32     99   90    93
64    100   91    94
128   102   90    93
256   105   94    97
512   106   93    97
1024  108   93    97
2048  109   93    96
4096  110   93    96

No surprises. Objects always fit in queues (or unqueues, in the case of
SLUB), so there is no cross cache traffic.


Remote free test
================
1. Allocate N objects on CPUs 1-7, then free them all from CPU 0. Average cost
   of all kmalloc+kfree
      SLAB        SLQB     SLUB
8       191+ 142   53+ 64  56+99
16      180+ 141   82+ 69  60+117
32      173+ 142  100+ 71  78+151
64      240+ 147  131+ 73  117+216
128     441+ 162  158+114  114+251
256     833+ 181  179+119  185+263
512    1546+ 243  220+132  194+292
1024   2886+ 341  299+135  201+312
2048   5737+ 577  517+139  291+370
4096  11288+1201  976+153  528+482


2. On each CPU N, allocate objects which are then freed by CPU (N+1) % NR_CPUS
   (ie. CPU1 frees objects allocated by CPU0), with all CPUs running at once.
      SLAB        SLQB     SLUB
8       236+ 331   72+123   64+ 114
16      232+ 345   80+125   71+ 139
32      227+ 342   85+134   82+ 183
64      324+ 336  140+138  111+ 219
128     569+ 384  245+201  145+ 337
256    1111+ 448  243+222  238+ 447
512    2091+ 871  249+244  247+ 470
1024   3923+1593  254+256  254+ 503
2048   7700+2968  273+277  369+ 699
4096  15154+5061  310+323  693+1220

SLAB's concurrent allocation bottlenecks show up again in these tests.

Unfortunately these are not very realistic tests of the remote freeing
pattern, because normally you would expect remote freeing and allocation to
happen concurrently, rather than all allocations up front, then all frees.
If the test behaved like that, then objects could probably fit in SLAB's
queues and it might see some good numbers.
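
Something closer to that would be a pair of pinned kthreads with a small
ring between them, so the remote frees hit the allocator in steady state.
Sketch only -- names are hypothetical, and it needs linux/kthread.h,
linux/slab.h, linux/spinlock.h plus kthread_bind() to pin the two threads:

#define RING_SIZE 512
static void *ring[RING_SIZE];
static DEFINE_SPINLOCK(ring_lock);
static unsigned int head, tail;

static int producer_thread(void *unused)	/* bound to CPU 0 */
{
	while (!kthread_should_stop()) {
		void *p = kmalloc(64, GFP_KERNEL);

		spin_lock(&ring_lock);
		if ((head + 1) % RING_SIZE != tail) {
			ring[head] = p;
			head = (head + 1) % RING_SIZE;
			p = NULL;
		}
		spin_unlock(&ring_lock);
		kfree(p);	/* ring full: free locally instead */
	}
	return 0;
}

static int consumer_thread(void *unused)	/* bound to CPU 1 */
{
	while (!kthread_should_stop()) {
		void *p = NULL;

		spin_lock(&ring_lock);
		if (tail != head) {
			p = ring[tail];
			tail = (tail + 1) % RING_SIZE;
		}
		spin_unlock(&ring_lock);
		kfree(p);	/* remote free; kfree(NULL) is a no-op */
	}
	return 0;
}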


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-15 13:52             ` Matthew Wilcox
  2009-01-15 14:42               ` Pekka Enberg
@ 2009-01-16 10:16               ` Pekka Enberg
  2009-01-16 10:21                 ` Nick Piggin
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-01-16 10:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nick Piggin, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote:
>> It would also be nice if someone could do the performance analysis on
>> the SLUB bug. I ran sysbench in oltp mode here and the results look
>> like this:
>>
>>   [ number of transactions per second from 10 runs. ]
>>
>>                    min      max      avg      sd
>>   2.6.29-rc1-slab  833.77   852.32   845.10   4.72
>>   2.6.29-rc1-slub  823.61   851.94   836.74   8.57
>>
>> And no, the numbers are not flipped, SLUB beats SLAB here. :(

On Thu, Jan 15, 2009 at 3:52 PM, Matthew Wilcox <matthew@wil.cx> wrote:
> Um.  More transactions per second is good.  Your numbers show SLAB
> beating SLUB (even on your dual-CPU system).  And SLAB shows a lower
> standard deviation, which is also good.

I had lockdep enabled in my config so I ran the tests again with
x86-64 defconfig and I'm back to square one:

  [ number of transactions per second from 10 runs, bigger is better ]

                   min      max      avg      sd
  2.6.29-rc1-slab  802.02   805.37   803.93   0.97
  2.6.29-rc1-slub  807.78   811.20   809.86   1.05

                        Pekka

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16  7:53                     ` Zhang, Yanmin
@ 2009-01-16 10:20                       ` Andi Kleen
  2009-01-20  5:16                         ` Zhang, Yanmin
  0 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2009-01-16 10:20 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr,
	matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi,
	arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty

"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes:


> I think that's because SLQB
> doesn't pass through big object allocation to page allocator.
> netperf UDP-U-1k has less improvement with SLQB.

That sounds like just the page allocator needs to be improved.
That would help everyone. We talked a bit about this earlier,
some of the heuristics for hot/cold pages are quite outdated
and have been tuned for obsolete machines and also its fast path
is quite long. Unfortunately no code currently.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16 10:16               ` Pekka Enberg
@ 2009-01-16 10:21                 ` Nick Piggin
  2009-01-16 10:31                   ` Pekka Enberg
  0 siblings, 1 reply; 105+ messages in thread
From: Nick Piggin @ 2009-01-16 10:21 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

On Friday 16 January 2009 21:16:31 Pekka Enberg wrote:
> On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote:
> >> It would also be nice if someone could do the performance analysis on
> >> the SLUB bug. I ran sysbench in oltp mode here and the results look
> >> like this:
> >>
> >>   [ number of transactions per second from 10 runs. ]
> >>
> >>                    min      max      avg      sd
> >>   2.6.29-rc1-slab  833.77   852.32   845.10   4.72
> >>   2.6.29-rc1-slub  823.61   851.94   836.74   8.57

> I had lockdep enabled in my config so I ran the tests again with
> x86-64 defconfig and I'm back to square one:
>
>   [ number of transactions per second from 10 runs, bigger is better ]
>
>                    min      max      avg      sd
>   2.6.29-rc1-slab  802.02   805.37   803.93   0.97
>   2.6.29-rc1-slub  807.78   811.20   809.86   1.05

Hm, I wonder why it is going slower with lockdep disabled?
Did something else change?


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16 10:21                 ` Nick Piggin
@ 2009-01-16 10:31                   ` Pekka Enberg
  2009-01-16 10:42                     ` Nick Piggin
  2009-01-16 20:59                     ` Christoph Lameter
  0 siblings, 2 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-01-16 10:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

On Friday 16 January 2009 21:16:31 Pekka Enberg wrote:
>> I had lockdep enabled in my config so I ran the tests again with
>> x86-64 defconfig and I'm back to square one:
>>
>>   [ number of transactions per second from 10 runs, bigger is better ]
>>
>>                    min      max      avg      sd
>>   2.6.29-rc1-slab  802.02   805.37   803.93   0.97
>>   2.6.29-rc1-slub  807.78   811.20   809.86   1.05

On Fri, Jan 16, 2009 at 12:21 PM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Hm, I wonder why it is going slower with lockdep disabled?
> Did something else change?

I don't have the exact config for the previous tests but it was just
my regular laptop config whereas the new tests are x86-64 defconfig.
So I think I'm just hitting some of the other OLTP regressions here,
aren't I? There's some scheduler related options such as
CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig
that I didn't have in the original tests. I can try without them if
you want but I'm not sure it's relevant for SLAB vs SLUB tests.

                                Pekka

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16 10:31                   ` Pekka Enberg
@ 2009-01-16 10:42                     ` Nick Piggin
  2009-01-16 10:55                       ` Pekka Enberg
  2009-01-16 20:59                     ` Christoph Lameter
  1 sibling, 1 reply; 105+ messages in thread
From: Nick Piggin @ 2009-01-16 10:42 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

On Friday 16 January 2009 21:31:03 Pekka Enberg wrote:
> On Friday 16 January 2009 21:16:31 Pekka Enberg wrote:
> >> I had lockdep enabled in my config so I ran the tests again with
> >> x86-64 defconfig and I'm back to square one:
> >>
> >>   [ number of transactions per second from 10 runs, bigger is better ]
> >>
> >>                    min      max      avg      sd
> >>   2.6.29-rc1-slab  802.02   805.37   803.93   0.97
> >>   2.6.29-rc1-slub  807.78   811.20   809.86   1.05
>
> On Fri, Jan 16, 2009 at 12:21 PM, Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > Hm, I wonder why it is going slower with lockdep disabled?
> > Did something else change?
>
> I don't have the exact config for the previous tests but it was just
> my regular laptop config whereas the new tests are x86-64 defconfig.
> So I think I'm just hitting some of the other OLTP regressions here,
> aren't I? There's some scheduler related options such as
> CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig
> that I didn't have in the original tests. I can try without them if
> you want but I'm not sure it's relevant for SLAB vs SLUB tests.

Oh no that's fine. It just looked like you repeated the test but
with lockdep disabled (and no other changes).



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16 10:42                     ` Nick Piggin
@ 2009-01-16 10:55                       ` Pekka Enberg
  2009-01-19  7:13                         ` Nick Piggin
  0 siblings, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-01-16 10:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

Hi Nick,

On Fri, Jan 16, 2009 at 12:42 PM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>> I don't have the exact config for the previous tests but it was just
>> my regular laptop config whereas the new tests are x86-64 defconfig.
>> So I think I'm just hitting some of the other OLTP regressions here,
>> aren't I? There's some scheduler related options such as
>> CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig
>> that I didn't have in the original tests. I can try without them if
>> you want but I'm not sure it's relevant for SLAB vs SLUB tests.
>
> Oh no that's fine. It just looked like you repeated the test but
> with lockdep disabled (and no other changes).

Right. In any case, I am still unable to reproduce the OLTP issue and
I've seen SLUB beat SLAB on my machine in most of the benchmarks
you've posted. So I have very mixed feelings about SLQB. It's very
nice that it works for OLTP but we still don't have much insight (i.e.
numbers) on why it's better. I'm also a bit worried whether SLQB has gotten
enough attention from the NUMA and HPC folks that brought us SLUB.

The good news is that SLQB can replace SLAB so either way, we're not
going to end up with four allocators. Whether it can replace SLUB
remains to be seen.

                        Pekka

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16  6:46                 ` Nick Piggin
  2009-01-16  6:55                   ` Matthew Wilcox
  2009-01-16  7:00                   ` Mainline kernel OLTP performance update Andrew Morton
@ 2009-01-16 18:11                   ` Rick Jones
  2009-01-19  7:43                     ` Nick Piggin
  2 siblings, 1 reply; 105+ messages in thread
From: Rick Jones @ 2009-01-16 18:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty

Nick Piggin wrote:
> OK, I have these numbers to show I'm not completely off my rocker to suggest
> we merge SLQB :) Given these results, how about I ask to merge SLQB as default
> in linux-next, then if nothing catastrophic happens, merge it upstream in the
> next merge window, then a couple of releases after that, given some time to
> test and tweak SLQB, then we plan to bite the bullet and emerge with just one
> main slab allocator (plus SLOB).
> 
> 
> System is a 2socket, 4 core AMD. 

Not exactly a large system :)  Barely NUMA even with just two sockets.

> All debug and stats options turned off for
> all the allocators; default parameters (ie. SLUB using higher order pages,
> and the others tend to be using order-0). SLQB is the version I recently
> posted, with some of the prefetching removed according to Pekka's review
> (probably a good idea to only add things like that in if/when they prove to
> be an improvement).
> 
> ...
 >
> Netperf UDP unidirectional send test (10 runs, higher better):
> 
> Server and client bound to same CPU
> SLAB AVG=60.111 STD=1.59382
> SLQB AVG=60.167 STD=0.685347
> SLUB AVG=58.277 STD=0.788328
> 
> Server and client bound to same socket, different CPUs
> SLAB AVG=85.938 STD=0.875794
> SLQB AVG=93.662 STD=2.07434
> SLUB AVG=81.983 STD=0.864362
> 
> Server and client bound to different sockets
> SLAB AVG=78.801 STD=1.44118
> SLQB AVG=78.269 STD=1.10457
> SLUB AVG=71.334 STD=1.16809
 > ...
> I haven't done any non-local network tests. Networking is one of the
> subsystems most heavily dependent on slab performance, so if anybody
> cares to run their favourite tests, that would be really helpful.

I'm guessing, but then are these Mbit/s figures? Would that be the sending 
throughput or the receiving throughput?

I love to see netperf used, but why UDP and loopback?  Also, how about the 
service demands?

rick jones

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-15 19:44                     ` Ma, Chinang
@ 2009-01-16 18:14                       ` Gregory Haskins
  2009-01-16 19:09                         ` Steven Rostedt
  2009-01-20 12:45                         ` Gregory Haskins
  0 siblings, 2 replies; 105+ messages in thread
From: Gregory Haskins @ 2009-01-16 18:14 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Wilcox, Matthew R, Steven Rostedt, Matthew Wilcox, Andrew Morton,
	James Bottomley, linux-kernel, Tripathi, Sharad C, arjan, Kleen,
	Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W,
	Wang, Peter Xihong, Nueckel, Hubert, chris.mason, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty


[-- Attachment #1.1: Type: text/plain, Size: 1964 bytes --]

Ma, Chinang wrote:
> Gregory. 
> I will test the resched-ipi instrumentation patch with our OLTP if you can post the patch and some instructions.
> Thanks,
> -Chinang
>   

Hi Chinang,
  Please find a patch attached which applies to linus.git as of today. 
You will also want to enable CONFIG_FUNCTION_TRACER as well as the trace
components.  Here is my system:

ghaskins@dev:~/sandbox/git/linux-2.6-rt> grep TRACE .config
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_TRACEPOINTS=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
CONFIG_X86_PTRACE_BTS=y
# CONFIG_ACPI_DEBUG_FUNC_TRACE is not set
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_SOUND_TRACEINIT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_STACKTRACE=y
# CONFIG_BACKTRACE_SELF_TEST is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_HW_BRANCH_TRACER=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_SYSPROF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_CONTEXT_SWITCH_TRACER=y
# CONFIG_BOOT_TRACER is not set
# CONFIG_TRACE_BRANCH_PROFILING is not set
CONFIG_POWER_TRACER=y
CONFIG_STACK_TRACER=y
CONFIG_HW_BRANCH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_KVM_TRACE is not set


Then on your booted system, do:

echo sched_switch > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_enabled
$run_oltp && echo 0 > /sys/kernel/debug/tracing/tracing_enabled

(where $run_oltp is your suite)

Then, email the contents of /sys/kernel/debug/tracing/trace to me

-Greg


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: instrumentation.patch --]
[-- Type: text/x-patch; name="instrumentation.patch", Size: 3263 bytes --]

ftrace instrumentation for RT tasks

From: Gregory Haskins <ghaskins@novell.com>

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/kernel/smp.c |    2 ++
 include/linux/sched.h |    6 ++++++
 kernel/sched.c        |    3 +++
 kernel/sched_rt.c     |   10 ++++++++++
 4 files changed, 21 insertions(+), 0 deletions(-)


diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index e6faa33..468abeb 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -118,6 +118,7 @@ static void native_smp_send_reschedule(int cpu)
 		WARN_ON(1);
 		return;
 	}
+	ftrace_printk("cpu %d\n", cpu);
 	send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
 }
 
@@ -171,6 +172,7 @@ static void native_smp_send_stop(void)
  */
 void smp_reschedule_interrupt(struct pt_regs *regs)
 {
+	ftrace_printk("NEEDS_RESCHED\n");
 	ack_APIC_irq();
 	inc_irq_stat(irq_resched_count);
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4cae9b8..a320692 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2094,8 +2094,14 @@ static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
 	return test_ti_thread_flag(task_thread_info(tsk), flag);
 }
 
+# define ftrace_printk(fmt...) __ftrace_printk(_THIS_IP_, fmt)
+extern int
+__ftrace_printk(unsigned long ip, const char *fmt, ...)
+	__attribute__ ((format (printf, 2, 3)));
+
 static inline void set_tsk_need_resched(struct task_struct *tsk)
 {
+	ftrace_printk("%s/%d\n", tsk->comm, tsk->pid);
 	set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
 }
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 52bbf1c..d55fcf1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1874,6 +1874,9 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 		      *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
 	u64 clock_offset;
 
+	ftrace_printk("migrate %s/%d [%d] -> [%d]\n",
+		      p->comm, p->pid, task_cpu(p), new_cpu);
+
 	clock_offset = old_rq->clock - new_rq->clock;
 
 	trace_sched_migrate_task(p, task_cpu(p), new_cpu);
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 954e1a8..59cf64b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1102,6 +1102,8 @@ static int push_rt_task(struct rq *rq)
 	if (!next_task)
 		return 0;
 
+	ftrace_printk("attempting push\n");
+
  retry:
 	if (unlikely(next_task == rq->curr)) {
 		WARN_ON(1);
@@ -1139,6 +1141,8 @@ static int push_rt_task(struct rq *rq)
 		goto out;
 	}
 
+	ftrace_printk("%s/%d\n", next_task->comm, next_task->pid);
+
 	deactivate_task(rq, next_task, 0);
 	set_task_cpu(next_task, lowest_rq->cpu);
 	activate_task(lowest_rq, next_task, 0);
@@ -1180,6 +1184,8 @@ static int pull_rt_task(struct rq *this_rq)
 	if (likely(!rt_overloaded(this_rq)))
 		return 0;
 
+	ftrace_printk("attempting pull\n");
+
 	next = pick_next_task_rt(this_rq);
 
 	for_each_cpu(cpu, this_rq->rd->rto_mask) {
@@ -1234,6 +1240,10 @@ static int pull_rt_task(struct rq *this_rq)
 
 			ret = 1;
 
+			ftrace_printk("pull %s/%d [%d] -> [%d]\n",
+				      p->comm, p->pid,
+				      src_rq->cpu, this_rq->cpu);
+
 			deactivate_task(src_rq, p, 0);
 			set_task_cpu(p, this_cpu);
 			activate_task(this_rq, p, 0);

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16 18:14                       ` Gregory Haskins
@ 2009-01-16 19:09                         ` Steven Rostedt
  2009-01-20 12:45                         ` Gregory Haskins
  1 sibling, 0 replies; 105+ messages in thread
From: Steven Rostedt @ 2009-01-16 19:09 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ma, Chinang, Wilcox, Matthew R, Matthew Wilcox, Andrew Morton,
	James Bottomley, linux-kernel, Tripathi, Sharad C, arjan, Kleen,
	Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W,
	Wang, Peter Xihong, Nueckel, Hubert, chris.mason, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty


On Fri, 2009-01-16 at 13:14 -0500, Gregory Haskins wrote:
> Ma, Chinang wrote:
> > Gregory. 
> > I will test the resched-ipi instrumentation patch with our OLTP if you can post the patch and some instructions.
> > Thanks,
> > -Chinang
> >   
> 
> Hi Chinang,
>   Please find a patch attached which applies to linus.git as of today. 
> You will also want to enable CONFIG_FUNCTION_TRACER as well as the trace
> components.  Here is my system:
> 

I don't see why CONFIG_FUNCTION_TRACER is needed.

> ghaskins@dev:~/sandbox/git/linux-2.6-rt> grep TRACE .config
> CONFIG_STACKTRACE_SUPPORT=y
> CONFIG_TRACEPOINTS=y
> CONFIG_HAVE_ARCH_TRACEHOOK=y
> CONFIG_BLK_DEV_IO_TRACE=y
> # CONFIG_TREE_RCU_TRACE is not set
> # CONFIG_PREEMPT_RCU_TRACE is not set
> CONFIG_X86_PTRACE_BTS=y
> # CONFIG_ACPI_DEBUG_FUNC_TRACE is not set
> CONFIG_NETFILTER_XT_TARGET_TRACE=m
> CONFIG_SOUND_TRACEINIT=y
> CONFIG_TRACE_IRQFLAGS_SUPPORT=y
> CONFIG_TRACE_IRQFLAGS=y
> CONFIG_STACKTRACE=y
> # CONFIG_BACKTRACE_SELF_TEST is not set
> CONFIG_USER_STACKTRACE_SUPPORT=y
> CONFIG_NOP_TRACER=y
> CONFIG_HAVE_FUNCTION_TRACER=y
> CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
> CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
> CONFIG_HAVE_DYNAMIC_FTRACE=y
> CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
> CONFIG_HAVE_HW_BRANCH_TRACER=y
> CONFIG_TRACER_MAX_TRACE=y
> CONFIG_FUNCTION_TRACER=y
> CONFIG_FUNCTION_GRAPH_TRACER=y
> CONFIG_IRQSOFF_TRACER=y
> CONFIG_SYSPROF_TRACER=y
> CONFIG_SCHED_TRACER=y

This CONFIG_SCHED_TRACER should be enough.

-- Steve

> CONFIG_CONTEXT_SWITCH_TRACER=y
> # CONFIG_BOOT_TRACER is not set
> # CONFIG_TRACE_BRANCH_PROFILING is not set
> CONFIG_POWER_TRACER=y
> CONFIG_STACK_TRACER=y
> CONFIG_HW_BRANCH_TRACER=y
> CONFIG_DYNAMIC_FTRACE=y
> CONFIG_FTRACE_MCOUNT_RECORD=y
> # CONFIG_FTRACE_STARTUP_TEST is not set
> # CONFIG_MMIOTRACE is not set
> # CONFIG_KVM_TRACE is not set
> 
> 
> Then on your booted system, do:
> 
> echo sched_switch > /sys/kernel/debug/tracing/current_tracer
> echo 1 > /sys/kernel/debug/tracing/tracing_enabled
> $run_oltp && echo 0 > /sys/kernel/debug/tracing/tracing_enabled
> 
> (where $run_oltp is your suite)
> 
> Then, email the contents of /sys/kernel/debug/tracing/trace to me
> 
> -Greg
> 


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16 10:31                   ` Pekka Enberg
  2009-01-16 10:42                     ` Nick Piggin
@ 2009-01-16 20:59                     ` Christoph Lameter
  1 sibling, 0 replies; 105+ messages in thread
From: Christoph Lameter @ 2009-01-16 20:59 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Matthew Wilcox, Andrew Morton, Wilcox, Matthew R,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty

On Fri, 16 Jan 2009, Pekka Enberg wrote:

> aren't I? There's some scheduler related options such as
> CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig
> that I didn't have in the original tests. I can try without them if
> you want but I'm not sure it's relevant for SLAB vs SLUB tests.

I have seen CONFIG_GROUP_SCHED to affect latency tests in significant
ways.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16 10:55                       ` Pekka Enberg
@ 2009-01-19  7:13                         ` Nick Piggin
  2009-01-19  8:05                           ` Pekka Enberg
  0 siblings, 1 reply; 105+ messages in thread
From: Nick Piggin @ 2009-01-19  7:13 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

On Friday 16 January 2009 21:55:30 Pekka Enberg wrote:
> Hi Nick,
>
> On Fri, Jan 16, 2009 at 12:42 PM, Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> >> I don't have the exact config for the previous tests but it was just
> >> my regular laptop config whereas the new tests are x86-64 defconfig.
> >> So I think I'm just hitting some of the other OLTP regressions here,
> >> aren't I? There's some scheduler related options such as
> >> CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig
> >> that I didn't have in the original tests. I can try without them if
> >> you want but I'm not sure it's relevant for SLAB vs SLUB tests.
> >
> > Oh no that's fine. It just looked like you repeated the test but
> > with lockdep disabled (and no other changes).
>
> Right. In any case, I am still unable to reproduce the OLTP issue and
> I've seen SLUB beat SLAB on my machine in most of the benchmarks
> you've posted.

SLUB was distinctly slower on the tbench, netperf, and hackbench
tests that I ran. These were faster with SLUB on your machine?
What kind of system is it?


> So I have very mixed feelings about SLQB. It's very
> nice that it works for OLTP but we still don't have much insight (i.e.
> numbers) on why it's better.

According to estimates in this thread, I think Matthew said SLUB would
be around 6% slower? SLQB is within measurement error of SLAB.

Fair point about personally reproducing the OLTP problem yourself. But
the fact is that we will get problem reports that cannot be reproduced.
That does not make them less relevant. I can't reproduce the OLTP
benchmark myself. And I'm fully expecting to get problem reports for
SLQB against insanely sized SGI systems, which I will take very seriously
and try to fix them.


> I'm also a bit worried whether SLQB has gotten
> enough attention from the NUMA and HPC folks that brought us SLUB.

It hasn't, but that's the problem we're hoping to solve by getting it
merged. People can give it more attention, and we can try to fix any
problems. SLUB has been default for quite a while now and not able to
solve all problems it has had reported against it. So I hope SLQB will
be able to unblock this situation.


> The good news is that SLQB can replace SLAB so either way, we're not
> going to end up with four allocators. Whether it can replace SLUB
> remains to be seen.

Well I think being able to simply replace SLAB is not ideal. The plan
I'm hoping is to have four allocators for a few releases, and then
go back to having two. That is going to mean some groups might not
have their ideal allocator merged... but I think it is crazy to settle
with more than one main compile-time allocator for the long term.

I don't know what the next redhat enterprise release is going to do,
but if they go with SLAB, then I think that means no SGI systems would
run in production with SLUB anyway, so what would be the purpose of
having a special "HPC/huge system" allocator? Or... what other reasons
should users select SLUB vs SLAB? (in terms of core allocator behaviour,
versus extras that can be ported from one to the other) If we can't even
make up our own minds, then will others be able to?


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16 18:11                   ` Rick Jones
@ 2009-01-19  7:43                     ` Nick Piggin
  2009-01-19 22:19                       ` Rick Jones
  0 siblings, 1 reply; 105+ messages in thread
From: Nick Piggin @ 2009-01-19  7:43 UTC (permalink / raw)
  To: Rick Jones
  Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty

On Saturday 17 January 2009 05:11:02 Rick Jones wrote:
> Nick Piggin wrote:
> > OK, I have these numbers to show I'm not completely off my rocker to
> > suggest we merge SLQB :) Given these results, how about I ask to merge
> > SLQB as default in linux-next, then if nothing catastrophic happens,
> > merge it upstream in the next merge window, then a couple of releases
> > after that, given some time to test and tweak SLQB, then we plan to bite
> > the bullet and emerge with just one main slab allocator (plus SLOB).
> >
> >
> > System is a 2socket, 4 core AMD.
>
> Not exactly a large system :)  Barely NUMA even with just two sockets.

You're right ;)

But at least it is exercising the NUMA paths in the allocator, and
represents a pretty common size of system...

I can run some tests on bigger systems at SUSE, but it is not always
easy to set up "real" meaningful workloads on them or configure
significant IO for them.


> > Netperf UDP unidirectional send test (10 runs, higher better):
> >
> > Server and client bound to same CPU
> > SLAB AVG=60.111 STD=1.59382
> > SLQB AVG=60.167 STD=0.685347
> > SLUB AVG=58.277 STD=0.788328
> >
> > Server and client bound to same socket, different CPUs
> > SLAB AVG=85.938 STD=0.875794
> > SLQB AVG=93.662 STD=2.07434
> > SLUB AVG=81.983 STD=0.864362
> >
> > Server and client bound to different sockets
> > SLAB AVG=78.801 STD=1.44118
> > SLQB AVG=78.269 STD=1.10457
> > SLUB AVG=71.334 STD=1.16809
> >
>  > ...
> >
> > I haven't done any non-local network tests. Networking is one of the
> > subsystems most heavily dependent on slab performance, so if anybody
> > cares to run their favourite tests, that would be really helpful.
>
> I'm guessing, but then are these Mbit/s figures? Would that be the sending
> throughput or the receiving throughput?

Yes, Mbit/s. They were... hmm, sending throughput I think, but each pair
of numbers seemed to be identical IIRC?


> I love to see netperf used, but why UDP and loopback?

No really good reason. I guess I was hoping to keep other variables as
small as possible. But I guess a real remote test would be a lot more
realistic as a networking test. Hmm, but I could probably set up a test
over a simple GbE link here.  I'll try that.


> Also, how about the
> service demands?

Well, over loopback and using CPU binding, I was hoping it wouldn't
change much... but I see netperf does some measurements for you. I
will consider those in future too.

BTW. is it possible to do parallel netperf tests?



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-19  7:13                         ` Nick Piggin
@ 2009-01-19  8:05                           ` Pekka Enberg
  2009-01-19  8:33                             ` Nick Piggin
  0 siblings, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-01-19  8:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

Hi Nick,

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> SLUB was distinctly slower on the tbench, netperf, and hackbench
> tests that I ran. These were faster with SLUB on your machine?

I was trying to bisect a somewhat recent SLAB vs. SLUB regression in
tbench that seems to be triggered by CONFIG_SLUB as suggested by Evgeniy
Polyakov's performance tests. Unfortunately I bisected it down to a bogus
commit so while I saw SLUB beating SLAB, I also saw the reverse in
nearby commits which didn't touch anything interesting. So for tbench,
SLUB _used to_ dominate SLAB on my machine but the current situation is
not as clear with all the tbench regressions in other subsystems.

SLUB has been a consistent winner for hackbench after Christoph fixed
the regression reported by Ingo Molnar two years (?) ago. I don't think
I've run netperf, but for the fio test you mentioned, SLUB is beating
SLAB here.

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> What kind of system is it?

2-way Core2. I posted my /proc/cpuinfo in this thread if you're
interested.

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > So I have very mixed feelings about SLQB. It's very
> > nice that it works for OLTP but we still don't have much insight (i.e.
> > numbers) on why it's better.

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> According to estimates in this thread, I think Matthew said SLUB would
> be around 6% slower? SLQB is within measurement error of SLAB.

Yeah, but my point is that we don't know _why_ it's better. There's the
kmalloc()/kfree() CPU ping-pong hypothesis, but it could also be due to
page allocator interaction or just a plain bug in SLUB. And let's not
forget bad interaction with some random subsystem (SCSI, for example).

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> Fair point about personally reproducing the OLTP problem yourself. But
> the fact is that we will get problem reports that cannot be reproduced.
> That does not make them less relevant. I can't reproduce the OLTP
> benchmark myself. And I'm fully expecting to get problem reports for
> SLQB against insanely sized SGI systems, which I will take very seriously
> and try to fix them.

Again, it's not that I don't take the OLTP regression seriously (I do)
but as a "part-time maintainer" I simply don't have the time and
resources to attempt to fix it without either (a) being able to
reproduce the problem or (b) have someone who can reproduce it who is
willing to do oprofile and so on.

So as much as I would have preferred that you had at least attempted to
fix SLUB, I'm more than happy that we have a very active developer
working on the problem now. I mean, I don't really care which allocator
we decide to go forward with, if all the relevant regressions are dealt
with.

All I am saying is that I don't like how we're fixing a performance bug
with a shiny new allocator without a credible explanation why the
current approach is not fixable.

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > The good news is that SLQB can replace SLAB so either way, we're not
> > going to end up with four allocators. Whether it can replace SLUB
> > remains to be seen.
> 
> Well I think being able to simply replace SLAB is not ideal. The plan
> I'm hoping is to have four allocators for a few releases, and then
> go back to having two. That is going to mean some groups might not
> have their ideal allocator merged... but I think it is crazy to settle
> with more than one main compile-time allocator for the long term.

So now the HPC folk will be screwed over by the OLTP folk? I guess
that's okay as the latter have been treated rather badly for the past
two years.... ;-)

			Pekka


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-19  8:05                           ` Pekka Enberg
@ 2009-01-19  8:33                             ` Nick Piggin
  2009-01-19  8:42                               ` Nick Piggin
  2009-01-19  9:48                               ` Pekka Enberg
  0 siblings, 2 replies; 105+ messages in thread
From: Nick Piggin @ 2009-01-19  8:33 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

On Monday 19 January 2009 19:05:03 Pekka Enberg wrote:
> Hi Nick,
>
> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > SLUB was distinctly slower on the tbench, netperf, and hackbench
> > tests that I ran. These were faster with SLUB on your machine?
>
> I was trying to bisect a somewhat recent SLAB vs. SLUB regression in
> tbench that seems to be triggered by CONFIG_SLUB as suggested by Evgeniy
> Polyakov performance tests. Unfortunately I bisected it down to a bogus
> commit so while I saw SLUB beating SLAB, I also saw the reverse in
> nearby commits which didn't touch anything interesting. So for tbench,
> SLUB _used to_ dominate SLAB on my machine but the current situation is
> not as clear with all the tbench regressions in other subsystems.

OK.


> SLUB has been a consistent winner for hackbench after Christoph fixed
> the regression reported by Ingo Molnar two years (?) ago. I don't think
> I've run netperf, but for the fio test you mentioned, SLUB is beating
> SLAB here.

Hmm, netperf, hackbench, and fio are all faster with SLAB than SLUB.


> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > What kind of system is it?
>
> 2-way Core2. I posted my /proc/cpuinfo in this thread if you're
> interested.

Thanks. I guess it's one of three obvious differences: mine is a K10, is
NUMA, and has significantly more cores. I can try setting it to
interleave cachelines over nodes or use fewer cores to see if the
picture changes...


> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > > So I have very mixed feelings about SLQB. It's very
> > > nice that it works for OLTP but we still don't have much insight (i.e.
> > > numbers) on why it's better.
>
> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > According to estimates in this thread, I think Matthew said SLUB would
> > be around 6% slower? SLQB is within measurement error of SLAB.
>
> Yeah, but my point is that we don't know _why_ it's better. There's the
> kmalloc()/kfree() CPU ping-pong hypothesis, but it could also be due to
> page allocator interaction or just a plain bug in SLUB. And let's not
> forget bad interaction with some random subsystem (SCSI, for example).
>
> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > Fair point about personally reproducing the OLTP problem yourself. But
> > the fact is that we will get problem reports that cannot be reproduced.
> > That does not make them less relevant. I can't reproduce the OLTP
> > benchmark myself. And I'm fully expecting to get problem reports for
> > SLQB against insanely sized SGI systems, which I will take very seriously
> > and try to fix them.
>
> Again, it's not that I don't take the OLTP regression seriously (I do)
> but as a "part-time maintainer" I simply don't have the time and
> resources to attempt to fix it without either (a) being able to
> reproduce the problem or (b) have someone who can reproduce it who is
> willing to do oprofile and so on.
>
> So as much as I would have preferred that you had at least attempted to
> fix SLUB, I'm more than happy that we have a very active developer
> working on the problem now. I mean, I don't really care which allocator
> we decide to go forward with, if all the relevant regressions are dealt
> with.

OK, good to know.


> All I am saying is that I don't like how we're fixing a performance bug
> with a shiny new allocator without a credible explanation why the
> current approach is not fixable.

To be honest, my biggest concern with SLUB is the higher order pages
thing. But Christoph always poo poos me when I raise that concern, and
it's hard to get concrete numbers showing real fragmentation problems
when it can take days or months to start biting.

It really stems from queueing versus not queueing I guess. And I think
SLUB is flawed due to its avoidance of queueing.


> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > > The good news is that SLQB can replace SLAB so either way, we're not
> > > going to end up with four allocators. Whether it can replace SLUB
> > > remains to be seen.
> >
> > Well I think being able to simply replace SLAB is not ideal. The plan
> > I'm hoping is to have four allocators for a few releases, and then
> > go back to having two. That is going to mean some groups might not
> > have their ideal allocator merged... but I think it is crazy to settle
> > with more than one main compile-time allocator for the long term.
>
> So now the HPC folk will be screwed over by the OLTP folk?

No. I'm imagining there will be a discussion of the 3, and at some
point an executive decision will be made if an agreement can't be
reached. At this point, I think that is a better and fairer option
than just asserting one allocator is better than another and making
it the default.

And... we have no indication that SLQB will be worse for HPC than
SLUB ;)


> I guess
> that's okay as the latter have been treated rather badly for the past
> two years.... ;-)

I don't know if that is meant to be sarcastic, but the OLTP performance
numbers almost never get better from one kernel to the next. Actually
the trend is downward. Mainly due to bloat or new features being added.

I think that at some level, controlled addition of features that may
add some cycles to these paths is not a bad idea (what good is Moore's
Law if we can't have shiny new features? :) But on the other hand, this
OLTP test is incredibly valuable to monitor the general performance-
health of this area of the kernel.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-19  8:33                             ` Nick Piggin
@ 2009-01-19  8:42                               ` Nick Piggin
  2009-01-19  8:47                                 ` Pekka Enberg
  2009-01-19  9:48                               ` Pekka Enberg
  1 sibling, 1 reply; 105+ messages in thread
From: Nick Piggin @ 2009-01-19  8:42 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

On Monday 19 January 2009 19:33:27 Nick Piggin wrote:
> On Monday 19 January 2009 19:05:03 Pekka Enberg wrote:

> > All I am saying is that I don't like how we're fixing a performance bug
> > with a shiny new allocator without a credible explanation why the
> > current approach is not fixable.
>
> To be honest, my biggest concern with SLUB is the higher order pages
> thing. But Christoph always poo poos me when I raise that concern, and
> it's hard to get concrete numbers showing real fragmentation problems
> when it can take days or months to start biting.
>
> It really stems from queueing versus not queueing I guess. And I think
> SLUB is flawed due to its avoidance of queueing.

And FWIW, Christoph was also not able to fix the OLTP problem although
I think it has been known for nearly two years now (I remember we
talked about it at 2007 KS, although I wasn't following slab development
very keenly back then).

At this point I feel spending time working on SLUB isn't a good idea if
a) Christoph himself hadn't fixed this problem; and b) we disagree about
fundamental design choices (see the "SLQB slab allocator" thread).

Anyway, nobody has disagreed with my proposal to merge SLQB, so in the
worst case I don't think it will cause too much harm, and in the best
case it might turn out to make the best tradeoffs and who knows, it
might actually not be catastrophic for HPC ;)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-19  8:42                               ` Nick Piggin
@ 2009-01-19  8:47                                 ` Pekka Enberg
  2009-01-19  8:57                                   ` Nick Piggin
  0 siblings, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-01-19  8:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

On Mon, Jan 19, 2009 at 10:42 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Anyway, nobody has disagreed with my proposal to merge SLQB, so in the
> worst case I don't think it will cause too much harm, and in the best
> case it might turn out to make the best tradeoffs and who knows, it
> might actually not be catastrophic for HPC ;)

Yeah. If Andrew/Linus doesn't want to merge SLQB to 2.6.29, we can
stick it in linux-next through slab.git if you want.

                                Pekka

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-19  8:47                                 ` Pekka Enberg
@ 2009-01-19  8:57                                   ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2009-01-19  8:57 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

On Monday 19 January 2009 19:47:24 Pekka Enberg wrote:
> On Mon, Jan 19, 2009 at 10:42 AM, Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > Anyway, nobody has disagreed with my proposal to merge SLQB, so in the
> > worst case I don't think it will cause too much harm, and in the best
> > case it might turn out to make the best tradeoffs and who knows, it
> > might actually not be catastrophic for HPC ;)
>
> Yeah. If Andrew/Linus doesn't want to merge SLQB to 2.6.29, we can

I would prefer not. Apart from not practicing what I preach about
merging, if it has stupid bugs on some systems or obvious performance
problems, it will not be a good start ;)

> stick it in linux-next through slab.git if you want.

That would be appreciated. It's not quite ready yet...

Thanks.
Nick


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-19  8:33                             ` Nick Piggin
  2009-01-19  8:42                               ` Nick Piggin
@ 2009-01-19  9:48                               ` Pekka Enberg
  2009-01-19 10:03                                 ` Nick Piggin
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-01-19  9:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

Hi Nick,

On Mon, Jan 19, 2009 at 10:33 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>> All I am saying is that I don't like how we're fixing a performance bug
>> with a shiny new allocator without a credible explanation why the
>> current approach is not fixable.
>
> To be honest, my biggest concern with SLUB is the higher order pages
> thing. But Christoph always poo poos me when I raise that concern, and
> it's hard to get concrete numbers showing real fragmentation problems
> when it can take days or months to start biting.

To be fair to SLUB, we do have the pending slab defragmentation
patches in my tree. Not that we have any numbers on whether defragmentation
helps and how much. IIRC, Christoph said one of the reasons for
avoiding queues in SLUB is to be able to do defragmentation. But I
suppose with SLQB we can do the same thing as long as we flush the
queues before attempting to defrag.

                                Pekka

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-19  9:48                               ` Pekka Enberg
@ 2009-01-19 10:03                                 ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2009-01-19 10:03 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, Andrew Vasquez, Anirban Chakraborty,
	Christoph Lameter

On Monday 19 January 2009 20:48:52 Pekka Enberg wrote:
> Hi Nick,
>
> On Mon, Jan 19, 2009 at 10:33 AM, Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> >> All I am saying is that I don't like how we're fixing a performance bug
> >> with a shiny new allocator without a credible explanation why the
> >> current approach is not fixable.
> >
> > To be honest, my biggest concern with SLUB is the higher order pages
> > thing. But Christoph always poo poos me when I raise that concern, and
> > it's hard to get concrete numbers showing real fragmentation problems
> > when it can take days or months to start biting.
>
> To be fair to SLUB, we do have the pending slab defragmentation
> patches in my tree. Not that we have any numbers on if defragmentation
> helps and how much. IIRC, Christoph said one of the reasons for
> avoiding queues in SLUB is to be able to do defragmentation. But I
> suppose with SLQB we can do the same thing as long as we flush the
> queues before attempting to defrag.

I have had a look at them (and I raised some concerns about races with
the bufferhead "defragmentation" patch which I didn't get a reply to,
but now's not the time to get into that).

Christoph's design AFAIKS is not impossible with queued slab allocators,
but they would just need either some kind of per-cpu processing, or at
least a way to flush queues of objects. This should not be impossible.

But in my reply, I also outlined an idea for a possibly better design for
targeted slab reclaim that could have fewer of the locking complexities in
other subsystems than the slub defrag patches do. I plan to look at this
at some point, but I think we need to sort out the basics first.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* RE: Mainline kernel OLTP performance update
  2009-01-15  7:11             ` Ma, Chinang
@ 2009-01-19 18:04               ` Chris Mason
  -1 siblings, 0 replies; 105+ messages in thread
From: Chris Mason @ 2009-01-19 18:04 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Steven Rostedt, Andrew Morton, Matthew Wilcox, Wilcox, Matthew R,
	linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha,
	Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang,
	Peter Xihong, Nueckel, Hubert, linux-scsi, Andrew Vasquez,
	Anirban Chakraborty, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Gregory Haskins

On Thu, 2009-01-15 at 00:11 -0700, Ma, Chinang wrote:
> >> > > > >
> >> > > > > Linux OLTP Performance summary
> >> > > > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%
> >iowait%
> >> > > > > 2.6.24.2                1.000   21969   43425   76   24     0
> >0
> >> > > > > 2.6.27.2                0.973   30402   43523   74   25     0
> >1
> >> > > > > 2.6.29-rc1              0.965   30331   41970   74   26     0
> >0
> >> >
> >> > > But the interrupt rate went through the roof.
> >> >
> >> > Yes.  I forget why that was; I'll have to dig through my archives for
> >> > that.
> >>
> >> Oh.  I'd have thought that this alone could account for 3.5%.

A later email indicated the reschedule interrupt count doubled since
2.6.24, and so I poked around a bit at the causes of resched_task.

I think the -rt version of check_preempt_equal_prio has gotten much more
expensive since 2.6.24.

I'm sure these changes were made for good reasons, and this workload may
not be a good reason to change it back.  But, what does the patch below
do to performance on 2.6.29-rcX?

-chris

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 954e1a8..bbe3492 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq,
struct task_struct *p, int sync
 		resched_task(rq->curr);
 		return;
 	}
+	return;
 
 #ifdef CONFIG_SMP
 	/*





^ permalink raw reply related	[flat|nested] 105+ messages in thread

* RE: Mainline kernel OLTP performance update
@ 2009-01-19 18:04               ` Chris Mason
  0 siblings, 0 replies; 105+ messages in thread
From: Chris Mason @ 2009-01-19 18:04 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Steven Rostedt, Andrew Morton, Matthew Wilcox, Wilcox, Matthew R,
	linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha,
	Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang,
	Peter Xihong, Nueckel, Hubert, linux-scsi, Andrew Vasquez,
	Anirban Chakraborty, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra

On Thu, 2009-01-15 at 00:11 -0700, Ma, Chinang wrote:
> >> > > > >
> >> > > > > Linux OLTP Performance summary
> >> > > > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%
> >iowait%
> >> > > > > 2.6.24.2                1.000   21969   43425   76   24     0
> >0
> >> > > > > 2.6.27.2                0.973   30402   43523   74   25     0
> >1
> >> > > > > 2.6.29-rc1              0.965   30331   41970   74   26     0
> >0
> >> >
> >> > > But the interrupt rate went through the roof.
> >> >
> >> > Yes.  I forget why that was; I'll have to dig through my archives for
> >> > that.
> >>
> >> Oh.  I'd have thought that this alone could account for 3.5%.

A later email indicated the reschedule interrupt count doubled since
2.6.24, and so I poked around a bit at the causes of resched_task.

I think the -rt version of check_preempt_equal_prio has gotten much more
expensive since 2.6.24.

I'm sure these changes were made for good reasons, and this workload may
not be a good reason to change it back.  But, what does the patch below
do to performance on 2.6.29-rcX?

-chris

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 954e1a8..bbe3492 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq,
struct task_struct *p, int sync
 		resched_task(rq->curr);
 		return;
 	}
+	return;
 
 #ifdef CONFIG_SMP
 	/*





^ permalink raw reply related	[flat|nested] 105+ messages in thread

* RE: Mainline kernel OLTP performance update
  2009-01-19 18:04               ` Chris Mason
  (?)
  (?)
@ 2009-01-19 18:37               ` Steven Rostedt
  2009-01-19 18:55                   ` Chris Mason
  2009-01-19 23:40                   ` Ingo Molnar
  -1 siblings, 2 replies; 105+ messages in thread
From: Steven Rostedt @ 2009-01-19 18:37 UTC (permalink / raw)
  To: Chris Mason
  Cc: Ma, Chinang, Andrew Morton, Matthew Wilcox, Wilcox, Matthew R,
	linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha,
	Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang,
	Peter Xihong, Nueckel, Hubert, linux-scsi, Andrew Vasquez,
	Anirban Chakraborty, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Gregory Haskins, Rusty Russell

(added Rusty)

On Mon, 2009-01-19 at 13:04 -0500, Chris Mason wrote:
> On Thu, 2009-01-15 at 00:11 -0700, Ma, Chinang wrote:
> > >> > > > >
> > >> > > > > Linux OLTP Performance summary
> > >> > > > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%
> > >iowait%
> > >> > > > > 2.6.24.2                1.000   21969   43425   76   24     0
> > >0
> > >> > > > > 2.6.27.2                0.973   30402   43523   74   25     0
> > >1
> > >> > > > > 2.6.29-rc1              0.965   30331   41970   74   26     0
> > >0
> > >> >
> > >> > > But the interrupt rate went through the roof.
> > >> >
> > >> > Yes.  I forget why that was; I'll have to dig through my archives for
> > >> > that.
> > >>
> > >> Oh.  I'd have thought that this alone could account for 3.5%.
> 
> A later email indicated the reschedule interrupt count doubled since
> 2.6.24, and so I poked around a bit at the causes of resched_task.
> 
> I think the -rt version of check_preempt_equal_prio has gotten much more
> expensive since 2.6.24.
> 
> I'm sure these changes were made for good reasons, and this workload may
> not be a good reason to change it back.  But, what does the patch below
> do to performance on 2.6.29-rcX?
> 
> -chris
> 
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 954e1a8..bbe3492 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq,
> struct task_struct *p, int sync
>  		resched_task(rq->curr);
>  		return;
>  	}
> +	return;
>  
>  #ifdef CONFIG_SMP
>  	/*

That should not cause much of a problem if the scheduling task is not
pinned to a CPU. But!!!!!

A recent change makes it expensive:

commit 24600ce89a819a8f2fb4fd69fd777218a82ade20
Author: Rusty Russell <rusty@rustcorp.com.au>
Date:   Tue Nov 25 02:35:13 2008 +1030

    sched: convert check_preempt_equal_prio to cpumask_var_t.
    
    Impact: stack reduction for large NR_CPUS



which has:

 static void check_preempt_equal_prio(struct rq *rq, struct task_struct
*p)
 {
-       cpumask_t mask;
+       cpumask_var_t mask;
 
        if (rq->curr->rt.nr_cpus_allowed == 1)
                return;
 
-       if (p->rt.nr_cpus_allowed != 1
-           && cpupri_find(&rq->rd->cpupri, p, &mask))
+       if (!alloc_cpumask_var(&mask, GFP_ATOMIC))
                return;




check_preempt_equal_prio is in a scheduling hot path!!!!!

WTF are we allocating there for?

-- Steve




^ permalink raw reply	[flat|nested] 105+ messages in thread

* RE: Mainline kernel OLTP performance update
  2009-01-19 18:37               ` Steven Rostedt
@ 2009-01-19 18:55                   ` Chris Mason
  2009-01-19 23:40                   ` Ingo Molnar
  1 sibling, 0 replies; 105+ messages in thread
From: Chris Mason @ 2009-01-19 18:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ma, Chinang, Andrew Morton, Matthew Wilcox, Wilcox, Matthew R,
	linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha,
	Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang,
	Peter Xihong, Nueckel, Hubert, linux-scsi, Andrew Vasquez,
	Anirban Chakraborty, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Gregory Haskins, Rusty Russell

On Mon, 2009-01-19 at 13:37 -0500, Steven Rostedt wrote:
> (added Rusty)
> 
> On Mon, 2009-01-19 at 13:04 -0500, Chris Mason wrote:
> > 
> > I think the -rt version of check_preempt_equal_prio has gotten much more
> > expensive since 2.6.24.
> > 
> > I'm sure these changes were made for good reasons, and this workload may
> > not be a good reason to change it back.  But, what does the patch below
> > do to performance on 2.6.29-rcX?
> > 
> > -chris
> > 
> > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> > index 954e1a8..bbe3492 100644
> > --- a/kernel/sched_rt.c
> > +++ b/kernel/sched_rt.c
> > @@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq,
> > struct task_struct *p, int sync
> >  		resched_task(rq->curr);
> >  		return;
> >  	}
> > +	return;
> >  
> >  #ifdef CONFIG_SMP
> >  	/*
> 
> That should not cause much of a problem if the scheduling task is not
> pinned to an CPU. But!!!!!
> 
> A recent change makes it expensive:


> +       if (!alloc_cpumask_var(&mask, GFP_ATOMIC))
>                 return;

> check_preempt_equal_prio is in a scheduling hot path!!!!!
> 
> WTF are we allocating there for?

I wasn't actually looking at the cost of the checks, even though they do
look higher (if they are using CONFIG_CPUMASK_OFFSTACK anyway).

The 2.6.24 code would trigger a rescheduling interrupt only when the
prio of the inbound task was higher than the running task.

This workload has a large number of equal priority rt tasks that are not
bound to a single CPU, and so I think it should trigger more
preempts/reschedules with today's check_preempt_equal_prio().
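
To put the difference another way, here is a toy model (not the actual
kernel code, just the shape of the check; a lower prio value means a
higher RT priority):

struct task { int prio; int nr_cpus_allowed; };

/* 2.6.24-style check: only a strictly higher-priority waker sends the IPI */
static int should_resched_old(const struct task *curr, const struct task *p)
{
	return p->prio < curr->prio;
}

/*
 * Current behaviour: an equal-priority waker can also trigger a resched
 * when both tasks are allowed to run on other CPUs, so the push logic can
 * move one of them.  With lots of equal-priority, unbound RT tasks that
 * fires far more often, hence the extra reschedule interrupts.
 */
static int should_resched_new(const struct task *curr, const struct task *p)
{
	if (p->prio < curr->prio)
		return 1;
	return p->prio == curr->prio &&
	       curr->nr_cpus_allowed > 1 && p->nr_cpus_allowed > 1;
}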

-chris



^ permalink raw reply	[flat|nested] 105+ messages in thread

* RE: Mainline kernel OLTP performance update
  2009-01-19 18:55                   ` Chris Mason
@ 2009-01-19 19:07                     ` Steven Rostedt
  -1 siblings, 0 replies; 105+ messages in thread
From: Steven Rostedt @ 2009-01-19 19:07 UTC (permalink / raw)
  To: Chris Mason
  Cc: Ma, Chinang, Andrew Morton, Matthew Wilcox, Wilcox, Matthew R,
	linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha,
	Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang,
	Peter Xihong, Nueckel, Hubert, linux-scsi, Andrew Vasquez,
	Anirban Chakraborty, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Gregory Haskins, Rusty Russell


On Mon, 2009-01-19 at 13:55 -0500, Chris Mason wrote:

> I wasn't actually looking at the cost of the checks, even though they do
> look higher (if they are using CONFIG_CPUMASK_OFFSTACK anyway).
> 
> The 2.6.24 code would trigger a rescheduling interrupt only when the
> prio of the inbound task was higher than the running task.
> 
> This workload has a large number of equal priority rt tasks that are not
> bound to a single CPU, and so I think it should trigger more
> preempts/reschedules with the today's check_preempt_equal_prio().

Ah yeah. This is one of the things that shows RT being more "responsive"
but less on performance. An RT task wants to run ASAP even if that means
there's a chance of more interrupts and higher cache misses.

The old way would be much faster in general throughput, but I measured
RT tasks taking up to tens of milliseconds to get scheduled. This is
unacceptable for an RT task.

-- Steve



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-19  7:43                     ` Nick Piggin
@ 2009-01-19 22:19                       ` Rick Jones
  0 siblings, 0 replies; 105+ messages in thread
From: Rick Jones @ 2009-01-19 22:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty

>>>System is a 2socket, 4 core AMD.
>>
>>Not exactly a large system :)  Barely NUMA even with just two sockets.
> 
> 
> You're right ;)
> 
> But at least it is exercising the NUMA paths in the allocator, and
> represents a pretty common size of system...
> 
> I can run some tests on bigger systems at SUSE, but it is not always
> easy to set up "real" meaningful workloads on them or configure
> significant IO for them.

Not sure if I know enough git to pull your trees, or if this cobbler's child will 
have much in the way of bigger systems, but there is a chance I might - contact 
me offline with some pointers on how to pull and build the bits and such.

>>>Netperf UDP unidirectional send test (10 runs, higher better):
>>>
>>>Server and client bound to same CPU
>>>SLAB AVG=60.111 STD=1.59382
>>>SLQB AVG=60.167 STD=0.685347
>>>SLUB AVG=58.277 STD=0.788328
>>>
>>>Server and client bound to same socket, different CPUs
>>>SLAB AVG=85.938 STD=0.875794
>>>SLQB AVG=93.662 STD=2.07434
>>>SLUB AVG=81.983 STD=0.864362
>>>
>>>Server and client bound to different sockets
>>>SLAB AVG=78.801 STD=1.44118
>>>SLQB AVG=78.269 STD=1.10457
>>>SLUB AVG=71.334 STD=1.16809
>>>
>>
>> > ...
>>
>>>I haven't done any non-local network tests. Networking is the one of the
>>>subsystems most heavily dependent on slab performance, so if anybody
>>>cares to run their favourite tests, that would be really helpful.
>>
>>I'm guessing, but then are these Mbit/s figures? Would that be the sending
>>throughput or the receiving throughput?
> 
> 
> Yes, Mbit/s. They were... hmm, sending throughput I think, but each pair
> of numbers seemed to be identical IIRC?

Mega *bits* per second?  And those were 4K sends, right?  That seems rather low
for loopback - I would have expected nearly two orders of magnitude more.  I
wonder if the intra-stack flow control kicked in?  You might try adding
test-specific -S and -s options to set much larger socket buffers to try to
avoid that.  Or simply use TCP.

netperf -H <foo> ... -- -s 1M -S 1M -m 4K

>>I love to see netperf used, but why UDP and loopback?
> 
> 
> No really good reason. I guess I was hoping to keep other variables as
> small as possible. But I guess a real remote test would be a lot more
> realistic as a networking test. Hmm, but I could probably set up a test
> over a simple GbE link here.  I'll try that.

If bandwidth is an issue, that is to say one saturates the link before much of
anything "interesting" happens in the host, you can use something like aggregate
TCP_RR - ./configure with --enable_burst and then something like

netperf -H <remote> -t TCP_RR -- -D -b 32

and it will have as many as 33 discrete transactions in flight at one time on the 
one connection.  The -D is there to set TCP_NODELAY to preclude TCP chunking the 
single-byte (default, take your pick of a more reasonable size) transactions into 
one segment.

>>Also, how about the service demands?
> 
> 
> Well, over loopback and using CPU binding, I was hoping it wouldn't
> change much... 

Hope... but verify :)

> but I see netperf does some measurements for you. I
> will consider those in future too.
> 
> BTW. is it possible to do parallel netperf tests?

Yes, by (ab)using the confidence intervals code.  Poke around in 
http://www.netperf.org/svn/netperf2/doc/netperf.html in the "Aggregates" section, 
and I can go into further details offline (or here if folks want to see the 
discussion).

rick jones

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-19 18:37               ` Steven Rostedt
@ 2009-01-19 23:40                   ` Ingo Molnar
  2009-01-19 23:40                   ` Ingo Molnar
  1 sibling, 0 replies; 105+ messages in thread
From: Ingo Molnar @ 2009-01-19 23:40 UTC (permalink / raw)
  To: Steven Rostedt, Mike Travis, Rusty Russell
  Cc: Chris Mason, Ma, Chinang, Andrew Morton, Matthew Wilcox, Wilcox,
	Matthew R, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi,
	Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang,
	Peter Xihong, Nueckel, Hubert, linux-scsi, Andrew Vasquez,
	Anirban Chakraborty, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Gregory Haskins, Rusty Russell


* Steven Rostedt <srostedt@redhat.com> wrote:

> (added Rusty)
> 
> On Mon, 2009-01-19 at 13:04 -0500, Chris Mason wrote:
> > On Thu, 2009-01-15 at 00:11 -0700, Ma, Chinang wrote:
> > > >> > > > >
> > > >> > > > > Linux OLTP Performance summary
> > > >> > > > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%
> > > >iowait%
> > > >> > > > > 2.6.24.2                1.000   21969   43425   76   24     0
> > > >0
> > > >> > > > > 2.6.27.2                0.973   30402   43523   74   25     0
> > > >1
> > > >> > > > > 2.6.29-rc1              0.965   30331   41970   74   26     0
> > > >0
> > > >> >
> > > >> > > But the interrupt rate went through the roof.
> > > >> >
> > > >> > Yes.  I forget why that was; I'll have to dig through my archives for
> > > >> > that.
> > > >>
> > > >> Oh.  I'd have thought that this alone could account for 3.5%.
> > 
> > A later email indicated the reschedule interrupt count doubled since
> > 2.6.24, and so I poked around a bit at the causes of resched_task.
> > 
> > I think the -rt version of check_preempt_equal_prio has gotten much more
> > expensive since 2.6.24.
> > 
> > I'm sure these changes were made for good reasons, and this workload may
> > not be a good reason to change it back.  But, what does the patch below
> > do to performance on 2.6.29-rcX?
> > 
> > -chris
> > 
> > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> > index 954e1a8..bbe3492 100644
> > --- a/kernel/sched_rt.c
> > +++ b/kernel/sched_rt.c
> > @@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq,
> > struct task_struct *p, int sync
> >  		resched_task(rq->curr);
> >  		return;
> >  	}
> > +	return;
> >  
> >  #ifdef CONFIG_SMP
> >  	/*
> 
> That should not cause much of a problem if the scheduling task is not
> pinned to an CPU. But!!!!!
> 
> A recent change makes it expensive:
> 
> commit 24600ce89a819a8f2fb4fd69fd777218a82ade20
> Author: Rusty Russell <rusty@rustcorp.com.au>
> Date:   Tue Nov 25 02:35:13 2008 +1030
> 
>     sched: convert check_preempt_equal_prio to cpumask_var_t.
>     
>     Impact: stack reduction for large NR_CPUS
> 
> 
> 
> which has:
> 
>  static void check_preempt_equal_prio(struct rq *rq, struct task_struct
> *p)
>  {
> -       cpumask_t mask;
> +       cpumask_var_t mask;
>  
>         if (rq->curr->rt.nr_cpus_allowed == 1)
>                 return;
>  
> -       if (p->rt.nr_cpus_allowed != 1
> -           && cpupri_find(&rq->rd->cpupri, p, &mask))
> +       if (!alloc_cpumask_var(&mask, GFP_ATOMIC))
>                 return;
> 
> 
> 
> 
> check_preempt_equal_prio is in a scheduling hot path!!!!!
> 
> WTF are we allocating there for?

Agreed - this needs to be fixed. Since this runs under the runqueue lock 
we can have a temporary cpumask in the runqueue itself, not on the stack.
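
Something along these lines, as a sketch only (the struct and field names
below are made up for illustration, this is not a real patch):

#include <string.h>

#define NR_CPUS 4096

typedef struct {
	unsigned long bits[NR_CPUS / (8 * sizeof(unsigned long))];
} cpumask_t;

struct rq {
	/* ... existing runqueue fields ... */
	cpumask_t scratch_mask;		/* protected by rq->lock */
};

/* hot path: reuse the per-runqueue mask, no GFP_ATOMIC allocation */
static void check_preempt_equal_prio_sketch(struct rq *rq)
{
	cpumask_t *mask = &rq->scratch_mask;

	memset(mask, 0, sizeof(*mask));
	/* ... cpupri_find(..., mask), pick a target CPU, resched ... */
}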

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16 10:20                       ` Andi Kleen
@ 2009-01-20  5:16                         ` Zhang, Yanmin
  2009-01-21 23:58                           ` Christoph Lameter
  0 siblings, 1 reply; 105+ messages in thread
From: Zhang, Yanmin @ 2009-01-20  5:16 UTC (permalink / raw)
  To: Andi Kleen, Christoph Lameter, Pekka Enberg
  Cc: Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr,
	matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi,
	arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty

On Fri, 2009-01-16 at 11:20 +0100, Andi Kleen wrote:
> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes:
> 
> 
> > I think that's because SLQB
> > doesn't pass through big object allocation to page allocator.
> > netperf UDP-U-1k has less improvement with SLQB.
> 
> That sounds like just the page allocator needs to be improved.
> That would help everyone. We talked a bit about this earlier,
> some of the heuristics for hot/cold pages are quite outdated
> and have been tuned for obsolete machines and also its fast path
> is quite long. Unfortunately no code currently.
Andi,

Thanks for your kind information. I did more investigation with SLUB
into the netperf UDP-U-4k issue.

oprofile shows:
328058   30.1342  linux-2.6.29-rc2         copy_user_generic_string
134666   12.3699  linux-2.6.29-rc2         __free_pages_ok
125447   11.5231  linux-2.6.29-rc2         get_page_from_freelist
22611     2.0770  linux-2.6.29-rc2         __sk_mem_reclaim
21442     1.9696  linux-2.6.29-rc2         list_del
21187     1.9462  linux-2.6.29-rc2         __ip_route_output_key

So __free_pages_ok and get_page_from_freelist consume too much cpu time.
With SLQB, these 2 functions consume almost no time.

Command 'slabinfo -AD' shows:
Name                   Objects    Alloc     Free   %Fast
:0000256                  1685 29611065 29609548  99  99
:0000168                  2987   164689   161859  94  39
:0004096                  1471   114918   113490  99  97

So kmem_cache :0000256 is very active.

Kernel stack dump in __free_pages_ok shows
 [<ffffffff8027010f>] __free_pages_ok+0x109/0x2e0
 [<ffffffff8024bb34>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8060f387>] __kfree_skb+0x9/0x6f
 [<ffffffff8061204b>] skb_free_datagram+0xc/0x31
 [<ffffffff8064b528>] udp_recvmsg+0x1e7/0x26f
 [<ffffffff8060b509>] sock_common_recvmsg+0x30/0x45
 [<ffffffff80609acd>] sock_recvmsg+0xd5/0xed

The callchain is:
__kfree_skb =>
	kfree_skbmem =>
		kmem_cache_free(skbuff_head_cache, skb);

kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
with :0000256. Their order is 1 which means every slab consists of 2 physical pages.

netperf UDP-U-4k is a UDP stream test. The client process keeps sending 4k-size packets
to the server process, and the server process just receives the packets one by one.

If we start CPU_NUM clients and the same number of servers, every client sends lots
of packets within one sched slice, then the process scheduler schedules the server to receive
many packets within one sched slice; then the client resends again. So there are many packets
in the queue. When the server receives the packets, it frees the skbuff_head_cache objects. When all of
a slab's objects are free, the slab is released by calling __free_pages. Such batch
sending/receiving creates lots of slab free activity.

The page allocator has an array at zone_pcp(zone, cpu)->pcp that keeps a buffer of order-0 pages.
But here skbuff_head_cache's order is 1, so UDP-U-4k can't benefit from that page buffer.
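
As a toy illustration of that point (made-up names, not the real mm/page_alloc.c
code): only order-0 frees can go to the cheap per-cpu list, anything bigger always
takes the locked buddy path.

#define TOY_PCP_BATCH 64

struct toy_page { struct toy_page *next; };

struct toy_pcp {
	struct toy_page *list;	/* cpu-local cache of order-0 pages */
	int count;
};

static void toy_free_to_buddy(struct toy_page *page, unsigned int order)
{
	/* slow path: zone lock plus buddy coalescing, what __free_pages_ok does */
	(void)page;
	(void)order;
}

static void toy_free_page(struct toy_pcp *pcp, struct toy_page *page,
			  unsigned int order)
{
	if (order == 0 && pcp->count < TOY_PCP_BATCH) {
		page->next = pcp->list;	/* cheap, no zone lock */
		pcp->list = page;
		pcp->count++;
	} else {
		toy_free_to_buddy(page, order);
	}
}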

SLQB has no such issue, because:
1) SLQB has a percpu freelist. Free objects are put on the list first and can be picked up
again quickly without taking a lock. The batch parameter that controls free object reclaim is mostly
1024.
2) SLQB's slab order is mostly 0, so even though it sometimes calls alloc_pages/free_pages, it can
benefit from the zone_pcp(zone, cpu)->pcp page buffer.

So SLUB needs to resolve the case where one process allocates a batch of objects and another process
frees them in a batch.
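
A toy model of that free-side behaviour (this is not SLQB code; FREE_BATCH
and flush_to_slab() are made-up names): frees land on a cpu-local list with
no locking, and only when the list passes the batch threshold are the
objects handed back in bulk.

#define FREE_BATCH 1024

struct object { struct object *next; };

struct percpu_freelist {
	struct object *head;	/* cpu-local, so no locking needed */
	unsigned int nr;
};

static void flush_to_slab(struct percpu_freelist *fl)
{
	/* slow path: hand the whole list back to the shared slab under a lock */
	fl->head = 0;
	fl->nr = 0;
}

static void local_free(struct percpu_freelist *fl, struct object *obj)
{
	obj->next = fl->head;
	fl->head = obj;
	if (++fl->nr >= FREE_BATCH)
		flush_to_slab(fl);	/* one lock per ~1024 frees, not per free */
}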

yanmin



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-16 18:14                       ` Gregory Haskins
  2009-01-16 19:09                         ` Steven Rostedt
@ 2009-01-20 12:45                         ` Gregory Haskins
  1 sibling, 0 replies; 105+ messages in thread
From: Gregory Haskins @ 2009-01-20 12:45 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Wilcox, Matthew R, Steven Rostedt, Matthew Wilcox, Andrew Morton,
	James Bottomley, linux-kernel, Tripathi, Sharad C, arjan, Kleen,
	Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W,
	Wang, Peter Xihong, Nueckel, Hubert, chris.mason, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty

[-- Attachment #1: Type: text/plain, Size: 1535 bytes --]

Gregory Haskins wrote:
>
> Then, email the contents of /sys/kernel/debug/tracing/trace to me
>
>
>   

[ Chinang has performed the trace as requested, but replied with a
reduced CC to avoid spamming people with a large file.  This is
restoring the original list]


Ma, Chinang wrote:
> Hi Gregory,
> Trace in attachment. I trimmed down the distribution list, as the attachment is quite big.
>
> Thanks,
> -Chinang
>   
Hi Chinang,

  Thank you very much for taking the time to do this.  I have analyzed
the trace: I do not see any smoking gun w.r.t. the theory that we are
over IPI'ing the system.  There were holes in the data due to trace
limitations that rendered some of the data inconclusive.  However, the
places where we did not run into trace limitations looked like
everything was functioning as designed.

That being said, I do see that you have a ton of prio 48(ish) threads
that are over-straining the RT push logic.  The interesting thing is that I
recently pushed some patches to -tip that have the potential to help you
here.  Could you try your test using the sched/rt branch from -tip?
Here is a clone link, for your convenience:

git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip.git sched/rt

For this run, do _not_ use the trace patch/config.  I just want to see
if you observe performance improvements with OLTP configured for RT prio
when compared to historic rt-push/pull based kernels (including HEAD on
linus.git, as tested in the last run).

Thanks!
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-15  2:47           ` Matthew Wilcox
  2009-01-15  3:36             ` Andi Kleen
@ 2009-01-20 13:27             ` Jens Axboe
       [not found]               ` <588992150B702C48B3312184F1B810AD03A497632C@azsmsx501.amr.corp.intel.com>
  1 sibling, 1 reply; 105+ messages in thread
From: Jens Axboe @ 2009-01-20 13:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andi Kleen, Andrew Morton, Wilcox, Matthew R, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	Andrew Vasquez, Anirban Chakraborty

On Wed, Jan 14 2009, Matthew Wilcox wrote:
> On Thu, Jan 15, 2009 at 03:39:05AM +0100, Andi Kleen wrote:
> > Andrew Morton <akpm@linux-foundation.org> writes:
> > >>    some of that back, but not as much as taking them out (even when
> > >>    the sysctl'd variable is in a __read_mostly section).  We tried a
> > >>    patch from Jens to speed up the search for a new partition, but it
> > >>    had no effect.
> > >
> > > I find this surprising.
> > 
> > The test system has thousands of disks/LUNs which it writes to
> > all the time, in addition to a workload which is a real cache pig. 
> > So any increase in the per LUN overhead directly leads to a lot
> > more cache misses in the kernel because it increases the working set
> > there significantly.
> 
> This particular system has 450 spindles, but they're amalgamated into
> 30 logical volumes by the hardware or firmware.  Linux sees 30 LUNs.
> Each one, though, has fifteen partitions on it, so that brings us back
> up to 450 partitions.
> 
> This system, btw, is a scale model of the full system that would be used
> to get published results.  If I remember correctly, a 1% performance
> regression on this system is likely to translate to a 2% regression on
> the full-scale system.

Matthew, let's see if we can get this a little closer to disappearing. I
don't see lookup problems in the current kernel with the one-hit cache,
but perhaps it's not getting enough hits in this bigger test case, or
perhaps it's simply the rcu locking and preempt disables building up
enough to cause a slowdown.

First things first, can you get a run of 2.6.29-rc2 with this patch?
It'll enable you to turn off per-partition stats in sysfs. I'd suggest
doing a run with 2.6.29-rc2 booted with this patch, and then another
run with part_stats set to 0 for every exposed spindle. Then post those
profiles!

diff --git a/block/blk-core.c b/block/blk-core.c
index a824e49..6f693ae 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -600,7 +600,8 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 	q->prep_rq_fn		= NULL;
 	q->unplug_fn		= generic_unplug_device;
 	q->queue_flags		= (1 << QUEUE_FLAG_CLUSTER |
-				   1 << QUEUE_FLAG_STACKABLE);
+				   1 << QUEUE_FLAG_STACKABLE |
+				   1 << QUEUE_FLAG_PART_STAT);
 	q->queue_lock		= lock;
 
 	blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a29cb78..a6ec2e3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -158,6 +158,29 @@ static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
 	return queue_var_show(set != 0, page);
 }
 
+static ssize_t queue_part_stat_store(struct request_queue *q, const char *page,
+				     size_t count)
+{
+	unsigned long nm;
+	ssize_t ret = queue_var_store(&nm, page, count);
+
+	spin_lock_irq(q->queue_lock);
+	if (nm)
+		queue_flag_set(QUEUE_FLAG_PART_STAT, q);
+	else
+		queue_flag_clear(QUEUE_FLAG_PART_STAT, q);
+
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+
+static ssize_t queue_part_stat_show(struct request_queue *q, char *page)
+{
+	unsigned int set = test_bit(QUEUE_FLAG_PART_STAT, &q->queue_flags);
+
+	return queue_var_show(set != 0, page);
+}
+
 static ssize_t
 queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -222,6 +245,12 @@ static struct queue_sysfs_entry queue_rq_affinity_entry = {
 	.store = queue_rq_affinity_store,
 };
 
+static struct queue_sysfs_entry queue_part_stat_entry = {
+	.attr = {.name = "part_stats", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_part_stat_show,
+	.store = queue_part_stat_store,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -231,6 +260,7 @@ static struct attribute *default_attrs[] = {
 	&queue_hw_sector_size_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
+	&queue_part_stat_entry.attr,
 	NULL,
 };
 
diff --git a/block/genhd.c b/block/genhd.c
index 397960c..09cbac2 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -208,6 +208,9 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
 	struct hd_struct *part;
 	int i;
 
+	if (!blk_queue_part_stat(disk->queue))
+		goto part0;
+
 	ptbl = rcu_dereference(disk->part_tbl);
 
 	part = rcu_dereference(ptbl->last_lookup);
@@ -222,6 +225,7 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
 			return part;
 		}
 	}
+part0:
 	return &disk->part0;
 }
 EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 044467e..4d45842 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -449,6 +449,7 @@ struct request_queue
 #define QUEUE_FLAG_STACKABLE   13	/* supports request stacking */
 #define QUEUE_FLAG_NONROT      14	/* non-rotational device (SSD) */
 #define QUEUE_FLAG_VIRT        QUEUE_FLAG_NONROT /* paravirt device */
+#define QUEUE_FLAG_PART_STAT   15	/* per-partition stats enabled */
 
 static inline int queue_is_locked(struct request_queue *q)
 {
@@ -568,6 +569,8 @@ enum {
 #define blk_queue_flushing(q)	((q)->ordseq)
 #define blk_queue_stackable(q)	\
 	test_bit(QUEUE_FLAG_STACKABLE, &(q)->queue_flags)
+#define blk_queue_part_stat(q)	\
+	test_bit(QUEUE_FLAG_PART_STAT, &(q)->queue_flags)
 
 #define blk_fs_request(rq)	((rq)->cmd_type == REQ_TYPE_FS)
 #define blk_pc_request(rq)	((rq)->cmd_type == REQ_TYPE_BLOCK_PC)

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-20  5:16                         ` Zhang, Yanmin
@ 2009-01-21 23:58                           ` Christoph Lameter
  2009-01-22  8:36                             ` Zhang, Yanmin
  0 siblings, 1 reply; 105+ messages in thread
From: Christoph Lameter @ 2009-01-21 23:58 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andi Kleen, Pekka Enberg, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1708 bytes --]

On Tue, 20 Jan 2009, Zhang, Yanmin wrote:

> kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> with :0000256. Their order is 1 which means every slab consists of 2 physical pages.

That order can be changed. Try specifying slub_max_order=0 on the kernel
command line to force an order 0 alloc.

The queues of the page allocator are of limited use due to their overhead.
Order-1 allocations can actually be 5% faster than order-0. Order-0 makes
sense if pages are pushed rapidly to the page allocator and are then
reissued elsewhere. If consumption is linear, then the page allocator
queues are just overhead.

> Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0.
> But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer.

That usually does not matter because the partial lists avoid page
allocator actions.

> SLQB has no such issue, because:
> 1) SLQB has a percpu freelist. Free objects are put to the list firstly and can be picked up
> later on quickly without lock. A batch parameter to control the free object recollection is mostly
> 1024.
> 2) SLQB slab order mostly is 0, so although sometimes it calls alloc_pages/free_pages, it can
> benefit from zone_pcp(zone, cpu)->pcp page buffer.
>
> So SLUB need resolve such issues that one process allocates a batch of objects and another process
> frees them batchly.

SLUB has a percpu freelist but it's bounded by the basic allocation unit.
You can increase that by modifying the allocation order. Writing a 3 or 5
into the order value in /sys/kernel/slab/xxx/order would do the trick.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-21 23:58                           ` Christoph Lameter
@ 2009-01-22  8:36                             ` Zhang, Yanmin
  2009-01-22  9:15                                 ` Pekka Enberg
  0 siblings, 1 reply; 105+ messages in thread
From: Zhang, Yanmin @ 2009-01-22  8:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Pekka Enberg, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> 
> > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> 
> That order can be changed. Try specifying slub_max_order=0 on the kernel
> command line to force an order 0 alloc.
I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.

I checked my instrumentation in the kernel and found it's caused by large object allocation/free
whose size is more than PAGE_SIZE. Here the order is 1.

The right free callchain is __kfree_skb => skb_release_all => skb_release_data.

So this case isn't the issue where a batch of allocations/frees defeats the partial page
functionality.

'#slabinfo -AD' couldn't show statistics of large object allocation/free. Can we add
such info? That would be more helpful.

In addition, I didn't find such an issue with TCP stream testing.

> 
> The queues of the page allocator are of limited use due to their overhead.
> Order-1 allocations can actually be 5% faster than order-0. order-0 makes
> sense if pages are pushed rapidly to the page allocator and are then
> reissues elsewhere. If there is a linear consumption then the page
> allocator queues are just overhead.
> 
> > Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0.
> > But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer.
> 
> That usually does not matter because of partial list avoiding page
> allocator actions.

> 
> > SLQB has no such issue, because:
> > 1) SLQB has a percpu freelist. Free objects are put to the list firstly and can be picked up
> > later on quickly without lock. A batch parameter to control the free object recollection is mostly
> > 1024.
> > 2) SLQB slab order mostly is 0, so although sometimes it calls alloc_pages/free_pages, it can
> > benefit from zone_pcp(zone, cpu)->pcp page buffer.
> >
> > So SLUB need resolve such issues that one process allocates a batch of objects and another process
> > frees them batchly.
> 
> SLUB has a percpu freelist but its bounded by the basic allocation unit.
> You can increase that by modifying the allocation order. Writing a 3 or 5
> into the order value in /sys/kernel/slab/xxx/order would do the trick.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-22  8:36                             ` Zhang, Yanmin
@ 2009-01-22  9:15                                 ` Pekka Enberg
  0 siblings, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-01-22  9:15 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > 
> > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> > 
> > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > command line to force an order 0 alloc.
> I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
> 
> I checked my instrumentation in kernel and found it's caused by large object allocation/free
> whose size is more than PAGE_SIZE. Here its order is 1.
> 
> The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
> 
> So this case isn't the issue that batch of allocation/free might erase partial page
> functionality.

So is this the kfree(skb->head) in skb_release_data() or the put_page()
calls in the same function in a loop?

If it's the former, with big enough size passed to __alloc_skb(), the
networking code might be taking a hit from the SLUB page allocator
pass-through.

		Pekka


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-22  9:15                                 ` Pekka Enberg
  (?)
@ 2009-01-22  9:28                                 ` Zhang, Yanmin
  2009-01-22  9:47                                   ` Pekka Enberg
  -1 siblings, 1 reply; 105+ messages in thread
From: Zhang, Yanmin @ 2009-01-22  9:28 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote:
> On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > > 
> > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> > > 
> > > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > > command line to force an order 0 alloc.
> > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
> > 
> > I checked my instrumentation in kernel and found it's caused by large object allocation/free
> > whose size is more than PAGE_SIZE. Here its order is 1.
> > 
> > The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
> > 
> > So this case isn't the issue that batch of allocation/free might erase partial page
> > functionality.
> 
> So is this the kfree(skb->head) in skb_release_data() or the put_page()
> calls in the same function in a loop?
It's kfree(skb->head).

> 
> If it's the former, with big enough size passed to __alloc_skb(), the
> networking code might be taking a hit from the SLUB page allocator
> pass-through.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-22  9:28                                 ` Zhang, Yanmin
@ 2009-01-22  9:47                                   ` Pekka Enberg
  2009-01-23  3:02                                       ` Zhang, Yanmin
  0 siblings, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-01-22  9:47 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote:
> On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote:
> > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > > > 
> > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> > > > 
> > > > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > > > command line to force an order 0 alloc.
> > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
> > > 
> > > I checked my instrumentation in kernel and found it's caused by large object allocation/free
> > > whose size is more than PAGE_SIZE. Here its order is 1.
> > > 
> > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
> > > 
> > > So this case isn't the issue that batch of allocation/free might erase partial page
> > > functionality.
> > 
> > So is this the kfree(skb->head) in skb_release_data() or the put_page()
> > calls in the same function in a loop?
> It's kfree(skb->head).
> 
> > 
> > If it's the former, with big enough size passed to __alloc_skb(), the
> > networking code might be taking a hit from the SLUB page allocator
> > pass-through.

Do we know what kind of size is being passed to __alloc_skb() in this
case? Maybe we want to do something like this.

		Pekka

SLUB: revert page allocator pass-through

This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB:
direct pass through of page size or higher kmalloc requests").
---

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 2f5c16b..3bd3662 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -124,7 +124,7 @@ struct kmem_cache {
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -135,6 +135,9 @@ static __always_inline int kmalloc_index(size_t size)
 	if (!size)
 		return 0;
 
+	if (size > KMALLOC_MAX_SIZE)
+		return -1;
+
 	if (size <= KMALLOC_MIN_SIZE)
 		return KMALLOC_SHIFT_LOW;
 
@@ -154,10 +157,6 @@ static __always_inline int kmalloc_index(size_t size)
 	if (size <=       1024) return 10;
 	if (size <=   2 * 1024) return 11;
 	if (size <=   4 * 1024) return 12;
-/*
- * The following is only needed to support architectures with a larger page
- * size than 4k.
- */
 	if (size <=   8 * 1024) return 13;
 	if (size <=  16 * 1024) return 14;
 	if (size <=  32 * 1024) return 15;
@@ -167,6 +166,10 @@ static __always_inline int kmalloc_index(size_t size)
 	if (size <= 512 * 1024) return 19;
 	if (size <= 1024 * 1024) return 20;
 	if (size <=  2 * 1024 * 1024) return 21;
+	if (size <=  4 * 1024 * 1024) return 22;
+	if (size <=  8 * 1024 * 1024) return 23;
+	if (size <= 16 * 1024 * 1024) return 24;
+	if (size <= 32 * 1024 * 1024) return 25;
 	return -1;
 
 /*
@@ -191,6 +194,19 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 	if (index == 0)
 		return NULL;
 
+	/*
+	 * This function only gets expanded if __builtin_constant_p(size), so
+	 * testing it here shouldn't be needed.  But some versions of gcc need
+	 * help.
+	 */
+	if (__builtin_constant_p(size) && index < 0) {
+		/*
+		 * Generate a link failure. Would be great if we could
+		 * do something to stop the compile here.
+		 */
+		extern void __kmalloc_size_too_large(void);
+		__kmalloc_size_too_large();
+	}
 	return &kmalloc_caches[index];
 }
 
@@ -204,17 +220,9 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
-static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
-{
-	return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size));
-}
-
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
 	if (__builtin_constant_p(size)) {
-		if (size > PAGE_SIZE)
-			return kmalloc_large(size, flags);
-
 		if (!(flags & SLUB_DMA)) {
 			struct kmem_cache *s = kmalloc_slab(size);
 
diff --git a/mm/slub.c b/mm/slub.c
index 6392ae5..8fad23f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2475,7 +2475,7 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  *		Kmalloc subsystem
  *******************************************************************/
 
-struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
 static int __init setup_slub_min_order(char *str)
@@ -2537,7 +2537,7 @@ panic:
 }
 
 #ifdef CONFIG_ZONE_DMA
-static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1];
+static struct kmem_cache *kmalloc_caches_dma[KMALLOC_SHIFT_HIGH + 1];
 
 static void sysfs_add_func(struct work_struct *w)
 {
@@ -2643,8 +2643,12 @@ static struct kmem_cache *get_slab(size_t size, gfp_t flags)
 			return ZERO_SIZE_PTR;
 
 		index = size_index[(size - 1) / 8];
-	} else
+	} else {
+		if (size > KMALLOC_MAX_SIZE)
+			return NULL;
+
 		index = fls(size - 1);
+	}
 
 #ifdef CONFIG_ZONE_DMA
 	if (unlikely((flags & SLUB_DMA)))
@@ -2658,9 +2662,6 @@ void *__kmalloc(size_t size, gfp_t flags)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large(size, flags);
-
 	s = get_slab(size, flags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2670,25 +2671,11 @@ void *__kmalloc(size_t size, gfp_t flags)
 }
 EXPORT_SYMBOL(__kmalloc);
 
-static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
-{
-	struct page *page = alloc_pages_node(node, flags | __GFP_COMP,
-						get_order(size));
-
-	if (page)
-		return page_address(page);
-	else
-		return NULL;
-}
-
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large_node(size, flags, node);
-
 	s = get_slab(size, flags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2746,11 +2733,8 @@ void kfree(const void *x)
 		return;
 
 	page = virt_to_head_page(x);
-	if (unlikely(!PageSlab(page))) {
-		BUG_ON(!PageCompound(page));
-		put_page(page);
+	if (unlikely(WARN_ON(!PageSlab(page)))) /* XXX */
 		return;
-	}
 	slab_free(page->slab, page, object, _RET_IP_);
 }
 EXPORT_SYMBOL(kfree);
@@ -2985,7 +2969,7 @@ void __init kmem_cache_init(void)
 		caches++;
 	}
 
-	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) {
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
 		create_kmalloc_cache(&kmalloc_caches[i],
 			"kmalloc", 1 << i, GFP_KERNEL);
 		caches++;
@@ -3022,7 +3006,7 @@ void __init kmem_cache_init(void)
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */
-	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
 		kmalloc_caches[i]. name =
 			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
 
@@ -3222,9 +3206,6 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large(size, gfpflags);
-
 	s = get_slab(size, gfpflags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -3238,9 +3219,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large_node(size, gfpflags, node);
-
 	s = get_slab(size, gfpflags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))



^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
       [not found]               ` <588992150B702C48B3312184F1B810AD03A497632C@azsmsx501.amr.corp.intel.com>
@ 2009-01-22 11:29                 ` Jens Axboe
       [not found]                   ` <588992150B702C48B3312184F1B810AD03A4F59632@azsmsx501.amr.corp.intel.com>
  0 siblings, 1 reply; 105+ messages in thread
From: Jens Axboe @ 2009-01-22 11:29 UTC (permalink / raw)
  To: Chilukuri, Harita
  Cc: Matthew Wilcox, Andi Kleen, Andrew Morton, Wilcox, Matthew R, Ma,
	Chinang, linux-kernel, Tripathi, Sharad C, arjan, Siddha,
	Suresh B, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert,
	chris.mason, srostedt, linux-scsi, Andrew Vasquez,
	Anirban Chakraborty

On Wed, Jan 21 2009, Chilukuri, Harita wrote:
> Jens, we work with Matthew on the OLTP workload and have tested the part_stats patch on 2.6.29-rc2. Below are the details:
> 
> Disabling the part_stats has positive impact on the OLTP workload.
> 
> Linux OLTP Performance summary
> Kernel#                           Speedup(x) Intr/s  CtxSw/s us%  sys% idle% iowait%
> 2.6.29-rc2-part_stats                1.000   30329   41716   74    26   0       0
> 2.6.29-rc2-disable-part_stats        1.006   30413   42582   74    25   0       0
> 
> Server configurations:
> Intel Xeon Quad-core 2.0GHz  2 cpus/8 cores/8 threads
> 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)
> 
> 
> ======oprofile CPU_CLK_UNHALTED for top 30 functions
> Cycles% 2.6.29-rc2-part_stats      Cycles% 2.6.29-rc2-disable-part_stats
> 0.9634 qla24xx_intr_handler        1.0372 qla24xx_intr_handler
> 0.9057 copy_user_generic_string    0.7461 qla24xx_wrt_req_reg
> 0.7583 unmap_vmas                  0.7130 kmem_cache_alloc
> 0.6280 qla24xx_wrt_req_reg         0.6876 copy_user_generic_string
> 0.6088 kmem_cache_alloc            0.5656 qla24xx_start_scsi
> 0.5468 clear_page_c                0.4881 __blockdev_direct_IO
> 0.5191 qla24xx_start_scsi          0.4728 try_to_wake_up
> 0.4892 try_to_wake_up              0.4588 unmap_vmas
> 0.4870 __blockdev_direct_IO        0.4360 scsi_request_fn
> 0.4187 scsi_request_fn             0.3711 __switch_to
> 0.3717 __switch_to                 0.3699 aio_complete
> 0.3567 rb_get_reader_page          0.3648 rb_get_reader_page
> 0.3396 aio_complete                0.3597 ring_buffer_consume
> 0.3012 __end_that_request_first    0.3292 memset_c
> 0.2926 memset_c                    0.3076 __list_add
> 0.2926 ring_buffer_consume         0.2771 clear_page_c
> 0.2884 page_remove_rmap            0.2745 task_rq_lock
> 0.2691 disk_map_sector_rcu         0.2733 generic_make_request
> 0.2670 copy_page_c                 0.2555 tcp_sendmsg
> 0.2670 lock_timer_base             0.2529 qla2x00_process_completed_re
> 0.2606 qla2x00_process_completed_re0.2440 e1000_xmit_frame
> 0.2521 task_rq_lock                0.2390 lock_timer_base
> 0.2328 __list_add                  0.2364 qla24xx_queuecommand
> 0.2286 generic_make_request        0.2301 kmem_cache_free
> 0.2286 pick_next_highest_task_rt   0.2262 blk_queue_end_tag
> 0.2136 push_rt_task                0.2262 kref_get
> 0.2115 blk_queue_end_tag           0.2250 push_rt_task
> 0.2115 kmem_cache_free             0.2135 scsi_dispatch_cmd
> 0.2051 e1000_xmit_frame            0.2084 sd_prep_fn
> 0.2051 scsi_device_unbusy          0.2059 kfree

Alright, so that's 0.6%. IIRC, 0.1% (or thereabouts) is significant with
this benchmark, correct? To get a feel for the rest of the accounting
overhead, could you try with this patch that just disables the whole
thing?

diff --git a/block/blk-core.c b/block/blk-core.c
index a824e49..eec9126 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -64,6 +64,7 @@ static struct workqueue_struct *kblockd_workqueue;
 
 static void drive_stat_acct(struct request *rq, int new_io)
 {
+#if 0
 	struct hd_struct *part;
 	int rw = rq_data_dir(rq);
 	int cpu;
@@ -82,6 +83,7 @@ static void drive_stat_acct(struct request *rq, int new_io)
 	}
 
 	part_stat_unlock();
+#endif
 }
 
 void blk_queue_congestion_threshold(struct request_queue *q)
@@ -1014,6 +1017,7 @@ static inline void add_request(struct request_queue *q, struct request *req)
 	__elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
 }
 
+#if 0
 static void part_round_stats_single(int cpu, struct hd_struct *part,
 				    unsigned long now)
 {
@@ -1027,6 +1031,7 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
 	}
 	part->stamp = now;
 }
+#endif
 
 /**
  * part_round_stats() - Round off the performance stats on a struct disk_stats.
@@ -1046,11 +1051,13 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
  */
 void part_round_stats(int cpu, struct hd_struct *part)
 {
+#if 0
 	unsigned long now = jiffies;
 
 	if (part->partno)
 		part_round_stats_single(cpu, &part_to_disk(part)->part0, now);
 	part_round_stats_single(cpu, part, now);
+#endif
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
@@ -1690,6 +1697,7 @@ static int __end_that_request_first(struct request *req, int error,
 				(unsigned long long)req->sector);
 	}
 
+#if 0
 	if (blk_fs_request(req) && req->rq_disk) {
 		const int rw = rq_data_dir(req);
 		struct hd_struct *part;
@@ -1700,6 +1708,7 @@ static int __end_that_request_first(struct request *req, int error,
 		part_stat_add(cpu, part, sectors[rw], nr_bytes >> 9);
 		part_stat_unlock();
 	}
+#endif
 
 	total_bytes = bio_nbytes = 0;
 	while ((bio = req->bio) != NULL) {
@@ -1779,7 +1788,9 @@ static int __end_that_request_first(struct request *req, int error,
  */
 static void end_that_request_last(struct request *req, int error)
 {
+#if 0
 	struct gendisk *disk = req->rq_disk;
+#endif
 
 	if (blk_rq_tagged(req))
 		blk_queue_end_tag(req->q, req);
@@ -1797,6 +1808,7 @@ static void end_that_request_last(struct request *req, int error)
 	 * IO on queueing nor completion.  Accounting the containing
 	 * request is enough.
 	 */
+#if 0
 	if (disk && blk_fs_request(req) && req != &req->q->bar_rq) {
 		unsigned long duration = jiffies - req->start_time;
 		const int rw = rq_data_dir(req);
@@ -1813,6 +1825,7 @@ static void end_that_request_last(struct request *req, int error)
 
 		part_stat_unlock();
 	}
+#endif
 
 	if (req->end_io)
 		req->end_io(req, error);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-22  9:47                                   ` Pekka Enberg
@ 2009-01-23  3:02                                       ` Zhang, Yanmin
  0 siblings, 0 replies; 105+ messages in thread
From: Zhang, Yanmin @ 2009-01-23  3:02 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Thu, 2009-01-22 at 11:47 +0200, Pekka Enberg wrote:
> On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote:
> > On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote:
> > > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> > > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > > > > 
> > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> > > > > 
> > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > > > > command line to force an order 0 alloc.
> > > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> > > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
> > > > 
> > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free
> > > > whose size is more than PAGE_SIZE. Here its order is 1.
> > > > 
> > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
> > > > 
> > > > So this case isn't the issue that batch of allocation/free might erase partial page
> > > > functionality.
> > > 
> > > So is this the kfree(skb->head) in skb_release_data() or the put_page()
> > > calls in the same function in a loop?
> > It's kfree(skb->head).
> > 
> > > 
> > > If it's the former, with big enough size passed to __alloc_skb(), the
> > > networking code might be taking a hit from the SLUB page allocator
> > > pass-through.
> 
> Do we know what kind of size is being passed to __alloc_skb() in this
> case?
In function __alloc_skb, original parameter size=4155,
SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so
__kmalloc_track_caller's parameter size=4696.

>  Maybe we want to do something like this.
> 
> 		Pekka
> 
> SLUB: revert page allocator pass-through
This patch almost fixes the netperf UDP-U-4k issue.

#slabinfo -AD
Name                   Objects    Alloc     Free   %Fast
:0000256                  1658 70350463 70348946  99  99 
kmalloc-8192                31 70322309 70322293  99  99 
:0000168                  2592   143154   140684  93  28 
:0004096                  1456    91072    89644  99  96 
:0000192                  3402    63838    60491  89  11 
:0000064                  6177    49635    43743  98  77 

So kmalloc-8192 appears. Without the patch, kmalloc-8192 is hidden.
kmalloc-8192's default order on my 8-core Stoakley is 2.

1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
2) If I start 1 client and 1 server, and bind them to different physical cpus, SLQB's result
is about 10% better than SLUB's.

I don't know why there is still a 10% difference with item 2). Maybe cache misses cause it?

> 
> This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB:
> direct pass through of page size or higher kmalloc requests").
> ---
> 
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 2f5c16b..3bd3662 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23  3:02                                       ` Zhang, Yanmin
@ 2009-01-23  6:52                                         ` Pekka Enberg
  -1 siblings, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-01-23  6:52 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty, mingo

Zhang, Yanmin wrote:
>>>> If it's the former, with big enough size passed to __alloc_skb(), the
>>>> networking code might be taking a hit from the SLUB page allocator
>>>> pass-through.
>> Do we know what kind of size is being passed to __alloc_skb() in this
>> case?
> In function __alloc_skb, original parameter size=4155,
> SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so
> __kmalloc_track_caller's parameter size=4696.

OK, so all allocations go straight to the page allocator.

> 
>>  Maybe we want to do something like this.
>>
>> SLUB: revert page allocator pass-through
> This patch amost fixes the netperf UDP-U-4k issue.
> 
> #slabinfo -AD
> Name                   Objects    Alloc     Free   %Fast
> :0000256                  1658 70350463 70348946  99  99 
> kmalloc-8192                31 70322309 70322293  99  99 
> :0000168                  2592   143154   140684  93  28 
> :0004096                  1456    91072    89644  99  96 
> :0000192                  3402    63838    60491  89  11 
> :0000064                  6177    49635    43743  98  77 
> 
> So kmalloc-8192 appears. Without the patch, kmalloc-8192 hides.
> kmalloc-8192's default order on my 8-core stoakley is 2.

Christoph, should we merge my patch as-is or do you have an alternative 
fix in mind? We could, of course, increase kmalloc() caches one level up 
to 8192 or higher.

> 
> 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result
> is about 10% better than SLUB's.
> 
> I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?

Maybe we can use the perfstat and/or kerneltop utilities of the new perf 
counters patch to diagnose this:

http://lkml.org/lkml/2009/1/21/273

And do oprofile, of course. Thanks!

		Pekka

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23  6:52                                         ` Pekka Enberg
  (?)
@ 2009-01-23  8:06                                         ` Pekka Enberg
  2009-01-23  8:30                                           ` Zhang, Yanmin
  -1 siblings, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-01-23  8:06 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty, mingo

On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote:
> > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result
> > is about 10% better than SLUB's.
> > 
> > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?
> 
> Maybe we can use the perfstat and/or kerneltop utilities of the new perf 
> counters patch to diagnose this:
> 
> http://lkml.org/lkml/2009/1/21/273
> 
> And do oprofile, of course. Thanks!

I assume binding the client and the server to different physical CPUs
also  means that the SKB is always allocated on CPU 1 and freed on CPU
2? If so, we will be taking the __slab_free() slow path all the time on
kfree() which will cause cache effects, no doubt.

But there's another potential performance hit we're taking because the
object size of the cache is so big. As allocations from CPU 1 keep
coming in, we need to allocate new pages and unfreeze the per-cpu page.
That in turn causes __slab_free() to be more eager to discard the slab
(see the PageSlubFrozen check there).

So before going for cache profiling, I'd really like to see an oprofile
report. I suspect we're still going to see much more page allocator
activity there than with SLAB or SLQB which is why we're still behaving
so badly here.
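
(A minimal way to check that directly, assuming the test kernel is built
with CONFIG_SLUB_STATS -- the counter file names below come from SLUB's
statistics sysfs interface and may vary by kernel version:

	# snapshot the per-cache fast/slow path counters before and
	# after one netperf run
	cd /sys/kernel/slab/kmalloc-8192
	grep . alloc_fastpath alloc_slowpath alloc_slab \
	       free_fastpath free_slowpath free_frozen free_slab

A before/after diff of those counters would show whether the frees really
take __slab_free() and how often whole slabs go back to the page
allocator.)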

		Pekka


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23  8:06                                         ` Pekka Enberg
@ 2009-01-23  8:30                                           ` Zhang, Yanmin
  2009-01-23  8:40                                             ` Pekka Enberg
  2009-01-23  9:46                                             ` Pekka Enberg
  0 siblings, 2 replies; 105+ messages in thread
From: Zhang, Yanmin @ 2009-01-23  8:30 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty, mingo

On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote:
> On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote:
> > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result
> > > is about 10% better than SLUB's.
> > > 
> > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?
> > 
> > Maybe we can use the perfstat and/or kerneltop utilities of the new perf 
> > counters patch to diagnose this:
> > 
> > http://lkml.org/lkml/2009/1/21/273
> > 
> > And do oprofile, of course. Thanks!
> 
> I assume binding the client and the server to different physical CPUs
> also  means that the SKB is always allocated on CPU 1 and freed on CPU
> 2? If so, we will be taking the __slab_free() slow path all the time on
> kfree() which will cause cache effects, no doubt.
> 
> But there's another potential performance hit we're taking because the
> object size of the cache is so big. As allocations from CPU 1 keep
> coming in, we need to allocate new pages and unfreeze the per-cpu page.
> That in turn causes __slab_free() to be more eager to discard the slab
> (see the PageSlubFrozen check there).
> 
> So before going for cache profiling, I'd really like to see an oprofile
> report. I suspect we're still going to see much more page allocator
> activity
Theoretically, it should, but oprofile doesn't show that.

>  there than with SLAB or SLQB which is why we're still behaving
> so badly here.

oprofile output with 2.6.29-rc2-slubrevertlarge:
CPU: Core 2, speed 2666.71 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        app name                 symbol name
132779   32.9951  vmlinux                  copy_user_generic_string
25334     6.2954  vmlinux                  schedule
21032     5.2264  vmlinux                  tg_shares_up
17175     4.2679  vmlinux                  __skb_recv_datagram
9091      2.2591  vmlinux                  sock_def_readable
8934      2.2201  vmlinux                  mwait_idle
8796      2.1858  vmlinux                  try_to_wake_up
6940      1.7246  vmlinux                  __slab_free

#slabinfo -AD
Name                   Objects    Alloc     Free   %Fast
:0000256                  1643  5215544  5214027  94   0 
kmalloc-8192                28  5189576  5189560   0   0 
:0000168                  2631   141466   138976  92  28 
:0004096                  1452    88697    87269  99  96 
:0000192                  3402    63050    59732  89  11 
:0000064                  6265    46611    40721  98  82 
:0000128                  1895    30429    28654  93  32 


oprofile output with kernel 2.6.29-rc2-slqb0121:
CPU: Core 2, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name               app name                 symbol name
114793   28.7163  vmlinux                  vmlinux                  copy_user_generic_string
27880     6.9744  vmlinux                  vmlinux                  tg_shares_up
22218     5.5580  vmlinux                  vmlinux                  schedule
12238     3.0614  vmlinux                  vmlinux                  mwait_idle
7395      1.8499  vmlinux                  vmlinux                  task_rq_lock
7348      1.8382  vmlinux                  vmlinux                  sock_def_readable
7202      1.8016  vmlinux                  vmlinux                  sched_clock_cpu
6981      1.7464  vmlinux                  vmlinux                  __skb_recv_datagram
6566      1.6425  vmlinux                  vmlinux                  udp_queue_rcv_skb



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23  3:02                                       ` Zhang, Yanmin
  (?)
  (?)
@ 2009-01-23  8:33                                       ` Nick Piggin
  2009-01-23  9:02                                         ` Zhang, Yanmin
  -1 siblings, 1 reply; 105+ messages in thread
From: Nick Piggin @ 2009-01-23  8:33 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote:

> 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better
> than SLQB's;

I'll have to look into this too. Could be evidence of the possible
TLB improvement from using bigger pages and/or page-specific freelist,
I suppose.

Do you have a script used to start netperf in that configuration?


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23  8:30                                           ` Zhang, Yanmin
@ 2009-01-23  8:40                                             ` Pekka Enberg
  2009-01-23  9:46                                             ` Pekka Enberg
  1 sibling, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-01-23  8:40 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty, mingo

On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote:
> > I assume binding the client and the server to different physical CPUs
> > also  means that the SKB is always allocated on CPU 1 and freed on CPU
> > 2? If so, we will be taking the __slab_free() slow path all the time on
> > kfree() which will cause cache effects, no doubt.
> > 
> > But there's another potential performance hit we're taking because the
> > object size of the cache is so big. As allocations from CPU 1 keep
> > coming in, we need to allocate new pages and unfreeze the per-cpu page.
> > That in turn causes __slab_free() to be more eager to discard the slab
> > (see the PageSlubFrozen check there).
> > 
> > So before going for cache profiling, I'd really like to see an oprofile
> > report. I suspect we're still going to see much more page allocator
> > activity
> Theoretically, it should, but oprofile doesn't show that.
> 
> > there than with SLAB or SLQB which is why we're still behaving
> > so badly here.
> 
> oprofile output with 2.6.29-rc2-slubrevertlarge:
> CPU: Core 2, speed 2666.71 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        app name                 symbol name
> 132779   32.9951  vmlinux                  copy_user_generic_string
> 25334     6.2954  vmlinux                  schedule
> 21032     5.2264  vmlinux                  tg_shares_up
> 17175     4.2679  vmlinux                  __skb_recv_datagram
> 9091      2.2591  vmlinux                  sock_def_readable
> 8934      2.2201  vmlinux                  mwait_idle
> 8796      2.1858  vmlinux                  try_to_wake_up
> 6940      1.7246  vmlinux                  __slab_free
> 
> #slaninfo -AD
> Name                   Objects    Alloc     Free   %Fast
> :0000256                  1643  5215544  5214027  94   0 
> kmalloc-8192                28  5189576  5189560   0   0 
                                                    ^^^^^^

This looks bit funny. Hmm.

> :0000168                  2631   141466   138976  92  28 
> :0004096                  1452    88697    87269  99  96 
> :0000192                  3402    63050    59732  89  11 
> :0000064                  6265    46611    40721  98  82 
> :0000128                  1895    30429    28654  93  32 



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23  8:33                                       ` Nick Piggin
@ 2009-01-23  9:02                                         ` Zhang, Yanmin
  2009-01-23 18:40                                           ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones
  0 siblings, 1 reply; 105+ messages in thread
From: Zhang, Yanmin @ 2009-01-23  9:02 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

[-- Attachment #1: Type: text/plain, Size: 622 bytes --]

On Fri, 2009-01-23 at 19:33 +1100, Nick Piggin wrote:
> On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote:
> 
> > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better
> > than SLQB's;
> 
> I'll have to look into this too. Could be evidence of the possible
> TLB improvement from using bigger pages and/or page-specific freelist,
> I suppose.
> 
> Do you have a scripted used to start netperf in that configuration?
See the attachment.

Steps to run testing:
1) compile netperf;
2) Change PROG_DIR to path/to/netperf/src;
3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus.


[-- Attachment #2: start_netperf_udp_v4.sh --]
[-- Type: application/x-shellscript, Size: 1361 bytes --]

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23  8:30                                           ` Zhang, Yanmin
  2009-01-23  8:40                                             ` Pekka Enberg
@ 2009-01-23  9:46                                             ` Pekka Enberg
  2009-01-23 15:22                                               ` Christoph Lameter
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-01-23  9:46 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
	linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty, mingo

On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote:
> On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote:
> > On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote:
> > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> > > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result
> > > > is about 10% better than SLUB's.
> > > > 
> > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?
> > > 
> > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf 
> > > counters patch to diagnose this:
> > > 
> > > http://lkml.org/lkml/2009/1/21/273
> > > 
> > > And do oprofile, of course. Thanks!
> > 
> > I assume binding the client and the server to different physical CPUs
> > also  means that the SKB is always allocated on CPU 1 and freed on CPU
> > 2? If so, we will be taking the __slab_free() slow path all the time on
> > kfree() which will cause cache effects, no doubt.
> > 
> > But there's another potential performance hit we're taking because the
> > object size of the cache is so big. As allocations from CPU 1 keep
> > coming in, we need to allocate new pages and unfreeze the per-cpu page.
> > That in turn causes __slab_free() to be more eager to discard the slab
> > (see the PageSlubFrozen check there).
> > 
> > So before going for cache profiling, I'd really like to see an oprofile
> > report. I suspect we're still going to see much more page allocator
> > activity
> Theoretically, it should, but oprofile doesn't show that.

That's a bit surprising, actually. FWIW, I've included a patch for empty
slab lists. But it's probably not going to help here.

> >  there than with SLAB or SLQB which is why we're still behaving
> > so badly here.
> 
> oprofile output with 2.6.29-rc2-slubrevertlarge:
> CPU: Core 2, speed 2666.71 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        app name                 symbol name
> 132779   32.9951  vmlinux                  copy_user_generic_string
> 25334     6.2954  vmlinux                  schedule
> 21032     5.2264  vmlinux                  tg_shares_up
> 17175     4.2679  vmlinux                  __skb_recv_datagram
> 9091      2.2591  vmlinux                  sock_def_readable
> 8934      2.2201  vmlinux                  mwait_idle
> 8796      2.1858  vmlinux                  try_to_wake_up
> 6940      1.7246  vmlinux                  __slab_free
> 
> #slaninfo -AD
> Name                   Objects    Alloc     Free   %Fast
> :0000256                  1643  5215544  5214027  94   0 
> kmalloc-8192                28  5189576  5189560   0   0 
> :0000168                  2631   141466   138976  92  28 
> :0004096                  1452    88697    87269  99  96 
> :0000192                  3402    63050    59732  89  11 
> :0000064                  6265    46611    40721  98  82 
> :0000128                  1895    30429    28654  93  32 

Looking at __slab_free(), unless page->inuse is constantly zero and we
discard the slab, it really is just cache effects (10% sounds like a
lot, though!). AFAICT, the only way to optimize that is with Christoph's
unfinished pointer freelists patches or with a remote free list like in
SLQB.

		Pekka

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 3bd3662..41a4c1a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -48,6 +48,9 @@ struct kmem_cache_node {
 	unsigned long nr_partial;
 	unsigned long min_partial;
 	struct list_head partial;
+	unsigned long nr_empty;
+	unsigned long max_empty;
+	struct list_head empty;
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_t nr_slabs;
 	atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index 8fad23f..5a12597 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -134,6 +134,11 @@
  */
 #define MAX_PARTIAL 10
 
+/*
+ * Maximum number of empty slabs.
+ */
+#define MAX_EMPTY 1
+
 #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
 				SLAB_POISON | SLAB_STORE_USER)
 
@@ -1205,6 +1210,24 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
 	free_slab(s, page);
 }
 
+static void discard_or_cache_slab(struct kmem_cache *s, struct page *page)
+{
+	struct kmem_cache_node *n;
+	int node;
+
+	node = page_to_nid(page);
+	n = get_node(s, node);
+
+	dec_slabs_node(s, node, page->objects);
+
+	if (likely(n->nr_empty >= n->max_empty)) {
+		free_slab(s, page);
+	} else {
+		n->nr_empty++;
+		list_add(&page->lru, &n->partial);
+	}
+}
+
 /*
  * Per slab locking using the pagelock
  */
@@ -1252,7 +1275,7 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
 }
 
 /*
- * Lock slab and remove from the partial list.
+ * Lock slab and remove from the partial or empty list.
  *
  * Must hold list_lock.
  */
@@ -1261,7 +1284,6 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
 {
 	if (slab_trylock(page)) {
 		list_del(&page->lru);
-		n->nr_partial--;
 		__SetPageSlubFrozen(page);
 		return 1;
 	}
@@ -1271,7 +1293,7 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
 /*
  * Try to allocate a partial slab from a specific node.
  */
-static struct page *get_partial_node(struct kmem_cache_node *n)
+static struct page *get_partial_or_empty_node(struct kmem_cache_node *n)
 {
 	struct page *page;
 
@@ -1281,13 +1303,22 @@ static struct page *get_partial_node(struct kmem_cache_node *n)
 	 * partial slab and there is none available then get_partials()
 	 * will return NULL.
 	 */
-	if (!n || !n->nr_partial)
+	if (!n || (!n->nr_partial && !n->nr_empty))
 		return NULL;
 
 	spin_lock(&n->list_lock);
+
 	list_for_each_entry(page, &n->partial, lru)
-		if (lock_and_freeze_slab(n, page))
+		if (lock_and_freeze_slab(n, page)) {
+			n->nr_partial--;
+			goto out;
+		}
+
+	list_for_each_entry(page, &n->empty, lru)
+		if (lock_and_freeze_slab(n, page)) {
+			n->nr_empty--;
 			goto out;
+		}
 	page = NULL;
 out:
 	spin_unlock(&n->list_lock);
@@ -1297,7 +1328,7 @@ out:
 /*
  * Get a page from somewhere. Search in increasing NUMA distances.
  */
-static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
+static struct page *get_any_partial_or_empty(struct kmem_cache *s, gfp_t flags)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
@@ -1336,7 +1367,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 
 		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 				n->nr_partial > n->min_partial) {
-			page = get_partial_node(n);
+			page = get_partial_or_empty_node(n);
 			if (page)
 				return page;
 		}
@@ -1346,18 +1377,19 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 }
 
 /*
- * Get a partial page, lock it and return it.
+ * Get a partial or empty page, lock it and return it.
  */
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *
+get_partial_or_empty(struct kmem_cache *s, gfp_t flags, int node)
 {
 	struct page *page;
 	int searchnode = (node == -1) ? numa_node_id() : node;
 
-	page = get_partial_node(get_node(s, searchnode));
+	page = get_partial_or_empty_node(get_node(s, searchnode));
 	if (page || (flags & __GFP_THISNODE))
 		return page;
 
-	return get_any_partial(s, flags);
+	return get_any_partial_or_empty(s, flags);
 }
 
 /*
@@ -1403,7 +1435,7 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 		} else {
 			slab_unlock(page);
 			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
-			discard_slab(s, page);
+			discard_or_cache_slab(s, page);
 		}
 	}
 }
@@ -1542,7 +1574,7 @@ another_slab:
 	deactivate_slab(s, c);
 
 new_slab:
-	new = get_partial(s, gfpflags, node);
+	new = get_partial_or_empty(s, gfpflags, node);
 	if (new) {
 		c->page = new;
 		stat(c, ALLOC_FROM_PARTIAL);
@@ -1693,7 +1725,7 @@ slab_empty:
 	}
 	slab_unlock(page);
 	stat(c, FREE_SLAB);
-	discard_slab(s, page);
+	discard_or_cache_slab(s, page);
 	return;
 
 debug:
@@ -1927,6 +1959,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
 static void
 init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 {
+	spin_lock_init(&n->list_lock);
+
 	n->nr_partial = 0;
 
 	/*
@@ -1939,8 +1973,18 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 	else if (n->min_partial > MAX_PARTIAL)
 		n->min_partial = MAX_PARTIAL;
 
-	spin_lock_init(&n->list_lock);
 	INIT_LIST_HEAD(&n->partial);
+
+	n->nr_empty = 0;
+	/*
+	 * XXX: This needs to take object size into account. We don't need
+	 * empty slabs for caches which will have plenty of partial slabs
+	 * available. Only caches that have either full or empty slabs need
+	 * this kind of optimization.
+	 */
+	n->max_empty = MAX_EMPTY;
+	INIT_LIST_HEAD(&n->empty);
+
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_set(&n->nr_slabs, 0);
 	atomic_long_set(&n->total_objects, 0);
@@ -2427,6 +2471,32 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 	spin_unlock_irqrestore(&n->list_lock, flags);
 }
 
+static void free_empty_slabs(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+		struct page *page, *t;
+		unsigned long flags;
+
+		n = get_node(s, node);
+
+		if (!n->nr_empty)
+			continue;
+
+		spin_lock_irqsave(&n->list_lock, flags);
+
+		list_for_each_entry_safe(page, t, &n->empty, lru) {
+			list_del(&page->lru);
+			n->nr_empty--;
+
+			free_slab(s, page);
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+}
+
 /*
  * Release all resources used by a slab cache.
  */
@@ -2436,6 +2506,8 @@ static inline int kmem_cache_close(struct kmem_cache *s)
 
 	flush_all(s);
 
+	free_empty_slabs(s);
+
 	/* Attempt to free all objects */
 	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
@@ -2765,6 +2837,7 @@ int kmem_cache_shrink(struct kmem_cache *s)
 		return -ENOMEM;
 
 	flush_all(s);
+	free_empty_slabs(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		n = get_node(s, node);
 



^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23  9:46                                             ` Pekka Enberg
@ 2009-01-23 15:22                                               ` Christoph Lameter
  2009-01-23 15:31                                                 ` Pekka Enberg
  2009-01-24  2:55                                                 ` Zhang, Yanmin
  0 siblings, 2 replies; 105+ messages in thread
From: Christoph Lameter @ 2009-01-23 15:22 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Fri, 23 Jan 2009, Pekka Enberg wrote:

> Looking at __slab_free(), unless page->inuse is constantly zero and we
> discard the slab, it really is just cache effects (10% sounds like a
> lot, though!). AFAICT, the only way to optimize that is with Christoph's
> unfinished pointer freelists patches or with a remote free list like in
> SLQB.

No, there is another way. Increase the allocator order to 3 for the
kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
larger chunks of data gotten from the page allocator. That will allow slub
to do fast allocs.
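
(For reference, a sketch of one way to try that on a running system; the
sysfs attribute is writable when SLUB's sysfs support is built in, though
the range it accepts depends on slub_max_order:

	# switch the kmalloc-8192 cache to order-3 (32KB) slabs
	echo 3 > /sys/kernel/slab/kmalloc-8192/order
	cat /sys/kernel/slab/kmalloc-8192/order
)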


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23 15:22                                               ` Christoph Lameter
@ 2009-01-23 15:31                                                 ` Pekka Enberg
  2009-01-23 15:55                                                   ` Christoph Lameter
  2009-01-24  2:55                                                 ` Zhang, Yanmin
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-01-23 15:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Pekka Enberg wrote:
> 
> > Looking at __slab_free(), unless page->inuse is constantly zero and we
> > discard the slab, it really is just cache effects (10% sounds like a
> > lot, though!). AFAICT, the only way to optimize that is with Christoph's
> > unfinished pointer freelists patches or with a remote free list like in
> > SLQB.
> 
> No there is another way. Increase the allocator order to 3 for the
> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> larger chunks of data gotten from the page allocator. That will allow slub
> to do fast allocs.

I wonder why that doesn't happen already, actually. The slub_max_order
knob is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default, and obviously
order 3 should be as good a fit as order 2, so 'fraction' can't be too high
either. Hmm.

		Pekka


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23 15:31                                                 ` Pekka Enberg
@ 2009-01-23 15:55                                                   ` Christoph Lameter
  2009-01-23 16:01                                                     ` Pekka Enberg
  0 siblings, 1 reply; 105+ messages in thread
From: Christoph Lameter @ 2009-01-23 15:55 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Fri, 23 Jan 2009, Pekka Enberg wrote:

> I wonder why that doesn't happen already, actually. The slub_max_order
> know is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously
> order 3 should be as good fit as order 2 so 'fraction' can't be too high
> either. Hmm.

The kmalloc-8192 is new. Look at slabinfo output to see what allocation
orders are chosen.
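
(For example, reading the order straight out of sysfs, or via the slabinfo
tool built from Documentation/vm/slabinfo.c -- both assume a SLUB kernel
with sysfs support:

	# which order did SLUB pick for the new cache?
	cat /sys/kernel/slab/kmalloc-8192/order
	# the slabinfo tool reports the same, plus objects per slab
	./slabinfo kmalloc-8192
)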


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23 15:55                                                   ` Christoph Lameter
@ 2009-01-23 16:01                                                     ` Pekka Enberg
  0 siblings, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-01-23 16:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Fri, 23 Jan 2009, Pekka Enberg wrote:
> > I wonder why that doesn't happen already, actually. The slub_max_order
> > know is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously
> > order 3 should be as good fit as order 2 so 'fraction' can't be too high
> > either. Hmm.

On Fri, 2009-01-23 at 10:55 -0500, Christoph Lameter wrote:
> The kmalloc-8192 is new. Look at slabinfo output to see what allocation
> orders are chosen.

Yes, yes, I know the new cache is a result of my patch. I'm just saying
that AFAICT, the existing logic should set the order to 3 but IIRC
Yanmin said it's 2.

			Pekka


^ permalink raw reply	[flat|nested] 105+ messages in thread

* care and feeding of netperf (Re: Mainline kernel OLTP performance update)
  2009-01-23  9:02                                         ` Zhang, Yanmin
@ 2009-01-23 18:40                                           ` Rick Jones
  2009-01-23 18:51                                               ` Grant Grundler
  2009-01-24  3:03                                             ` Zhang, Yanmin
  0 siblings, 2 replies; 105+ messages in thread
From: Rick Jones @ 2009-01-23 18:40 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen,
	Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty

> 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus.

Some comments on the script:

> #!/bin/sh
> 
> PROG_DIR=/home/ymzhang/test/netperf/src
> date=`date +%H%M%N`
> #PROG_DIR=/root/netperf/netperf/src
> client_num=$1
> pin_cpu=$2
> 
> start_port_server=12384
> start_port_client=15888
> 
> killall netserver
> ${PROG_DIR}/netserver
> sleep 2

Any particular reason for killing-off the netserver daemon?

> if [ ! -d result ]; then
>         mkdir result
> fi
> 
> all_result_files=""
> for i in `seq 1 ${client_num}`; do
>         if [ "${pin_cpu}" == "pin" ]; then
>                 pin_param="-T ${i} ${i}"

The -T option takes arguments of the form:

N   - bind both netperf and netserver to core N
N,  - bind only netperf to core N, float netserver
  ,M - float netperf, bind only netserver to core M
N,M - bind netperf to core N and netserver to core M

Without a comma between N and M knuth only knows what the command line parser 
will do :)
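
So the pinned case above presumably wants something like this instead
(a hypothetical edit of the quoted script):

                 pin_param="-T ${i},${i}"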

>         fi
>         result_file=result/netperf_${start_port_client}.${date}
>         #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
>         #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096
>         #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} &
>         ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file}  &

Same thing here for the -P option - there needs to be a comma between the two 
port numbers otherwise, the best case is that the second port number is ignored. 
  Worst case is that netperf starts doing knuth only knows what.
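
In other words, that netperf line presumably wants the two port numbers
joined by a comma (a hypothetical correction of the quoted script):

        ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- \
            -P ${start_port_client},${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} &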


To get quick profiles, that form of aggregate netperf is OK - just the one
iteration with background processes using a moderately long run time.  However,
for result reporting, it is best to (ab)use the confidence intervals
functionality to try to avoid skew errors.  I tend to add in a global -i 30
option to get each netperf to repeat its measurements 30 times.  That way one is
reasonably confident that skew issues are minimized.

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance

And I would probably add the -c and -C options to have netperf report service 
demands.
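
Putting those together, the invocation in the loop might look something
like the sketch below -- iteration count, run length and CPU binding are
whatever suits the setup, not prescriptions:

        ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -i 30 -c -C -H 127.0.0.1 ${pin_param} -- \
            -P ${start_port_client},${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} &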


>         sub_pid="${sub_pid} `echo $!`"
>         port_num=$((${port_num}+1))
>         all_result_files="${all_result_files} ${result_file}"
>         start_port_server=$((${start_port_server}+1))
>         start_port_client=$((${start_port_client}+1))
> done;
> 
> wait ${sub_pid}
> killall netserver
> 
> result="0"
> for i in `echo ${all_result_files}`; do
>         sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}`
>         result=`echo "${result}+${sub_result}"|bc`
> done;

The documented-only-in-source :( "omni" tests in top-of-trunk netperf:

http://www.netperf.org/svn/netperf2/trunk

./configure --enable-omni

allow one to specify which result values one wants, in which order, either as 
more or less traditional netperf output (test-specific -O), CSV (test-specific 
-o) or keyval (test-specific -k).  All three take an optional filename as an 
argument with the file containing a list of desired output values.  You can give 
a "filename" of '?' to get the list of output values known to that version of 
netperf.

Might help simplify parsing and whatnot.
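
A hypothetical example, going only by the description above and assuming a
top-of-trunk build configured with --enable-omni:

	# ask this build of netperf which output selectors it knows about
	./netperf -t omni -H 127.0.0.1 -- -k '?'
	# then list the wanted selectors, one per line, in a file and pass it back
	./netperf -t omni -l 60 -H 127.0.0.1 -- -k wanted_outputs.txt

The keyval (-k) form is probably the easiest to pick apart from a shell script.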

happy benchmarking,

rick jones

> 
> echo $result

> 


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance  update)
  2009-01-23 18:40                                           ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones
@ 2009-01-23 18:51                                               ` Grant Grundler
  2009-01-24  3:03                                             ` Zhang, Yanmin
  1 sibling, 0 replies; 105+ messages in thread
From: Grant Grundler @ 2009-01-23 18:51 UTC (permalink / raw)
  To: Rick Jones
  Cc: Zhang, Yanmin, Nick Piggin, Pekka Enberg, Christoph Lameter,
	Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr,
	matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi,
	arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty

On Fri, Jan 23, 2009 at 10:40 AM, Rick Jones <rick.jones2@hp.com> wrote:
...
> And I would probably add the -c and -C options to have netperf report
> service demands.

For performance analysis, the service demand is often more interesting
than the absolute performance (which typically only varies a few Mb/s
for gigE NICs). I strongly encourage adding -c and -C.

grant

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-23 15:22                                               ` Christoph Lameter
  2009-01-23 15:31                                                 ` Pekka Enberg
@ 2009-01-24  2:55                                                 ` Zhang, Yanmin
  2009-01-24  7:36                                                   ` Pekka Enberg
  2009-01-26 17:36                                                   ` Christoph Lameter
  1 sibling, 2 replies; 105+ messages in thread
From: Zhang, Yanmin @ 2009-01-24  2:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Pekka Enberg wrote:
> 
> > Looking at __slab_free(), unless page->inuse is constantly zero and we
> > discard the slab, it really is just cache effects (10% sounds like a
> > lot, though!). AFAICT, the only way to optimize that is with Christoph's
> > unfinished pointer freelists patches or with a remote free list like in
> > SLQB.
> 
> No there is another way. Increase the allocator order to 3 for the
> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> larger chunks of data gotten from the page allocator. That will allow slub
> to do fast allocs.
After I change kmalloc-8192/order to 3, the result (pinned netperf UDP-U-4k) difference
between SLUB and SLQB becomes 1%, which can be considered fluctuation.

But when I tried to increase it to 4, I got:
[root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
[root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
-bash: echo: write error: Invalid argument
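
For reference, a rough sketch of inspecting the cache around such a change (the attribute names are recalled from the SLUB sysfs interface, so treat them as assumptions):

cd /sys/kernel/slab/kmalloc-8192
cat order objs_per_slab object_size   # current order, objects per slab and object size
echo 3 > order                        # accepted; writes above the allowed maximum fail as above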

Compared with SLQB, SLUB seems to need a lot of investigation and manual fine-tuning
against specific benchmarks. One hard part is tuning the page order number. Although SLQB also
has many tuning options, I almost never tune it manually; I just run the benchmark and
collect results to compare. Does that mean the scalability of SLQB is better?



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update)
  2009-01-23 18:40                                           ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones
  2009-01-23 18:51                                               ` Grant Grundler
@ 2009-01-24  3:03                                             ` Zhang, Yanmin
  2009-01-26 18:26                                               ` Rick Jones
  1 sibling, 1 reply; 105+ messages in thread
From: Zhang, Yanmin @ 2009-01-24  3:03 UTC (permalink / raw)
  To: Rick Jones
  Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen,
	Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty

On Fri, 2009-01-23 at 10:40 -0800, Rick Jones wrote:
> > 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus.
> 
> Some comments on the script:
Thanks. I wanted to run the test and get a result quickly, as long as
the result has no big fluctuation.

> 
> > #!/bin/sh
> > 
> > PROG_DIR=/home/ymzhang/test/netperf/src
> > date=`date +%H%M%N`
> > #PROG_DIR=/root/netperf/netperf/src
> > client_num=$1
> > pin_cpu=$2
> > 
> > start_port_server=12384
> > start_port_client=15888
> > 
> > killall netserver
> > ${PROG_DIR}/netserver
> > sleep 2
> 
> Any particular reason for killing-off the netserver daemon?
I'm not sure whether a prior run might affect a later one, so
I just kill netserver first.

> 
> > if [ ! -d result ]; then
> >         mkdir result
> > fi
> > 
> > all_result_files=""
> > for i in `seq 1 ${client_num}`; do
> >         if [ "${pin_cpu}" == "pin" ]; then
> >                 pin_param="-T ${i} ${i}"
> 
> The -T option takes arguments of the form:
> 
> N   - bind both netperf and netserver to core N
> N,  - bind only netperf to core N, float netserver
>   ,M - float netperf, bind only netserver to core M
> N,M - bind netperf to core N and netserver to core M
> 
> Without a comma between N and M knuth only knows what the command line parser 
> will do :)
> 
> >         fi
> >         result_file=result/netperf_${start_port_client}.${date}
> >         #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
> >         #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096
> >         #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} &
> >         ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file}  &
> 
> Same thing here for the -P option - there needs to be a comma between the two 
> port numbers otherwise, the best case is that the second port number is ignored. 
>   Worst case is that netperf starts doing knuth only knows what.
Thanks.

> 
> 
> To get quick profiles, that form of aggregate netperf is OK - just the one 
> iteration with background processes using a moderately long run time.  However, 
> for result reporting, it is best to (ab)use the confidence intervals 
> functionality to try to avoid skew errors.
Yes. My formal testing uses -i 50; here I just wanted a quick test. If I need
finer tuning or more investigation, I will turn on more options.

>   I tend to add-in a global -i 30 
> option to get each netperf to repeat its measurements 30 times.  That way one is 
> reasonably confident that skew issues are minimized.
> 
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance
> 
> And I would probably add the -c and -C options to have netperf report service 
> demands.
Yes, that's good. I usually start vmstat or mpstat to monitor CPU utilization
in real time.

> 
> 
> >         sub_pid="${sub_pid} `echo $!`"
> >         port_num=$((${port_num}+1))
> >         all_result_files="${all_result_files} ${result_file}"
> >         start_port_server=$((${start_port_server}+1))
> >         start_port_client=$((${start_port_client}+1))
> > done;
> > 
> > wait ${sub_pid}
> > killall netserver
> > 
> > result="0"
> > for i in `echo ${all_result_files}`; do
> >         sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}`
> >         result=`echo "${result}+${sub_result}"|bc`
> > done;
> 
> The documented-only-in-source :( "omni" tests in top-of-trunk netperf:
> 
> http://www.netperf.org/svn/netperf2/trunk
> 
> ./configure --enable-omni
> 
> allow one to specify which result values one wants, in which order, either as 
> more or less traditional netperf output (test-specific -O), CSV (test-specific 
> -o) or keyval (test-specific -k).  All three take an optional filename as an 
> argument with the file containing a list of desired output values.  You can give 
> a "filename" of '?' to get the list of output values known to that version of 
> netperf.
> 
> Might help simplify parsing and whatnot.
Yes, it does.

> 
> happy benchmarking,
> 
> rick jones
Thanks again. I learned a lot.

> 
> > 
> > echo $result
> 
> > 


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-24  2:55                                                 ` Zhang, Yanmin
@ 2009-01-24  7:36                                                   ` Pekka Enberg
  2009-02-12  5:22                                                     ` Zhang, Yanmin
  2009-01-26 17:36                                                   ` Christoph Lameter
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-01-24  7:36 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
>> No there is another way. Increase the allocator order to 3 for the
>> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
>> larger chunks of data gotten from the page allocator. That will allow slub
>> to do fast allocs.

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<yanmin_zhang@linux.intel.com> wrote:
> After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k)
> difference between SLUB and SLQB becomes 1% which can be considered as fluctuation.

Great. We should fix calculate_order() to be order 3 for kmalloc-8192.
Are you interested in doing that?

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<yanmin_zhang@linux.intel.com> wrote:
> But when trying to increased it to 4, I got:
> [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
> [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
> -bash: echo: write error: Invalid argument

That's probably because the max order is capped to 3. You can change that
by passing slub_max_order=<n> as a kernel parameter.

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<yanmin_zhang@linux.intel.com> wrote:
> Comparing with SLQB, it seems SLUB needs too many investigation/manual finer-tuning
> against specific benchmarks. One hard is to tune page order number. Although SLQB also
> has many tuning options, I almost doesn't tune it manually, just run benchmark and
> collect results to compare. Does that mean the scalability of SLQB is better?

One thing is sure: SLUB seems to be hard to tune, probably because
it depends so much on the page order.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-24  2:55                                                 ` Zhang, Yanmin
  2009-01-24  7:36                                                   ` Pekka Enberg
@ 2009-01-26 17:36                                                   ` Christoph Lameter
  2009-02-01  2:52                                                     ` Zhang, Yanmin
  1 sibling, 1 reply; 105+ messages in thread
From: Christoph Lameter @ 2009-01-26 17:36 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Sat, 24 Jan 2009, Zhang, Yanmin wrote:

> But when trying to increased it to 4, I got:
> [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
> [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
> -bash: echo: write error: Invalid argument

This is because 4 is more than the maximum allowed order. You can
reconfigure that by setting

slub_max_order=5

or so on boot.
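
A minimal sketch of the whole sequence, assuming GRUB (the file name and kernel line below are only examples):

# append slub_max_order=5 to the kernel command line, e.g. in /boot/grub/grub.conf:
#   kernel /vmlinuz-2.6.29-rc2 ro root=/dev/sda1 ... slub_max_order=5
# after rebooting, the higher order can be set:
cat /sys/kernel/slab/kmalloc-8192/order
echo 4 > /sys/kernel/slab/kmalloc-8192/order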

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update)
  2009-01-24  3:03                                             ` Zhang, Yanmin
@ 2009-01-26 18:26                                               ` Rick Jones
  0 siblings, 0 replies; 105+ messages in thread
From: Rick Jones @ 2009-01-26 18:26 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen,
	Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty

>>To get quick profiles, that form of aggregate netperf is OK - just the one 
>>iteration with background processes using a moderately long run time.  However, 
>>for result reporting, it is best to (ab)use the confidence intervals 
>>functionality to try to avoid skew errors.
> 
> Yes. My formal testing uses -i 50. I just wanted a quick testing. If I need
> finer-tuning or investigation, I would turn on more options.

Netperf will silently clip that to 30 as that is all the built-in tables know.
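
For anyone reusing the script, a hedged sketch of how the global confidence-interval options are usually combined (the particular values are only an example):

# up to 30 iterations (minimum 3), aiming for a 99% confidence level with a
# 5%-wide interval, plus CPU utilization/service demand on both ends
netperf -t UDP_STREAM -H 127.0.0.1 -l 60 -i 30,3 -I 99,5 -c -C -- -m 4096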

> Thanks again. I learned a lot.

Feel free to wander over to netperf-talk over at netperf.org if you want to talk 
some more about the care and feeding of netperf.

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
       [not found]                   ` <588992150B702C48B3312184F1B810AD03A4F59632@azsmsx501.amr.corp.intel.com>
@ 2009-01-27  8:28                     ` Jens Axboe
  0 siblings, 0 replies; 105+ messages in thread
From: Jens Axboe @ 2009-01-27  8:28 UTC (permalink / raw)
  To: Chilukuri, Harita
  Cc: Matthew Wilcox, Andi Kleen, Andrew Morton, Wilcox, Matthew R, Ma,
	Chinang, linux-kernel, Tripathi, Sharad C, arjan, Siddha,
	Suresh B, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert,
	chris.mason, srostedt, linux-scsi, Andrew Vasquez,
	Anirban Chakraborty

On Mon, Jan 26 2009, Chilukuri, Harita wrote:
> Jens, we did test the patch that disables the whole of the stats. We get a 0.5% gain with this patch on 2.6.29-rc2 compared to 2.6.29-rc2-disable_part_stats
> 
> Below is the description of the result:
> 
> Linux OLTP Performance summary
> Kernel#                               Speedup(x) Intr/s  CtxSw/s  us%    sys%    idle% iowait%
> 2.6.29-rc2-disable_partition_stats     1.000    30413   42582    74      25      0       0
> 2.6.29-rc2-disable_all                 1.005    30401   42656    74      25      0       0
> 
> Server configurations:
> Intel Xeon Quad-core 2.0GHz  2 cpus/8 cores/8 threads
> 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)

OK, so about the same, which means the lookup is likely the expensive
bit. I have merged this patch:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=e5b74b703da41fab060adc335a0b98fa5a5ea61d

which exposes an 'iostats' toggle that allows users to disable disk
statistics completely.
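
For completeness, a sketch of flipping that toggle per device once the patch is applied (the sysfs path mirrors the existing queue attributes and should be treated as an assumption here):

cat /sys/block/sda/queue/iostats       # 1 = statistics enabled
echo 0 > /sys/block/sda/queue/iostats  # disable I/O accounting for this disk
echo 1 > /sys/block/sda/queue/iostats  # re-enable it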

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-26 17:36                                                   ` Christoph Lameter
@ 2009-02-01  2:52                                                     ` Zhang, Yanmin
  0 siblings, 0 replies; 105+ messages in thread
From: Zhang, Yanmin @ 2009-02-01  2:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Mon, 2009-01-26 at 12:36 -0500, Christoph Lameter wrote:
> On Sat, 24 Jan 2009, Zhang, Yanmin wrote:
> 
> > But when trying to increased it to 4, I got:
> > [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
> > [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
> > -bash: echo: write error: Invalid argument
> 
> This is because 4 is more than the maximum allowed order. You can
> reconfigure that by setting
> 
> slub_max_order=5
> 
> or so on boot.
With slub_max_order=5, the default order of kmalloc-8192 becomes
5. I tested it with netperf UDP-U-4k and the result difference from
SLAB/SLQB is less than 1%, which is really just fluctuation.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-01-24  7:36                                                   ` Pekka Enberg
@ 2009-02-12  5:22                                                     ` Zhang, Yanmin
  2009-02-12  5:47                                                         ` Zhang, Yanmin
  0 siblings, 1 reply; 105+ messages in thread
From: Zhang, Yanmin @ 2009-02-12  5:22 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote:
> On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> >> No there is another way. Increase the allocator order to 3 for the
> >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> >> larger chunks of data gotten from the page allocator. That will allow slub
> >> to do fast allocs.
> 
> On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
> <yanmin_zhang@linux.intel.com> wrote:
> > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k)
> > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation.
> 
> Great. We should fix calculate_order() to be order 3 for kmalloc-8192.
> Are you interested in doing that?
Pekka,

Sorry for the late update.
The default order of kmalloc-8192 on the 2*4 stoakley machine really is due to an issue in calculate_order.


slab_size	order		name
-------------------------------------------------
4096            3               sgpool-128
8192            2               kmalloc-8192
16384           3               kmalloc-16384

kmalloc-8192's default order is smaller than sgpool-128's.

On a 4*4 tigerton machine, a similar issue appears with another kmem_cache.

Function calculate_order uses 'min_objects /= 2;' to shrink min_objects. Combined with the size
calculation/checking in slab_order, the above issue sometimes appears.
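
As a back-of-the-envelope illustration of the clamp the patch adds (numbers assumed for this 2*4 box: 8 logical CPUs, PAGE_SIZE 4096, slub_max_order 3, object size 8192):

echo $(( 4 * (4 + 1) ))           # fls(8) = 4, so the old starting min_objects is 20
echo $(( (4096 << 3) / 8192 ))    # max_objects = (PAGE_SIZE << slub_max_order)/size = 4
# 20 > 4, so the patched loop starts the search at 4 objects and then steps down by one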

Below patch against 2.6.29-rc2 fixes it.

I checked the default orders of all kmem_caches and none of them becomes smaller than before, so
the patch shouldn't hurt performance.

Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>

---

diff -Nraup linux-2.6.29-rc2/mm/slub.c linux-2.6.29-rc2_slubcalc_order/mm/slub.c
--- linux-2.6.29-rc2/mm/slub.c	2009-02-11 00:49:48.000000000 -0500
+++ linux-2.6.29-rc2_slubcalc_order/mm/slub.c	2009-02-12 00:08:24.000000000 -0500
@@ -1856,6 +1856,7 @@ static inline int calculate_order(int si
 	min_objects = slub_min_objects;
 	if (!min_objects)
 		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+	min_objects = min(min_objects, (PAGE_SIZE << slub_max_order)/size);
 	while (min_objects > 1) {
 		fraction = 16;
 		while (fraction >= 4) {
@@ -1865,7 +1866,7 @@ static inline int calculate_order(int si
 				return order;
 			fraction /= 2;
 		}
-		min_objects /= 2;
+		min_objects --;
 	}
 
 	/*



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-02-12  5:22                                                     ` Zhang, Yanmin
@ 2009-02-12  5:47                                                         ` Zhang, Yanmin
  0 siblings, 0 replies; 105+ messages in thread
From: Zhang, Yanmin @ 2009-02-12  5:47 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Thu, 2009-02-12 at 13:22 +0800, Zhang, Yanmin wrote:
> On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote:
> > On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> > >> No there is another way. Increase the allocator order to 3 for the
> > >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> > >> larger chunks of data gotten from the page allocator. That will allow slub
> > >> to do fast allocs.
> > 
> > On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
> > <yanmin_zhang@linux.intel.com> wrote:
> > > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k)
> > > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation.
> > 
> > Great. We should fix calculate_order() to be order 3 for kmalloc-8192.
> > Are you interested in doing that?
> Pekka,
> 
> Sorry for the late update.
> The default order of kmalloc-8192 on 2*4 stoakley is really an issue of calculate_order.
Oh, the previous patch has a compile warning. Please use the patch below.

From: Zhang Yanmin <yanmin.zhang@linux.intel.com>

The default order of kmalloc-8192 on the 2*4 stoakley machine is due to an issue in calculate_order.


slab_size       order           name
-------------------------------------------------
4096            3               sgpool-128
8192            2               kmalloc-8192
16384           3               kmalloc-16384

kmalloc-8192's default order is smaller than sgpool-128's.

On a 4*4 tigerton machine, a similar issue appears with another kmem_cache.

Function calculate_order uses 'min_objects /= 2;' to shrink min_objects. Combined with the size
calculation/checking in slab_order, the above issue sometimes appears.

Below patch against 2.6.29-rc2 fixes it.

I checked the default orders of all kmem_caches and none of them becomes smaller than before, so
the patch shouldn't hurt performance.

Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>

---

--- linux-2.6.29-rc2/mm/slub.c	2009-02-11 00:49:48.000000000 -0500
+++ linux-2.6.29-rc2_slubcalc_order/mm/slub.c	2009-02-12 00:47:52.000000000 -0500
@@ -1844,6 +1844,7 @@ static inline int calculate_order(int si
 	int order;
 	int min_objects;
 	int fraction;
+	int max_objects;
 
 	/*
 	 * Attempt to find best configuration for a slab. This
@@ -1856,6 +1857,9 @@ static inline int calculate_order(int si
 	min_objects = slub_min_objects;
 	if (!min_objects)
 		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+	max_objects = (PAGE_SIZE << slub_max_order)/size;
+	min_objects = min(min_objects, max_objects);
+
 	while (min_objects > 1) {
 		fraction = 16;
 		while (fraction >= 4) {
@@ -1865,7 +1869,7 @@ static inline int calculate_order(int si
 				return order;
 			fraction /= 2;
 		}
-		min_objects /= 2;
+		min_objects --;
 	}
 
 	/*



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-02-12  5:47                                                         ` Zhang, Yanmin
  (?)
@ 2009-02-12 15:25                                                         ` Christoph Lameter
  2009-02-12 16:07                                                           ` Pekka Enberg
  -1 siblings, 1 reply; 105+ messages in thread
From: Christoph Lameter @ 2009-02-12 15:25 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Thu, 12 Feb 2009, Zhang, Yanmin wrote:

> The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order.
>
>
> slab_size       order           name
> -------------------------------------------------
> 4096            3               sgpool-128
> 8192            2               kmalloc-8192
> 16384           3               kmalloc-16384
>
> kmalloc-8192's default order is smaller than sgpool-128's.

You reverted the page allocator passthrough patch before this, right?
Otherwise kmalloc-8192 should not exist, and allocation calls for 8192
bytes would be converted inline to a request for an order-1 page from the
page allocator.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-02-12  5:47                                                         ` Zhang, Yanmin
  (?)
  (?)
@ 2009-02-12 16:03                                                         ` Pekka Enberg
  -1 siblings, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-02-12 16:03 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote:
> > > On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> > > >> No there is another way. Increase the allocator order to 3 for the
> > > >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> > > >> larger chunks of data gotten from the page allocator. That will allow slub
> > > >> to do fast allocs.
> > > 
> > > On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
> > > <yanmin_zhang@linux.intel.com> wrote:
> > > > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k)
> > > > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation.
> > > 
> > > Great. We should fix calculate_order() to be order 3 for kmalloc-8192.
> > > Are you interested in doing that?

On Thu, 2009-02-12 at 13:22 +0800, Zhang, Yanmin wrote:
> > Pekka,
> > 
> > Sorry for the late update.
> > The default order of kmalloc-8192 on 2*4 stoakley is really an issue of calculate_order.

On Thu, 2009-02-12 at 13:47 +0800, Zhang, Yanmin wrote:
> Oh, previous patch has a compiling warning. Pls. use below patch.
> 
> From: Zhang Yanmin <yanmin.zhang@linux.intel.com>
> 
> The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order.

Applied to the 'topic/slub/perf' branch. Thanks!

			Pekka


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Mainline kernel OLTP performance update
  2009-02-12 15:25                                                         ` Christoph Lameter
@ 2009-02-12 16:07                                                           ` Pekka Enberg
  0 siblings, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-02-12 16:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

Hi Christoph,

On Thu, 12 Feb 2009, Zhang, Yanmin wrote:
>> The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order.
>>
>>
>> slab_size       order           name
>> -------------------------------------------------
>> 4096            3               sgpool-128
>> 8192            2               kmalloc-8192
>> 16384           3               kmalloc-16384
>>
>> kmalloc-8192's default order is smaller than sgpool-128's.

On Thu, Feb 12, 2009 at 5:25 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> You reverted the page allocator passthrough patch before this right?
> Otherwise kmalloc-8192 should not exist and allocation calls for 8192
> bytes would be converted inline to request of an order 1 page from the
> page allocator.

Yup, I assume that's the case here.

^ permalink raw reply	[flat|nested] 105+ messages in thread

Thread overview: 105+ messages
2009-01-13 21:10 Mainline kernel OLTP performance update Ma, Chinang
2009-01-13 22:44 ` Wilcox, Matthew R
2009-01-15  0:35   ` Andrew Morton
2009-01-15  1:21     ` Matthew Wilcox
2009-01-15  2:04       ` Andrew Morton
2009-01-15  2:27         ` Steven Rostedt
2009-01-15  7:11           ` Ma, Chinang
2009-01-15  7:11             ` Ma, Chinang
2009-01-19 18:04             ` Chris Mason
2009-01-19 18:04               ` Chris Mason
2009-01-19 18:37               ` Steven Rostedt
2009-01-19 18:37               ` Steven Rostedt
2009-01-19 18:55                 ` Chris Mason
2009-01-19 18:55                   ` Chris Mason
2009-01-19 19:07                   ` Steven Rostedt
2009-01-19 19:07                     ` Steven Rostedt
2009-01-19 23:40                 ` Ingo Molnar
2009-01-19 23:40                   ` Ingo Molnar
2009-01-15  2:39         ` Andi Kleen
2009-01-15  2:47           ` Matthew Wilcox
2009-01-15  3:36             ` Andi Kleen
2009-01-20 13:27             ` Jens Axboe
     [not found]               ` <588992150B702C48B3312184F1B810AD03A497632C@azsmsx501.amr.corp.intel.com>
2009-01-22 11:29                 ` Jens Axboe
     [not found]                   ` <588992150B702C48B3312184F1B810AD03A4F59632@azsmsx501.amr.corp.intel.com>
2009-01-27  8:28                     ` Jens Axboe
2009-01-15  7:24         ` Nick Piggin
2009-01-15  9:46           ` Pekka Enberg
2009-01-15 13:52             ` Matthew Wilcox
2009-01-15 14:42               ` Pekka Enberg
2009-01-16 10:16               ` Pekka Enberg
2009-01-16 10:21                 ` Nick Piggin
2009-01-16 10:31                   ` Pekka Enberg
2009-01-16 10:42                     ` Nick Piggin
2009-01-16 10:55                       ` Pekka Enberg
2009-01-19  7:13                         ` Nick Piggin
2009-01-19  8:05                           ` Pekka Enberg
2009-01-19  8:33                             ` Nick Piggin
2009-01-19  8:42                               ` Nick Piggin
2009-01-19  8:47                                 ` Pekka Enberg
2009-01-19  8:57                                   ` Nick Piggin
2009-01-19  9:48                               ` Pekka Enberg
2009-01-19 10:03                                 ` Nick Piggin
2009-01-16 20:59                     ` Christoph Lameter
2009-01-16  0:27           ` Andrew Morton
2009-01-16  4:03             ` Nick Piggin
2009-01-16  4:12               ` Andrew Morton
2009-01-16  6:46                 ` Nick Piggin
2009-01-16  6:55                   ` Matthew Wilcox
2009-01-16  7:06                     ` Nick Piggin
2009-01-16  7:53                     ` Zhang, Yanmin
2009-01-16 10:20                       ` Andi Kleen
2009-01-20  5:16                         ` Zhang, Yanmin
2009-01-21 23:58                           ` Christoph Lameter
2009-01-22  8:36                             ` Zhang, Yanmin
2009-01-22  9:15                               ` Pekka Enberg
2009-01-22  9:15                                 ` Pekka Enberg
2009-01-22  9:28                                 ` Zhang, Yanmin
2009-01-22  9:47                                   ` Pekka Enberg
2009-01-23  3:02                                     ` Zhang, Yanmin
2009-01-23  3:02                                       ` Zhang, Yanmin
2009-01-23  6:52                                       ` Pekka Enberg
2009-01-23  6:52                                         ` Pekka Enberg
2009-01-23  8:06                                         ` Pekka Enberg
2009-01-23  8:30                                           ` Zhang, Yanmin
2009-01-23  8:40                                             ` Pekka Enberg
2009-01-23  9:46                                             ` Pekka Enberg
2009-01-23 15:22                                               ` Christoph Lameter
2009-01-23 15:31                                                 ` Pekka Enberg
2009-01-23 15:55                                                   ` Christoph Lameter
2009-01-23 16:01                                                     ` Pekka Enberg
2009-01-24  2:55                                                 ` Zhang, Yanmin
2009-01-24  7:36                                                   ` Pekka Enberg
2009-02-12  5:22                                                     ` Zhang, Yanmin
2009-02-12  5:47                                                       ` Zhang, Yanmin
2009-02-12  5:47                                                         ` Zhang, Yanmin
2009-02-12 15:25                                                         ` Christoph Lameter
2009-02-12 16:07                                                           ` Pekka Enberg
2009-02-12 16:03                                                         ` Pekka Enberg
2009-01-26 17:36                                                   ` Christoph Lameter
2009-02-01  2:52                                                     ` Zhang, Yanmin
2009-01-23  8:33                                       ` Nick Piggin
2009-01-23  9:02                                         ` Zhang, Yanmin
2009-01-23 18:40                                           ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones
2009-01-23 18:51                                             ` Grant Grundler
2009-01-23 18:51                                               ` Grant Grundler
2009-01-24  3:03                                             ` Zhang, Yanmin
2009-01-26 18:26                                               ` Rick Jones
2009-01-16  7:00                   ` Mainline kernel OLTP performance update Andrew Morton
2009-01-16  7:25                     ` Nick Piggin
2009-01-16  8:59                     ` Nick Piggin
2009-01-16 18:11                   ` Rick Jones
2009-01-19  7:43                     ` Nick Piggin
2009-01-19 22:19                       ` Rick Jones
2009-01-15 14:12         ` James Bottomley
2009-01-15 17:44           ` Andrew Morton
2009-01-15 18:00             ` Matthew Wilcox
2009-01-15 18:14               ` Steven Rostedt
2009-01-15 18:44                 ` Gregory Haskins
2009-01-15 18:46                   ` Wilcox, Matthew R
2009-01-15 18:46                     ` Wilcox, Matthew R
2009-01-15 19:44                     ` Ma, Chinang
2009-01-16 18:14                       ` Gregory Haskins
2009-01-16 19:09                         ` Steven Rostedt
2009-01-20 12:45                         ` Gregory Haskins
2009-01-15 19:28                 ` Ma, Chinang
2009-01-15 16:48       ` Ma, Chinang
